Spider Settings


You can, and should, optimise your spider. First, the spider results are the foundation of all further tests: audits are only as good as their input. Second, testing irrelevant or duplicate content is wasteful. Each page must first be loaded by our headless browser, with time allowed for all assets to load (images, CSS and video files are excluded). This can take as long as 10 seconds per page. That does not sound like a lot, but crawls typically contain thousands or tens of thousands of pages, so it soon adds up. Note that you can audit the results of a spider as often as you wish at no cost. Billing, however, is based on your spider size. So spiders are important!
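To put the "it soon adds up" point into numbers, here is a rough back-of-the-envelope sketch. It assumes the worst case of 10 seconds per page quoted above and that pages are loaded one after another; both are simplifying assumptions, and the function name is purely illustrative. Real crawls may well finish faster.

```python
def worst_case_crawl_hours(pages, seconds_per_page=10):
    """Upper-bound crawl time in hours, assuming pages load sequentially."""
    return pages * seconds_per_page / 3600


for pages in (1_000, 10_000, 50_000):
    hours = worst_case_crawl_hours(pages)
    print(f"{pages:>6} pages: up to {hours:.1f} hours")
```

Under these assumptions, halving the number of URLs in a spider halves the worst-case load time, which is why removing duplicates matters.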

In this context, optimisation means generating a representative spider in the most efficient way. That means crawling your entire website content for the most complete picture, but only considering “uniques”, i.e. one URL per distinct page. For example, avoid collecting multiple URLs for the same page: /productX/?match=black and /productX/?match=white are usually the same page. Also, unless older content is valuable to you, avoid spidering it, e.g. exclude /2009/, /archives etc. If you have an XML sitemap, use it to guide the spider.

Default Settings – Strip Query Parameters

The most common optimisation technique is to strip query parameters. Any URLs that differ only in the value of a stripped parameter are then treated as the same page, and the spider ignores the duplicates. Following on from the above example, if you strip the parameter match, only one URL will be spidered no matter how many different match values you have on your website.
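The effect of stripping a parameter can be illustrated with a minimal sketch using Python's standard library. This is not the spider's actual implementation; the function name and the example.com URLs are hypothetical, and it only shows how two URLs collapse into one once match is removed.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_query_params(url, strip):
    """Return url with every query parameter named in `strip` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in strip
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "https://example.com/productX/?match=black",
    "https://example.com/productX/?match=white",
]

# A set keeps only distinct URLs, so duplicates disappear after stripping.
unique = {strip_query_params(u, {"match"}) for u in urls}
print(unique)
```

Both inputs normalise to the same URL, so only one page would need to be loaded.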

See also:  Are GA hits blocked when a spider runs?

By default the Strip query parameters field excludes the following from all crawls (i.e. you do not need to add them):

campaign
ccm_token
color
colour
dir
filter
filter-color
filter-colour
filter_color
filter_colour
height
mc_id
mkt_tok
notso
order
order-by
orderby
order_by
print
referrer
render
replytocom
size
sort
sort-by
sortby
sort_by
url
width
_ke
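
If you keep your own list of parameters to strip, a small check like the following can show which ones are already covered by the defaults above. It is only a sketch: the helper name is hypothetical, and it assumes parameter names are matched exactly (case-sensitively), which this article does not confirm.

```python
# Default stripped parameters, copied from the list above.
DEFAULT_STRIPPED = frozenset({
    "campaign", "ccm_token", "color", "colour", "dir", "filter",
    "filter-color", "filter-colour", "filter_color", "filter_colour",
    "height", "mc_id", "mkt_tok", "notso", "order", "order-by",
    "orderby", "order_by", "print", "referrer", "render", "replytocom",
    "size", "sort", "sort-by", "sortby", "sort_by", "url", "width", "_ke",
})


def params_to_add(candidates):
    """Return the candidates not already stripped by default, in order."""
    return [name for name in candidates if name not in DEFAULT_STRIPPED]


# "sort" and "width" are defaults; "match" and "sessionid" are not.
print(params_to_add(["sort", "match", "width", "sessionid"]))
```

Only the names printed would need to be entered in the Strip query parameters field.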

Tip – Strip All Query Parameters

If you wish to strip ALL query parameters from crawled URLs, you can use the Keep query parameters field instead. This avoids having to create an unmanageable strip parameter list. Leave the Strip query parameters field empty and enter a single dummy name (e.g. dummyName) in the Keep query parameters field. Because the dummy name does not match any real parameter, all other parameters are excluded.
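
The reason the dummy name works can be seen in a short sketch of allow-list behaviour. It assumes the Keep query parameters field retains only the named parameters and drops everything else, as described above; the function name and the example URL are hypothetical and this is not the spider's own code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def keep_query_params(url, keep):
    """Return url retaining only the query parameters named in `keep`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://example.com/productX/?match=black&page=2&ref=home"

# No parameter is called dummyName, so nothing survives the allow-list.
print(keep_query_params(url, {"dummyName"}))
```

Since nothing in the URL is named dummyName, the query string comes back empty, which is the same outcome as stripping every parameter individually.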


