You can, and should, optimise your spider. Firstly, the spider results are the foundation of all further tests: audits are only as good as their input. Secondly, testing irrelevant or duplicate content is wasteful. Each page must first be loaded by our headless browser, with time allowed for its assets to load (images, CSS and video files are excluded). This can take as long as 10 seconds per page. That may not sound like a lot, but crawls typically contain thousands or tens of thousands of pages, so it soon adds up. Note that you can audit the results of a spider as often as you wish without any cost. However, billing is based on your spider size. So spiders are important!
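To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case quoted above (10 seconds per page) and, purely for illustration, that pages are loaded one at a time. The page counts are hypothetical, and the real spider may well be faster.

```python
# Rough worst-case estimate: 10 seconds per page, pages loaded one after another.
# Sequential loading and the page counts below are illustrative assumptions only.
SECONDS_PER_PAGE = 10


def crawl_hours(pages: int, seconds_per_page: int = SECONDS_PER_PAGE) -> float:
    """Return the estimated total page-load time in hours."""
    return pages * seconds_per_page / 3600


for pages in (1_000, 10_000, 50_000):
    print(f"{pages:>6} pages: about {crawl_hours(pages):.1f} hours")
```

Even under these simplified assumptions, removing a few thousand duplicate pages from a crawl saves hours of load time.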
In this context, optimisation means generating a representative spider in the most efficient way. That means crawling your entire website content for the most complete picture, but considering only unique pages (“uniques”). For example, avoid collecting multiple URLs for the same page: /productX/?match=black and /productX/?match=white are usually the same page. Also, unless older content is valuable to you, avoid spidering it, e.g. exclude /2009/, /archives, etc. If you have an XML sitemap, use it to guide the spider.
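The idea of excluding older content can be sketched as a simple path filter. This is only an illustration of the principle, not the spider's own configuration syntax: the function name is_excluded, the EXCLUDED_SEGMENTS set and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical exclusion list based on the /2009/ and /archives examples above.
# The spider's real exclusion settings may use a different syntax.
EXCLUDED_SEGMENTS = {"2009", "archives"}


def is_excluded(url: str, excluded=EXCLUDED_SEGMENTS) -> bool:
    """Return True if any whole path segment of the URL is in the exclusion set."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(s in excluded for s in segments)


urls = [
    "https://example.com/2009/05/old-post/",
    "https://example.com/archives",
    "https://example.com/productX/",
]

# Only URLs outside the excluded sections remain candidates for crawling.
to_crawl = [u for u in urls if not is_excluded(u)]
```

Matching on whole path segments, rather than on substrings, avoids accidentally excluding a page such as /blog/2009-review/.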
Default Settings: QUERY PARAMETERS – STRIP
The most common optimisation technique is to exclude URLs that differ only by query parameter values. Any URL that differs only by the value of an excluded parameter is treated as a duplicate page, and the spider ignores duplicates. Following on from the above example, if you exclude the parameter match, then only one URL will be spidered no matter how many different match values you have on your website.
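The effect of stripping a parameter can be modelled in a few lines of Python. This is a minimal sketch of the deduplication idea, not the spider's actual implementation: strip_params is a hypothetical helper, and the example.com URLs and the page parameter are invented for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, strip: set[str]) -> str:
    """Return the URL with every query parameter named in `strip` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in strip
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "https://example.com/productX/?match=black",
    "https://example.com/productX/?match=white",
    "https://example.com/productX/?match=white&page=2",
]

# URLs that become identical once "match" is stripped collapse into one entry.
unique = sorted({strip_params(u, {"match"}) for u in urls})
```

Here the first two URLs collapse into a single page, while the third survives as a separate URL because it still differs by a parameter that was not stripped.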
By default the QUERY PARAMETERS – STRIP field excludes the following from all crawls (i.e. you do not need to add them):
Tip – Strip All Query Parameters
If you wish to strip ALL query parameters from crawled URLs, you can use the QUERY PARAMETERS – KEEP field instead. This avoids having to maintain an unmanageable list of parameters to strip. Within the QUERY PARAMETERS – KEEP field, simply enter a dummy name such as dummyName (the strip field should be empty). Because no real parameter matches the dummy name, all parameters are excluded.
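Why a dummy name strips everything can be seen from a small model of a keep list. This sketch assumes that the keep field retains only the named parameters and drops all others, as the tip describes. keep_params is a hypothetical helper and does not represent the spider's real code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url: str, keep: set[str]) -> str:
    """Return the URL with only the query parameters named in `keep` retained."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://example.com/productX/?match=black&page=2"

# No real parameter is called "dummyName", so nothing is kept.
stripped = keep_params(url, {"dummyName"})
```

With the dummy name as the only entry, stripped is the bare page URL. Replacing it with a real parameter name, e.g. page, would retain that parameter and drop the rest.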