On 21/10/2008, at 4:30 PM, Perry Lorier wrote:
Often caching is applied to queries, since multiple people generally end up making similar, or even identical, queries. However, a bot generating millions of these queries can quickly fill up caches, forcing expiry of "old" (but not yet stale) entries from the cache (or, if the code wasn't written sufficiently well, filling up the disk the cache is stored on).
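As a minimal sketch of the effect Perry describes (not anyone's actual setup; the class and query names are purely illustrative), a bounded LRU query cache loses its warm, still-valid entries as soon as a crawler floods it with distinct queries:

    from collections import OrderedDict

    class QueryCache:
        """Tiny LRU cache keyed by query string."""
        def __init__(self, max_entries=1000):
            self.max_entries = max_entries
            self.entries = OrderedDict()

        def get(self, query):
            if query in self.entries:
                self.entries.move_to_end(query)   # mark as recently used
                return self.entries[query]
            return None

        def put(self, query, result):
            self.entries[query] = result
            self.entries.move_to_end(query)
            if len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)  # evict least recently used

    cache = QueryCache(max_entries=1000)

    # Normal users repeat a small pool of popular queries, so the cache stays warm.
    for i in range(5000):
        cache.put("popular-query-%d" % (i % 50), "result")

    # A bot issuing thousands of distinct queries pushes those popular
    # (not yet stale) entries out long before they would have expired.
    for i in range(5000):
        cache.put("crawler-query-%d" % i, "result")

    print("popular-query-1 still cached?", cache.get("popular-query-1") is not None)

The last line prints False: the useful entries were evicted by cache pressure, not because they went stale.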
If you read the FAQ from the National Library, it does say: "In practical terms, this means webmasters can expect the harvester to work in bursts, taking 100 URLs from each website before moving on the next. Eventually the harvester will cycle back around to collect the next 100 URLs from the site. The exceptions to this are Government, Research, and Maori sites (.govt.nz, .ac.nz, .cri.nz and .maori.nz) where we harvest 500 URLs at a time."

Which means you can expect to see only 100 pages requested at a time, then some time for your 286 to recover before the next 100 requests come along. This should resolve any worries about the crawler crapping all over the performance of your site(s).

Cheers,
Patrick