On 21/10/2008, at 4:30 PM, Perry Lorier wrote:
Often caching is applied to queries, since multiple people generally end up making similar, or even identical, queries. However, a bot generating millions of these queries can quickly fill up caches, forcing expiry of "old" (but not yet stale) entries from the cache (or, if the code wasn't written sufficiently well, filling up the disk the cache is stored on).
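As a minimal sketch of the effect Perry describes (not anyone's actual setup; the class and query names are purely illustrative), a bounded LRU query cache loses its warm, still-valid entries as soon as a crawler floods it with distinct queries:

    from collections import OrderedDict

    class QueryCache:
        """Tiny LRU cache keyed by query string."""
        def __init__(self, max_entries=1000):
            self.max_entries = max_entries
            self.entries = OrderedDict()

        def get(self, query):
            if query in self.entries:
                self.entries.move_to_end(query)   # mark as recently used
                return self.entries[query]
            return None

        def put(self, query, result):
            self.entries[query] = result
            self.entries.move_to_end(query)
            if len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)  # evict least recently used

    cache = QueryCache(max_entries=1000)

    # Normal users repeat a small pool of popular queries, so the cache stays warm.
    for i in range(5000):
        cache.put("popular-query-%d" % (i % 50), "result")

    # A bot issuing thousands of distinct queries pushes those popular
    # (not yet stale) entries out long before they would have expired.
    for i in range(5000):
        cache.put("crawler-query-%d" % i, "result")

    print("popular-query-1 still cached?", cache.get("popular-query-1") is not None)

The last line prints False: the useful entries were evicted by cache pressure, not because they went stale.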
If you read the FAQ from the National Library, it does say: "In practical terms, this means webmasters can expect the harvester to work in bursts, taking 100 URLs from each website before moving on the next. Eventually the harvester will cycle back around to collect the next 100 URLs from the site. The exceptions to this are Government, Research, and Maori sites (.govt.nz, .ac.nz, .cri.nz and .maori.nz) where we harvest 500 URLs at a time."

Which means you can expect to see only 100 pages requested at a time, then some time for your 286 to recover before the next 100 requests come along. This should resolve any worries about the crawler crapping all over the performance of your site(s).

Cheers,
Patrick