Jasper Bryant-Greene wrote:
On Tue, 21 Oct 2008 14:29:44 +1400, "Alex Hague" wrote:

I'm from a small (but growing) Kiwi website that has community-generated content etc. I think there is an additional point that has been missed so far: the web is no longer static. The uncertainty principle begins to apply: by crawling entire sites, they may inadvertently begin to interact with the content on those sites.
As far as I'm aware, the big crawlers don't perform POST, PUT or DELETE requests.
Seeing as HTTP requires GET to be safe and idempotent, taking no action other than retrieval, crawlers won't "interact" with well-designed websites, if by "interact" you mean "change stuff".
Web crawlers can wander around repeatedly grabbing dynamic content that is expensive to generate. This content may be built from database queries and require quite a bit of CPU and/or memory to convert into a form that's usable in a browser. A user grabs a few pages of this slow dynamic content to answer whatever question they have; a crawler, because there are potentially infinite ways of presenting the data, may inadvertently start using up a lot of RAM and CPU.

Often caching is applied to queries, since multiple people generally end up making similar, or even exactly the same, queries. However, a bot generating millions of these queries can quickly fill up the cache, forcing expiry of "old" (but not yet stale) entries, or, if the code wasn't written sufficiently well, filling up the disk the cache is stored on.

Usually you protect such content with robots.txt and robots meta tags telling robots to stay away. I believe my sites as currently set up can handle this kind of load, but in the past it has been a problem. Keeping robots out of an area may be for the robots' (and the server's) protection, not an attempt to "hide" the content. (Although I agree that people making sites that can change state with a GET are asking for trouble.)
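For concreteness, this is roughly the kind of rule I mean (the paths here are made up for the example and would vary per site):

    # robots.txt at the site root: keep compliant crawlers out of
    # expensive, endlessly-parameterised query pages
    User-agent: *
    Disallow: /search
    Disallow: /compare/

plus, for individual pages that shouldn't be indexed or have their links followed, a meta tag in the document head:

    <meta name="robots" content="noindex, nofollow">

Of course both of these only help against well-behaved robots; badly behaved ones ignore them entirely.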