On Tue, 21 Oct 2008 14:29:44 +1400, "Alex Hague" wrote:
> I'm from a small (but growing) Kiwi website with community-generated content. I think there is an additional point that has been missed so far: the web is no longer static. The uncertainty principle begins to apply: by crawling entire sites, crawlers may inadvertently begin to interact with the content on those sites.
As far as I'm aware, the big crawlers don't issue POST, PUT, or DELETE requests. Since HTTP defines GET as a safe method, one that must not take any action other than retrieval (it happens to be idempotent as well), crawlers won't "interact" with well-designed websites, if by "interact" you mean "change stuff".
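To make that concrete, here is a minimal sketch, using Python's built-in http.server, of what "well-designed" means here (the /flag/ path and the in-memory store are made up for illustration): retrieval lives behind GET and the state change lives behind POST, so a GET-only crawler can fetch every URL on the site without flagging anything.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    FLAGGED = set()  # hypothetical in-memory store of flagged item ids

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # GET is "safe": retrieval only, no side effects, so a
            # crawler that fetches every URL cannot change any state.
            body = repr(sorted(FLAGGED)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_POST(self):
            # The state change lives behind POST, which the big
            # crawlers don't send, so content can only be flagged
            # by a deliberate form submission.
            if self.path.startswith("/flag/"):
                FLAGGED.add(self.path[len("/flag/"):])
            self.send_response(204)
            self.end_headers()

    HTTPServer(("", 8000), Handler).serve_forever()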
> For example, there can be links to flag content as inappropriate. We use robots.txt to prevent crawlers from hitting that kind of link, as well as from indexing our APIs, which return XML or JSON and are of no use to a crawler (but which they seem to love indexing).
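For what it's worth, that kind of exclusion is only a few lines of robots.txt, something like the following (the /flag/ and /api/ prefixes are guesses; substitute whatever paths your flag links and API endpoints actually live under):

    User-agent: *
    Disallow: /flag/
    Disallow: /api/

Bear in mind that robots.txt is advisory: the well-behaved crawlers honour it, but it's not access control.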
If the APIs return an appropriate Content-Type and the crawlers still retrieve them, then the crawlers are either genuinely interested in indexing the content retrieved by those APIs, or they're buggy and you should report the issue.

Cheers,

--
Jasper Bryant-Greene
Network Engineer, Unleash
ddi: +64 3 978 1222
mob: +64 21 129 9458