Jasper Bryant-Greene wrote:
On Tue, 21 Oct 2008 14:29:44 +1400, "Alex Hague" wrote:

I'm from a small (but growing) Kiwi website that has community-generated content etc. I think there is an additional point that has been missed so far: the web is no longer static. The uncertainty principle begins to apply: by crawling entire sites, they may inadvertently begin to interact with the content on those sites.
As far as I'm aware, the big crawlers don't perform POST, PUT or DELETE requests.
Seeing as HTTP requires GET to be safe and idempotent, taking no action other than retrieval, crawlers won't "interact" with well-designed websites, if by "interact" you mean "change stuff".
Web crawlers can wander around repeatedly grabbing dynamic content that is expensive to generate. This content may be built from database queries and require quite a bit of CPU and/or memory to convert into a form that's usable in a browser. A user grabs a few pages of this slow dynamic content to answer whatever question they have; a crawler, because there are potentially infinite ways of presenting the data, may inadvertently start using up a lot of RAM and CPU.

Often caching is applied to queries, since multiple people generally end up making similar, or even exactly the same, queries. However, a bot generating millions of these queries can quickly fill up the cache, forcing expiry of "old" (but not yet stale) entries, or, if the code wasn't written sufficiently well, filling up the disk the cache is stored on.

Usually you protect such content with robots.txt and robots meta tags telling robots to stay away. I believe my sites as currently set up can handle this kind of load, but in the past it has been a problem. Keeping robots out of an area may be for the robots' (and the server's) protection, not an attempt to "hide" the content. (Although I agree that people making sites that can change state with a GET are asking for trouble.)
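For concreteness, this is roughly the kind of rule I mean (the paths here are made up for the example and would vary per site):

    # robots.txt at the site root: keep compliant crawlers out of
    # expensive, endlessly-parameterised query pages
    User-agent: *
    Disallow: /search
    Disallow: /compare/

plus, for individual pages that shouldn't be indexed or have their links followed, a meta tag in the document head:

    <meta name="robots" content="noindex, nofollow">

Of course both of these only help against well-behaved robots; badly behaved ones ignore them entirely.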