On Tue, Oct 21, 2008 at 1:39 PM, Dean Pemberton <nznog@deanpemberton.com> wrote:

So make sure your views have been expressed on the list by someone.

I'm from a small (but growing) Kiwi website with community-generated content. I think there is an additional point that has been missed so far: the web is no longer static. The uncertainty principle begins to apply: by crawling entire sites, they may inadvertently interact with the content on those sites.

For example, a page can carry links that flag content as inappropriate. We use robots.txt to keep crawlers from hitting this kind of link, and from indexing our APIs, which return XML or JSON and are of no use to a crawler (though crawlers seem to love indexing them).
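To make that concrete, here is a rough sketch of the kind of rules I mean, checked with Python's standard urllib.robotparser. The /flag/ and /api/ paths and the "ExampleBot" name are made up for illustration and are not our real layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /flag/ holds the "flag as inappropriate" links and
# /api/ holds the XML/JSON endpoints. Both paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /flag/
Disallow: /api/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the rules above let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/articles/1", "/flag/123", "/api/items.json"):
        print(path, allowed("ExampleBot", path))
```

A crawler that runs a check like this before each request never touches the flag links or the API endpoints, and ordinary content pages stay crawlable.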

The issue for us isn't that they are indexing our site; it is that they are disobeying robots.txt. Even that in itself wouldn't be a problem if they gave a heads-up in future and compromised by following robots.txt entries that target their user agent.
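A sketch of what that compromise could look like, again using Python's standard urllib.robotparser: a group addressed to one crawler by name, alongside the general rules. "ExampleCrawler" and the paths are placeholders I have invented, not the real user agent or our real URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the first group targets one named crawler,
# the second applies to every other agent.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /flag/
Disallow: /api/

User-agent: *
Disallow: /flag/
"""


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(may_fetch("ExampleCrawler/1.0", "/api/items"))  # targeted group
    print(may_fetch("OtherBot", "/api/items"))            # wildcard group
```

The point is that a site could keep one specific crawler away from the links that change state without shutting it out of the content, provided the crawler honours the group written for it.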

We lurk on this list specifically to be aware of this kind of activity. It is incredibly arrogant of them to undertake such a massive project, knowing that they will cause headaches (by not following robots.txt), without consulting such an obvious place as NZNOG.

Cheers,

Alex