On 23/09/14 12:07, Jay Gattuso wrote:
> If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.
Or of going in circles fetching the same content over and over again from multiple variations on the same URL (various dynamically generated URLs have had this problem in the past -- some may still have it -- and robots.txt is one quick way to avoid that rabbit hole).

There is a fourth option that I'd suggest you at least consider:

4) define a specific robots.txt tag for the "DIA WOD" crawl, publicise it widely at least a month in advance of the actual crawl, and respect _those_ specific robots.txt line(s) (but otherwise, eg, at least go with 2 where there's no "directed at DIA WOD" rule, in the interests of completeness).

Since the crawl isn't being done by stealth (unlike the early ones) there is an opportunity to _work_with_ the NZ operators and come up with a cooperative solution, rather than just trying to guess from what is there now. Eg, allow operators to "opt out", but require they do so _specifically_. (Cf, blocking the crawler IPs at their border routers, which is also a form of opt out, but much less granular.)

Ewen
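P.S. To sketch what that specific opt out might look like: assuming the crawl announced a dedicated user-agent token (the name "NLNZ-DIA-WOD" below is purely hypothetical -- it would be whatever actually gets publicised), an operator who wanted to keep one section out of the DIA WOD crawl, but leave everything else open, could publish something like:

    # Opt /members/ out of the DIA WOD crawl only
    User-agent: NLNZ-DIA-WOD
    Disallow: /members/

    # All other crawlers keep the default (everything allowed)
    User-agent: *
    Disallow:

On the crawler side, the standard library robots.txt parser is enough to honour that kind of rule (Python used just for illustration):

    import urllib.robotparser

    robots_txt = """\
    User-agent: NLNZ-DIA-WOD
    Disallow: /members/

    User-agent: *
    Disallow:
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    # The opted-out path is skipped for this crawl...
    print(rp.can_fetch("NLNZ-DIA-WOD", "http://example.org.nz/members/page"))  # False
    # ...but the rest of the site is still fair game.
    print(rp.can_fetch("NLNZ-DIA-WOD", "http://example.org.nz/about"))         # True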