On 23/09/14 12:07, Jay Gattuso wrote:
> If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.
Or of going in circles fetching the same content over and over again from multiple variations on the same URL (various dynamically generated URLs have had this problem in the past -- some may still have it -- and robots.txt is one quick way to avoid that rabbit hole).

There is a fourth option that I'd suggest you at least consider:

4) define a specific robots.txt tag for the "DIA WOD" crawl, publicise it widely at least a month in advance of the actual crawl, and respect _those_ specific robots.txt line(s) (but otherwise, eg, at least go with 2 where there's no "directed at DIA WOD" rule, in the interests of completeness).

Since the crawl isn't being done by stealth (unlike the early ones) there is an opportunity to _work_with_ the NZ operators and come up with a cooperative solution, rather than just trying to guess from what is there now. Eg, allow operators to "opt out", but require they do so _specifically_. (Cf, blocking the crawler IPs at their border routers, which is also a form of opt out, but much less granular.)

Ewen
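P.S. To sketch what that specific opt out might look like: assuming the crawl announced a dedicated user-agent token (the name "NLNZ-DIA-WOD" below is purely hypothetical -- it would be whatever actually gets publicised), an operator who wanted to keep one section out of the DIA WOD crawl, but leave everything else open, could publish something like:

    # Opt /members/ out of the DIA WOD crawl only
    User-agent: NLNZ-DIA-WOD
    Disallow: /members/

    # All other crawlers keep the default (everything allowed)
    User-agent: *
    Disallow:

On the crawler side, the standard library robots.txt parser is enough to honour that kind of rule (Python used just for illustration):

    import urllib.robotparser

    robots_txt = """\
    User-agent: NLNZ-DIA-WOD
    Disallow: /members/

    User-agent: *
    Disallow:
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    # The opted-out path is skipped for this crawl...
    print(rp.can_fetch("NLNZ-DIA-WOD", "http://example.org.nz/members/page"))  # False
    # ...but the rest of the site is still fair game.
    print(rp.can_fetch("NLNZ-DIA-WOD", "http://example.org.nz/about"))         # True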