I'm not sure there's any excuse for ignoring robots.txt - it's just plain rude. A webmaster should retain the ability to influence whether their content is picked up or not.
With that in mind I suspect (2) is the right answer.

Mark.

On 23/09/2014 12:07 p.m., Jay Gattuso wrote:

Hi Noggers,

 

We are looking at undertaking another Whole of Domain (WOD) web harvest during this financial year. (I will be in touch with the specifics once we have some of the timings formalised with the crawl contractor, the Internet Archive.)

 

I wanted to ask for your views on adherence to robots.txt.

 

We have three options that are hypothetically available to us, each with various merits.

 

1) Strictly adhere - will miss some artefacts required to replay a site correctly

2) Adhere but follow embeds - allows us to get linked images, CSS, some JSON, etc.

3) Ignore - the most complete crawl.
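For anyone curious what option (2) means in practice, here is a minimal sketch of the fetch decision, using Python's standard robots.txt parser. The user-agent string, function name, and embed heuristic are all illustrative assumptions, not taken from any actual harvester configuration:

```python
# Sketch of option (2): honour robots.txt when following links, but
# still fetch embedded resources (images, CSS, JSON) needed for replay.
# Names here are hypothetical, not from any real crawler.
from urllib.robotparser import RobotFileParser

# Crude heuristic for "embed" resources; a real crawler would use the
# HTML context (img/link/script tags) rather than the URL suffix.
EMBED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js", ".json")

def should_fetch(url: str, robots: RobotFileParser, is_embed: bool) -> bool:
    """Return True if the crawler may fetch this URL under option (2)."""
    if is_embed or url.endswith(EMBED_SUFFIXES):
        return True  # embeds are fetched regardless of robots.txt
    return robots.can_fetch("example-harvester", url)  # hypothetical agent

# Example: a robots.txt that disallows /assets/ entirely
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /assets/"])

print(should_fetch("http://example.nz/assets/page.html", robots, False))  # False
print(should_fetch("http://example.nz/assets/style.css", robots, False))  # True
```

Under option (1) the CSS file above would be skipped and the page would replay unstyled; under option (3) the robots check would be dropped entirely.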

 

On the last WOD (in 2013) we used option (2), and I wanted to test if this was still the right option to choose.

 

We aim to capture the most complete snapshot of the NZ web possible so it is available to future generations. This robots.txt issue directly affects the completeness of the crawl we can undertake.

 

If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.

 

I'm really keen to make sure we capture the most complete WOD we can, and I welcome your opinions on the above, on or off list.

 

Many thanks,

 

Jay

 

Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy

National Library of New Zealand | Te Puna Mātauranga o Aotearoa

PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064

jay.gattuso@dia.govt.nz

 



_______________________________________________
NZNOG mailing list
NZNOG@list.waikato.ac.nz
http://list.waikato.ac.nz/mailman/listinfo/nznog