Hi Noggers,
We are looking at undertaking another Whole of Domain (WOD) web harvest during this financial year. (I will be in touch with the specifics once we have some of the timings formalised with the crawl contractor, the Internet Archive.)
I wanted to ask for your views on adherence to robots.txt.
We have three options hypothetically available to us, each with its own merits.
1) Strictly adhere - will miss some artefacts required to replay a site correctly
2) Adhere, but follow embeds - allows us to get linked images, CSS, some JSON etc. (illustrated in the sketch below this list)
3) Ignore - the most complete crawl.
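To make the difference between (1) and (2) concrete, here is a rough Python sketch of the decision logic. It is purely illustrative - the robots.txt rules, the "NLNZ-harvester" user-agent, the example URL and the list of embed extensions are all made up, and this is not our actual crawler configuration:

from urllib import robotparser

# Hypothetical robots.txt that disallows the /assets/ path for all agents.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /assets/",
])

# Illustrative list of embedded-resource types needed to replay a page.
EMBED_TYPES = (".css", ".js", ".json", ".png", ".jpg", ".gif", ".svg")

def should_fetch(url: str, is_embed: bool, policy: int) -> bool:
    """Would the harvester fetch `url` under the given policy?

    policy 1: strictly adhere to robots.txt for every URL.
    policy 2: adhere for pages/links, but fetch embedded resources anyway,
              so captured pages can be replayed correctly.
    policy 3: ignore robots.txt entirely.
    """
    if policy == 3:
        return True
    if policy == 2 and is_embed:
        return True
    return rp.can_fetch("NLNZ-harvester", url)  # hypothetical user-agent

# A stylesheet disallowed by robots.txt is skipped under (1) but kept under (2):
css_url = "https://example.govt.nz/assets/site.css"
print(should_fetch(css_url, css_url.endswith(EMBED_TYPES), policy=1))  # False
print(should_fetch(css_url, css_url.endswith(EMBED_TYPES), policy=2))  # True

In short, under (1) the stylesheet above is skipped because robots.txt disallows it, while under (2) it is fetched anyway because an already-captured page needs it to replay correctly.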
On the last WOD (in 2013) we used option (2), and I wanted to check whether it is still the right option.
We aim to capture the most complete snapshot of the NZ web possible so it is available to future generations. This robots.txt issue directly affects the completeness of the crawl we can undertake.
If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.
I'm really keen to make sure we capture the most complete WOD we can, so I'd welcome your opinions on the above, on or off list.
Many thanks,
Jay
Jay Gattuso | Digital Preservation Analyst | Preservation, Research and
Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064