Hi Noggers,

We are looking at undertaking another Whole of Domain (WOD) web harvest during this financial year. (I will be in touch with the specifics once we have some of the timings formalised with the crawl contractor, the Internet Archive.)

I wanted to ask for your views on adherence to robots.txt. We have three options that are hypothetically available to us, each with various merits:

1) Strictly adhere - will miss some artefacts required to replay a site correctly
2) Adhere but follow embeds - allows us to get linked images, CSS, some JSON, etc.
3) Ignore - the most complete crawl.

On the last WOD (in 2013) we used option (2), and I wanted to test if this was still the right option to choose.

We aim to capture the most complete snapshot of the NZ web possible so it is available to future generations. This robots.txt issue directly affects the completeness of the crawl we can undertake. If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.

I'm really keen to make sure we are able to capture the most complete WOD we can, and seek your opinions on the above, on or off list.

Many thanks,

Jay

Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064
jay.gattuso(a)dia.govt.nz
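[For clarity on how option (2) differs from (1) and (3) in practice, here is a minimal sketch using Python's standard urllib.robotparser. The user-agent token, the policy names, and the embed handling are hypothetical illustrations, not the Internet Archive's actual crawler configuration.]

```python
# Minimal sketch (assumed logic, not the Internet Archive's crawler) of the three
# proposed policies. Under option (2), robots.txt is consulted for page fetches,
# but resources embedded by an allowed page (images, CSS, JSON) are fetched anyway
# so the page can be replayed correctly later.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "NLNZ-WOD-Harvester"  # hypothetical crawl user-agent token

_robots_cache = {}

def robots_for(url):
    """Fetch and cache robots.txt for the host serving `url`."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if origin not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(origin, "/robots.txt"))
        rp.read()
        _robots_cache[origin] = rp
    return _robots_cache[origin]

def should_fetch(url, is_embed, policy="adhere_follow_embeds"):
    """Return True if `url` should be fetched under the given policy."""
    if policy == "ignore":                    # option (3): harvest everything
        return True
    if policy == "adhere_follow_embeds" and is_embed:
        return True                           # option (2): embeds bypass robots.txt
    return robots_for(url).can_fetch(USER_AGENT, url)  # option (1), and page fetches under (2)
```

[Under this sketch a stylesheet linked from an allowed page would still be collected even if its path is disallowed, which is exactly the completeness gap option (1) leaves open.]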
I'm not sure there's any excuse for ignoring robots.txt - it's just plain rude. A webmaster should retain the ability to influence whether their content is picked up or not. With that in mind I suspect (2) is the right answer.

Mark.

On 23/09/2014 12:07 p.m., Jay Gattuso wrote:
Hi Noggers,
We are looking at undertaking another Whole of Domain (WOD) web harvest during this financial year. (I will be in touch with the specifics once we have some of the timings formalised with the crawl contractor, the Internet Archive.)
I wanted to ask for your views on adherence to robots.txt.
We have three options that are hypothetically available to us, each with various merits.
1) Strictly adhere - will miss some artefacts required to replay a site correctly
2) Adhere but follow embeds - allows us to get linked images, CSS, some JSON, etc.
3) Ignore - the most complete crawl.
On the last WOD (in 2013) we used option (2), and I wanted to test if this was still the right option to choose.
We aim to capture the most complete snapshot of the NZ web possible so it is available to future generations. This robots.txt issue directly affects the completeness of the crawl we can undertake.
If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.
I'm really keen to make sure we are able to capture the most complete WOD we can, and seek your opinions on the above, on or off list.
Many thanks,
Jay
*Jay Gattuso* | Digital Preservation Analyst | Preservation, Research and Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064
jay.gattuso(a)dia.govt.nz
On Tue, 23 Sep 2014, Mark Foster wrote:
I'm not sure there's any excuse for ignoring robots.txt - it's just plain rude. A webmaster should retain the ability to influence whether their content is picked up or not. With that in mind I suspect (2) is the right answer.
Well the problem is that it is common for people to block all bots except Google (and perhaps a couple of others). AFAIK for this crawl they did have a specific line you could use to block them if you really wanted to.

--
Simon Lyall | Very Busy | Web: http://www.simonlyall.com/
"To stay awake all night adds a day to your life" - Stilgar
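[To illustrate the pattern Simon describes, here is a small sketch (the site and the archive crawler's user-agent name are hypothetical) showing how a "Googlebot only" robots.txt reads to an adhering archive crawler, using Python's standard urllib.robotparser.]

```python
# Sketch of a robots.txt that admits Googlebot but blocks everything else.
# An archive crawler that adheres to robots.txt would have to skip this site,
# even though the operator probably only meant to keep out badly behaved bots.
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "http://example.org.nz/page.html"))           # True
print(rp.can_fetch("NLNZ-WOD-Harvester", "http://example.org.nz/page.html"))  # False
```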
On 23/09/14 12:07, Jay Gattuso wrote:
If we choose the wrong option, there is a risk of reduced crawl coverage and frustrated web content operators.
Or of going in circles fetching the same content over and over again from multiple variations on the same URL (various dynamically generated URLs have had this problem in the past -- some may still have this problem -- and robots.txt is one quick way to avoid that rabbit hole).

There is a fourth option that I'd suggest you at least consider:

4) define a specific robots.txt tag for the "DIA WOD" crawl, publicise it widely at least a month in advance of the actual crawl, and respect _those_ specific robots.txt line(s) (but otherwise, eg, at least go with 2 where there's no "directed at DIA WOD" rule, in the interests of completeness).

Since the crawl isn't being done by stealth (unlike the early ones) there is an opportunity to _work_with_ the NZ operators and come up with a cooperative solution, rather than just trying to guess from what is there now. Eg, allow operators to "opt out", but require they do so _specifically_. (Cf, blocking the crawler IPs at their border routers, which is also a form of opt out, but much less granular.)

Ewen
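[A sketch of what Ewen's option (4) could look like from an operator's side, again using urllib.robotparser. The "DIA-WOD" token and the example rules are purely illustrative; no such token has actually been published.]

```python
# Sketch of a crawl-specific opt-out: the operator targets the published WOD
# token only, so other crawlers still follow the site's general rules.
from urllib import robotparser

robots_txt = """\
# Opt this site out of the Whole of Domain harvest only
User-agent: DIA-WOD
Disallow: /

# Everyone else follows the usual rules
User-agent: *
Disallow: /cgi-bin/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("DIA-WOD", "http://example.org.nz/index.html"))       # False: opted out of the WOD crawl
print(rp.can_fetch("SomeOtherBot", "http://example.org.nz/index.html"))  # True: only /cgi-bin/ is blocked
```

[The point of the specific token is granularity: an operator can refuse the harvest without blanket-blocking every crawler, and without resorting to blocking the crawler's IPs at the border router.]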
The crawls haven't been 'done by stealth' for a number of years now. As for engagement, these guys have been turning up to NZNOGs, posting to the mailing list and talking with operators about this since before NatLib was part of DIA.

Anyone who is taken by surprise hasn't been awake =)

Dean
Since the crawl isn't being done by stealth (unlike the early ones) there is an opportunity to _work_with_ the NZ operators and come up with a cooperative solution, rather than just trying to guess from what is there now. Eg, allow operators to "opt out", but require they do so _specifically_. (Cf, blocking the crawler IPs at their border routers, which is also a form of opt out, but much less granular.)
+1 to that. Archiving everything seems like a good thing. Google and the NSA are quite good at it too. Anything you might miss, just check with the above.

On 23/09/2014 2:48 p.m., Dean Pemberton wrote:
The crawls haven't been 'done by stealth' for a number of years now. As for engagement, these guys have been turning up to NZNOGs, posting to the mailing list and talking with operators about this since before NatLib was part of DIA.
Anyone who is taken by surprise hasn't been awake =)
Dean
Since the crawl isn't being done by stealth (unlike the early ones) there is an opportunity to _work_with_ the NZ operators and come up with a cooperative solution, rather than just trying to guess from what is there now. Eg, allow operators to "opt out", but require they do so _specifically_. (Cf, blocking the crawler IPs at their border routers, which is also a form of opt out, but much less granular.)
participants (6)
- Alan Maher
- Dean Pemberton
- Ewen McNeill
- Jay Gattuso
- Mark Foster
- Simon Lyall