[nznog] Whole of Domain (.nz) 2016 Web Harvest

22 Dec 2015

Hi Noggers,

Further to previous emails (6th Nov 2015), we are all set to kick off the web harvest in the new year.

We’ll be starting the crawl on the 11th Jan 2016, and expect to be crawling for approximately 3 to 4 weeks.

The crawlers will use the user agent string “NLNZ_IAHarvester2016”, so if you need to set any specific rules for our crawlers, this is the identifier to use.
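
As a rough illustration only (not an official Library recommendation), the sketch below shows how a webmaster might check a robots.txt rule set against this user agent before the crawl starts. The robots.txt content, the paths /private/ and /drafts/, and the helper name allowed() are all hypothetical; only the user agent string comes from this notice. It uses Python's standard urllib.robotparser, which models ordinary robots.txt matching and does not reflect the exceptions the harvest applies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths are placeholders, not real site rules.
ROBOTS_TXT = """\
User-agent: NLNZ_IAHarvester2016
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""


def allowed(path, agent="NLNZ_IAHarvester2016"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html", "/drafts/post.html"):
        print(path, allowed(path))
```

In this sketch the harvester-specific group takes precedence over the wildcard group, so the harvester is kept out of /private/ while the /drafts/ rule applies only to other crawlers.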

The robots.txt and Robots META tag exclusions on crawled sites will be obeyed, with some minor exceptions:
- We’ll strictly obey all rules that relate to the user agent (apart from slash pages, which will be harvested regardless).
- Facebook and some other curated social media sites will be harvested regardless of their robots.txt.

The Crawl Notification (Notice to Webmasters) page is on the Library’s web site at: http://natlib.govt.nz/publishers-and-authors/web-harvesting/domain-harvest

If you have any questions or concerns about the harvest, please drop me a line. I’ll watch for email at various points over the Christmas break, and will be back in the office on the 4th Jan to address any questions or concerns.

Best,

Jay

Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064
jay.gattuso(a)dia.govt.nz<mailto:jay.gattuso(a)natlib.govt.nz>