National Library of New Zealand Web Harvest 2013

Hi All,

This email is to inform you of the NLNZ's intent to undertake its 3rd Whole of Domain Web Harvest. The planned dates for this exercise are the 4th to the 22nd of February 2013. The harvest will be undertaken by the Internet Archive (http://archive.org/index.php) on our behalf. More details can be found on the web harvest website: http://natlib.govt.nz/publishers-and-authors/web-harvesting/2013-nz-web-harv...

If anyone has any specific concerns, requests or questions, please do not hesitate to contact me directly: jay.gattuso(a)dia.govt.nz, or via the project email address: Web.Archive(a)dia.govt.nz. If anyone wants to discuss this further on the mailing list I am happy to do so.

Kind regards, Jay

Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064
jay.gattuso(a)dia.govt.nz

Hi All,

Can anyone tell me where this harvest is being done from this year, and what IP range to block so it doesn't trawl through my machines?

D

On 13/12/2012 1:22 p.m., Jay Gattuso wrote:
-- Don Gould 31 Acheson Ave Mairehau Christchurch, New Zealand Ph: + 64 3 348 7235 Mobile: + 64 21 114 0699

The linked document states that:

    Robots.txt files with a rule that refers specifically to the *NLNZ_IAHarvester2013* user agent will be strictly obeyed.

Which would seem to be a much better way to handle this than an IP block...
Scott
On Fri, Jan 18, 2013 at 10:28 PM, Don Gould
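[For reference, a rule scoped to that user agent would be the two lines below. This example disallows everything for the harvester while leaving other crawlers untouched; a narrower `Disallow` path also works.]

```
User-agent: NLNZ_IAHarvester2013
Disallow: /
```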

Hi Scott,

I was aware from previous years that Robots.txt rules can be put in place. My problem is that I have an extensive number of domains, and it would take me some time to edit rules into every one.

In years past I understand that the harvest has been done from servers overseas. In the past that hasn't been an issue for me, as my content has been overseas. However, in the past 12 months I've been making moves to improve local performance by putting content locally. The cost of that is that requests for content from O/S are way more expensive than before. I also have much more content than I used to.

So, simply blocking the IPs in a couple of routers makes much more sense for me.

D

On 19/01/2013 7:45 p.m., Scott Howard wrote:
-- Don Gould 31 Acheson Ave Mairehau Christchurch, New Zealand Ph: + 64 3 348 7235 Mobile: + 64 21 114 0699

So, simply blocking the IPs in a couple of routers makes much more sense for me.
It's all coming from the following for us (across lots of domains):

207.241.226.39
207.241.226.40

ie:

207.241.226.40 - - [19/Jan/2013:15:53:04 +1300] "GET /index.html HTTP/1.0" 200 3653 "-" "Mozilla/5.0 (compatible; NLNZ_IAHarvester2013 +http://natlib.govt.nz/about-us/current-initiatives/web-harvest-2012)" 1275
207.241.226.39 - - [19/Jan/2013:19:27:40 +1300] "GET /robots.txt HTTP/1.0" 200 19 "-" "Mozilla/5.0 (compatible; NLNZ_IAHarvester2013 +http://natlib.govt.nz/about-us/current-initiatives/web-harvest-2012)" 363

--
Jean-Francois Pirus | Technical Manager
francois(a)clearfield.com | Mob +64 21 640 779 | DDI +64 9 282 3401
Clearfield Software Ltd | Ph +64 9 358 2081 | www.clearfield.com
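[For anyone taking the IP-block route, a sketch of the corresponding firewall rules, assuming a Linux box in the path. The two addresses are the ones reported in the logs above; the script only prints the commands, so remove the `echo` (and run as root) to actually apply them.]

```shell
#!/bin/sh
# Sketch only: emit iptables rules dropping traffic from the two
# harvester source addresses seen in the access logs.
for ip in 207.241.226.39 207.241.226.40; do
    echo iptables -A INPUT -s "$ip" -j DROP
done
```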

Can you please give us the calculations that show that writing a one-line shell script and serving up robots.txt a handful of times is more expensive than spending your time writing emails objecting to the same?

I'll be happy to sell you bandwidth for slightly cheaper to help you cut costs, based on your calculations of course. I'll even give you a ride to the NZNOG venue from the airport in the Lamborghini you'll have bought me by mid next week.

Also please provide similar calculations taking into account each further reply so I can adjust my prices accordingly.

On Saturday, January 19, 2013, Don Gould wrote:
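[The one-line script Nathan alludes to might look something like this sketch. The `DIR/<site>/htdocs` docroot layout is an assumption; adjust the glob to match the real hosting setup.]

```shell
#!/bin/sh
# add_harvest_rule DIR: append a rule for the NLNZ harvester's user
# agent to every vhost robots.txt under DIR (hypothetical layout
# DIR/<site>/htdocs; existing robots.txt files are appended to).
add_harvest_rule() {
    for d in "$1"/*/htdocs; do
        [ -d "$d" ] || continue
        printf 'User-agent: NLNZ_IAHarvester2013\nDisallow: /\n' >> "$d/robots.txt"
    done
}
```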
participants (5)

- Don Gould
- Jay Gattuso
- Jean-Francois Pirus
- Nathan Ward
- Scott Howard