On 20/10/2008, at 4:31 PM, Murray Fox wrote:
Hi All,
This was news to me ('til international traffic on some of my sites began to climb): the National Library is this month ripping the entire NZ Internet ... from an international host ... blatantly ignoring robots.txt.
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
International traffic in NZ often costs end customers real money (especially in colo situations). A site I'm involved with, whose content is entirely NZ-oriented, served up 2GB to these guys in one day (that's half a normal month of international traffic).
Never mind the sovereignty question; presumably the goal is that by the end of Oct the entire NZ Internet will be mirrored at nlnzdata000.us.archive.org :-/ apologies
[quote] The harvested web pages will be stored at the Library, and will eventually be made publicly accessible. [/quote] Presumably it's cheaper for them to access a large amount of NZ content from a server in the US, and/or the archive.org guys are doing the hard yards for them.
Anyone else raging about this? Ignoring the robots.txt issue, why couldn't this have been done over national links? Or damn, at the very least they could have pinged off an email to webmaster(a)target.co.nz ahead of ripping the site to let admins get prepared?
I haven't seen it hit any of my content yet, perhaps because it's on .com. It'll be interesting; I've got many, many GB of data (forums, etc.). But Google etc. walk my content fairly regularly, so I'm not that concerned.
-- Nathan Ward