Quote from their site:

"If you ignore robots.txt, what's to stop me blocking your crawler's IP address?

Nothing. Some webmasters have taken this action, and we're sorry they felt they had to go to these lengths. We are running this harvest with good intentions, and ask that if you have blocked us, you reconsider - for example by allowing the harvester to access your site on the condition that it honours robots.txt. We'd much prefer this outcome to getting nothing from your websites at all. Please remember that this project is about trying to ensure that as much as possible of the social history being enacted on the web today is available to researchers and all New Zealanders in the future. If we don't capture it now, we may not have the chance later."

If you wanted to, you could simply ask them to obey your robots.txt.

Brad Pearpoint

-----Original Message-----
From: Craig Whitmore [mailto:lennon(a)orcon.net.nz]
Sent: Tuesday, 21 October 2008 10:51 a.m.
To: Michael Jager
Cc: nznog(a)list.waikato.ac.nz
Subject: Re: [nznog] NLNZHarvester2008

As a reasonably large provider of NZ-based hosting services, we've certainly noticed NatLib's mirroring activity as a non-negligible increase in our international bandwidth utilisation over the last few weeks. Yes, there are a million and one things that'll have an impact on your bandwidth consumption. However, I echo Murray's frustration that a New Zealand organisation either wouldn't think about this or has completely ignored it.

Also there is the problem of "NZ Only Content" which the international spiders will not see/get, such as the tvnzondemand and the citylink NZ Anycast content (+ others, I am sure).

They also note that they will "stop" at a certain point, so some large websites will not get mirrored 100%. So what's the point of that? You either want all or nothing.

I am 100% sure there is some content people don't want to be spidered. A quick .htaccess to deny their spider (I think this will work):

    RewriteEngine On
    Options +FollowSymLinks
    RewriteCond %{HTTP_USER_AGENT} ^NLNZHarvester2008
    RewriteRule ^.* - [F,L]

(And the spider will get a 403 - hopefully)

Thanks
Craig
_______________________________________________
NZNOG mailing list
NZNOG(a)list.waikato.ac.nz
http://list.waikato.ac.nz/mailman/listinfo/nznog
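
Since the .htaccess rule above is offered as "I think this will work", here is a rough Python sketch for checking which User-Agent headers its pattern would catch. It only imitates the RewriteCond regex test and is not Apache itself. The sample header strings, the would_block helper and the unanchored variant are all assumptions made for illustration. The harvester's real User-Agent string is not given in this thread, so check your own access logs before relying on the rule.

```python
import re

# Pattern copied from the RewriteCond in the post; "^" anchors the match
# to the very start of the User-Agent header.
ANCHORED = re.compile(r"^NLNZHarvester2008")

# Hypothetical looser variant (not in the original post): match the token
# anywhere in the header.
UNANCHORED = re.compile(r"NLNZHarvester2008")


def would_block(user_agent: str, pattern: re.Pattern = ANCHORED) -> bool:
    """Return True if a RewriteCond using `pattern` would match this header."""
    return pattern.search(user_agent) is not None


# Made-up header values, for illustration only.
SAMPLES = [
    "NLNZHarvester2008 (+http://example.invalid/harvest)",
    "Mozilla/5.0 (compatible; NLNZHarvester2008)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        # Columns: anchored match, unanchored match, header value.
        print(would_block(ua), would_block(ua, UNANCHORED), ua)
```

If the spider's header turns out to start with something else (a "Mozilla/5.0 (compatible; ..." prefix, say), the anchored pattern would miss it, and dropping the "^" would be the obvious adjustment.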