Hi all,

Thanks to those of you who have contacted me after reading this thread. If you have not read our FAQ, we have addressed several of the issues raised here in that document:

http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs

There are also a few new issues raised here that I'll be adding to the FAQ shortly, particularly around notifying webmasters in advance and harvesting from an international service, both of which we could have handled better:

1. Several people have asked about notification, and some of you have made practical and workable suggestions about how we can handle this better next time. In the current crawl, we could not see a good way to notify site owners without effectively becoming spammers. In hindsight, we could have communicated better with webmasters. When we decide to run the harvest again, we will make more of an effort to publicise it on mailing lists and in groups frequented by webmasters (such as this one).

2. Others have asked why we are harvesting from the USA rather than from New Zealand. We contracted the Internet Archive to conduct the harvest because they are the single most experienced provider of large-scale crawling services in the world. An unfortunate side effect is that their servers are based in the USA. We hope that after observing the experts at work we will be able to manage future harvests from within New Zealand; at the very least, we have learned that we should locate some of the harvest servers in New Zealand.

3. bmanning asks whether this is a recurring or one-off harvest. While we have not planned any further harvests at this time, it is likely that domain harvests will become a feature of the Library's overall web harvesting programme. Analysis of the current harvest, and research into various access issues, will help determine the frequency.

Finally, as we have seen in other forums, many of you object to our robots.txt policy. Again, I can only say that we understand your point of view, but in the context of this crawl we believe it is important to harvest as much of the domain as possible, in order to preserve the web as it is today for the New Zealanders and researchers of the future. Simon Lyall illustrates our dilemma exactly above: people use robots.txt for different reasons, and in a perfect world (or even an imperfect one in which robots.txt had developed into a standard) we would know why each robots.txt rule was written, and crawl accordingly. (There is a small sketch of this ambiguity below my signature.)

Thanks,
Gordon

--
Gordon Paynter
Technical Analyst
National Digital Library
National Library of New Zealand
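
P.S. For those interested, here is a minimal sketch of the robots.txt ambiguity described above, using Python's standard-library parser. The rules, paths, and user-agent name are hypothetical examples made up for illustration, not taken from any real site:

    from urllib import robotparser

    # Two Disallow rules that look identical to a crawler but were
    # written for very different reasons: one guards against a crawler
    # trap, the other expresses a privacy preference. The parser
    # strips the trailing comments, so both rules reach the harvester
    # with no indication of intent.
    robots_lines = [
        "User-agent: *",
        "Disallow: /calendar/   # infinite date pages: a crawler trap",
        "Disallow: /staff/      # owner prefers these left unarchived",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)

    for url in ("http://example.org/calendar/2008/10/",
                "http://example.org/staff/directory.html"):
        # can_fetch() reports only *whether* a path is disallowed,
        # never *why* the rule was written, so a polite harvester
        # must treat both rules the same way.
        print(url, "->", parser.can_fetch("NLNZHarvester", url))

Both URLs come back disallowed, and nothing in the file tells the harvester that skipping the first loses nothing of value while skipping the second loses a page someone might have wanted preserved.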