Re: [nznog] NLNZHarvester2008
Hi all: Thanks to those of you who have contacted me after reading this thread. If you have not read our FAQ, we have addressed several of the issues that were raised here in that document: http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs

There are also a few new issues raised here that I'll be adding to the FAQ shortly, particularly around notifying webmasters in advance and harvesting from an international service, both of which we could have handled better:

1. Several people have asked about notification, and some of you have made practical and workable suggestions about how we can handle this better next time. In the current crawl, we could not see a good way to do so without effectively becoming spammers. In hindsight we could have communicated better with webmasters. When we decide to run the harvest again, we will make more of an effort to publicise the harvest in mailing lists and groups frequented by webmasters (such as this one).

2. Others ask why we are harvesting from the USA and not New Zealand. We have contracted the Internet Archive to conduct the harvest because they are the single most experienced provider of large-scale crawling services in the world. An unfortunate offshoot of this is that their servers are based in the USA. We hope that after observing the experts at work we'll be able to manage future harvests from within New Zealand. At the very least we have learned that we should locate some of the harvest servers in New Zealand.

3. Bmanning asks whether this is a recurring or once-off harvest. While we have not planned any further harvests at this time, it is likely that domain harvests will become a feature of the Library's overall web harvesting programme. Analysis of the current harvest and research into various access issues will help determine frequency.

Finally, as we have seen in other forums, a lot of you object to our robots.txt policy. Again, I can only say we understand your point of view, but in the context of this crawl we believe it is important that we harvest as much of the domain as possible, in order to preserve the web as it is today for the New Zealanders and researchers of the future. Simon Lyall illustrates our dilemma exactly above: people use robots.txt for different reasons, and in a perfect world (or possibly even an imperfect one where robots.txt developed into a standard) we would know why each robots.txt rule was written, and crawl more appropriately.

Thanks,
Gordon

--
Gordon Paynter
Technical Analyst
National Digital Library
National Library of New Zealand
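To make the robots.txt dilemma above concrete, consider two hypothetical files (the site layouts and paths are invented for illustration). Both use exactly the same directive, but one was written to keep an unfinished section out of search results and the other to protect bandwidth on a large mirror; nothing in the file itself tells an archiving crawler which intent it is dealing with.

# Site A: keep a half-finished section out of search engines
User-agent: *
Disallow: /drafts/

# Site B: stop crawlers from pulling down a multi-gigabyte mirror
User-agent: *
Disallow: /mirror/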
On 21/10/2008, at 1:10 PM, Gordon Paynter wrote:
1. Several people have asked about notification, and some you have made practical and workable suggestions about how we can handle this better next time.
In the current crawl, we could not see a good way to do so without effectively becoming spammers. In hindsight we could have communicated better with webmasters. When we decide to run the harvest again, we will make more of an effort to publicise the harvest in mailing lists and groups frequented by webmasters (such as this one).
Can I suggest something here? Speak to the Most Evil Media too about the crawling. This is neither foolproof nor fail-safe, but the tech media should be interested (and I can see a few stories about the archive already) and could help publicise your intentions to those who don't participate in mailing lists.
2. Others ask why we are harvesting from the USA and not New Zealand?
We have contracted the Internet Archive to conduct the harvest because they are the single most experienced provider of large-scale crawling services in the world.
An unfortunate offshoot of this is that their servers are based in the USA.
We hope that after observing the experts at work we'll be able to manage future harvests from within New Zealand. At the very least we have learned that we should locate some of the harvest servers in New Zealand.
Umm, yes. :)
--
Juha Saarinen juha(a)saarinen.org http://www.techsploder.com
On Tue, Oct 21, 2008 at 2:10 PM, Gordon Paynter <Gordon.Paynter(a)natlib.govt.nz> wrote:
2. Others ask why we are harvesting from the USA and not New Zealand?
We have contracted the Internet Archive to conduct the harvest because they are the single most experienced provider of large-scale crawling services in the world.
Is there a particular reason why a NZ-based proxy cannot be used? I imagine this is not a short-term but a continuing project, and constant use of international bandwidth could be expensive for some NZ-based sites.

Second, for non-NZ-based sites with NZ-oriented content, what is Natlib's policy on archiving this material? Is access to this material covered by NZ copyright law, thus allowing Natlib to archive it? Or is it covered by the laws of the country in which it is hosted, making this archival process a legal question? Particularly material in the US, which has pretty strict (and stupid, IMO) copyright laws.

Thirdly, what is the legal status of NZ copyright material which is hosted overseas? Has Natlib read the recent Government ICT bulletin on use of offshore hosting services?

Nicholas
While I'm sure the good people at Natlib are on top of it, seeing as this is being discussed to the nth degree, my question would be: who will hold this archived material long term? I sincerely hope it will be physically (and diversely) stored and served in New Zealand, not at a faceless overseas company that can change ownership, collapse or be otherwise invaded. If we are paying to have this material preserved (which IMHO is entirely appropriate), it needs to be here in a government owned (or suitably contracted and legally covered private) repository. This is the only way to ensure that it will survive long term.
"Tony Wicks"
21/10/08 8:07 p.m. >>> While I'm sure the good people at Natlib are on top of it, but seeing as
Hi Tony: The archived material will be hosted at the National Library in New Zealand (i.e. in our machine room). I'll update the FAQ with this information when I get a chance. Gordon -- Gordon Paynter Technical Analyst National Digital Library National Library of New Zealand +64 4 474 3114 this is being discussed to the n'th degree. My question would be, who will hold this archived material long term ? I sincerely hope it will be physically (and diversely) stored and served in New Zealand not stored at a faceless overseas company that can change ownership, collapse or be otherwise invaded. If we are paying to have this material preserved (which IMHO is entirely appropriate), it needs to be here in a government owned (or suitably contracted and legally covered private) repository. This is the only way to ensure that it will survive long term.
Gordon,

Are you guys going to tell us how much data was collected in the end?

I also note that the crawler stepped off .nz and went after links that were in .nz pages. E.g. this page has links in it to pointclark.net which were also crawled: http://www.crra.org.nz/content/view/17/7/

Cheers
Don

Gordon Paynter wrote:
Hi Tony:
The archived material will be hosted at the National Library in New Zealand (i.e. in our machine room).
I'll update the FAQ with this information when I get a chance.
Gordon
-- Gordon Paynter Technical Analyst National Digital Library National Library of New Zealand +64 4 474 3114
"Tony Wicks"
21/10/08 8:07 p.m. >>> While I'm sure the good people at Natlib are on top of it, but seeing as this is being discussed to the n'th degree. My question would be, who will hold this archived material long term ? I sincerely hope it will be physically (and diversely) stored and served in New Zealand not stored at a faceless overseas company that can change ownership, collapse or be otherwise invaded. If we are paying to have this material preserved (which IMHO is entirely appropriate), it needs to be here in a government owned (or suitably contracted and legally covered private) repository. This is the only way to ensure that it will survive long term.
On 22/10/2008, at 1:17 PM, Don Gould wrote:
Gordon,
Are you guys going to tell us how much data was collected in the end?
I also note that the crawler stepped off .nz and went after links that were in .nz pages.
eg. This page has links in it to pointclark.net which were also crawled. http://www.crra.org.nz/content/view/17/7/
Why this happens is explained in the FAQ. -- Nathan Ward
I would be very interested in how much data was collected in this. I'd also be interested whether it was a basic harvest or there was some smart archiving done for duplicate files etc. E.g. a couple of customers have between 4 and 8 or so sites all pointing at the same place.

Also, as a computer tech I'll put up a download directory to grab programs from for cleaning customers' PCs (e.g. spyware utils, general apps, service packs). A quick look is showing 1 gig of data there, and we have that as .co.nz, .net.nz and as different domains, just so if I tell a customer over the phone to download something to fix it they won't make a mistake. Funny, you'll be using probably 4 gig just on my spyware apps and service packs because somehow it's documentary heritage to New Zealand... of programs made mostly in the States :)

Ah well, in 30 years I guess someone will be interested to see what the internet looked like back in 2008. It's also probably a cheaper option than our government spending $100 million to hire people to decide what should and shouldn't be kept.

Philip

-----Original Message-----
From: Don Gould [mailto:don(a)bowenvale.co.nz]
Sent: Wednesday, 22 October 2008 1:17 p.m.
To: Gordon Paynter
Cc: nznog(a)list.waikato.ac.nz
Subject: Re: [nznog] NLNZHarvester2008

Gordon,

Are you guys going to tell us how much data was collected in the end?

I also note that the crawler stepped off .nz and went after links that were in .nz pages. eg. This page has links in it to pointclark.net which were also crawled. http://www.crra.org.nz/content/view/17/7/

Cheers
Don

Gordon Paynter wrote:
Hi Tony:
The archived material will be hosted at the National Library in New Zealand (i.e. in our machine room).
I'll update the FAQ with this information when I get a chance.
Gordon
-- Gordon Paynter Technical Analyst National Digital Library National Library of New Zealand +64 4 474 3114
"Tony Wicks"
21/10/08 8:07 p.m. >>> While I'm sure the good people at Natlib are on top of it, but seeing as this is being discussed to the n'th degree. My question would be, who will hold this archived material long term ? I sincerely hope it will be physically (and diversely) stored and served in New Zealand not stored at a faceless overseas company that can change ownership, collapse or be otherwise invaded. If we are paying to have this material preserved (which IMHO is entirely appropriate), it needs to be here in a government owned (or suitably contracted and legally covered private) repository. This is the only way to ensure that it will survive long term.
Philip Seccombe wrote:
I would be very interested in how much data was collected in this.
I'd also be interested if it was a basic harvest or there was some smart archiving done for duplicate files etc Eg for a couple of customers they have between 4 and 8 or so sites all pointing at the same place
Also as a computer tech I'll put up a download directory to grab programs from for cleaning customers pc (eg spyware utils, general apps, service packs), a quick look is showing 1 gig of data there, and we have that as .co.nz .net.nz and as different domains just so if I tell a customer over the phone to download something to fix they won't make a mistake. Funny you'll be using probably 4gig just on my spyware apps and service packs because somehow its document heritage to New Zealand...of programs made mostly in the states :)
Ah well, in 30 years I guess someone will be interested to see what the internet looked like back in 2008. It's also probably a cheaper option than our government spending $100 million to hire people to decide what should and shouldn't be kept
Philip
Cheaper? Only for NatLib. We who host are paying the bill for this.

And no, they are not doing any smart filtering for duplication. They managed to download 260GB+ of international WHOIS and spam archives from an 80GB disk drive here before the harvest IPs got firewalled. I'm not pleased.

PS. natlib: robots.txt is often expressly set up to prevent this type of 'accident'.

AYJ
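To spell out the "smart filtering for duplication" idea raised above: a crawler that deduplicates by content hash would store the same bytes only once, however many domains point at them, though it would still consume the host's bandwidth for every copy unless it also deduplicated before downloading. The sketch below is only an illustration of the idea, not a description of what the Internet Archive's harvester actually does; store_record is a hypothetical placeholder for the archive-writing step.

import hashlib
import urllib.request

seen_digests = set()

def fetch_if_new(url):
    """Fetch a URL and keep the body only if its content hash has not been seen before."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_digests:
        return None  # same bytes already archived under another URL
    seen_digests.add(digest)
    # store_record(url, body)  # hypothetical WARC-style storage step
    return digest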
I'm guessing multiple domains pointing at the same data meant the 80gig was replicated to 260gig?

At what point can you say that international standards (i.e. robots.txt) were not observed, thus they should reimburse you for bandwidth costs?

260gig of spam is going to be a fantastic thing for someone to look through in the future, it really does show the true culture of New Zealand....</sarcasm>

Philip Seccombe

-----Original Message-----
From: TreeNet Admin [mailto:admin(a)treenetnz.com]
Sent: Sun 10/26/2008 7:08 PM
To: Philip Seccombe
Cc: nznog(a)list.waikato.ac.nz
Subject: Re: [nznog] NLNZHarvester2008

Philip Seccombe wrote:
I would be very interested in how much data was collected in this.
I'd also be interested if it was a basic harvest or there was some smart archiving done for duplicate files etc Eg for a couple of customers they have between 4 and 8 or so sites all pointing at the same place
Also as a computer tech I'll put up a download directory to grab programs from for cleaning customers pc (eg spyware utils, general apps, service packs), a quick look is showing 1 gig of data there, and we have that as .co.nz .net.nz and as different domains just so if I tell a customer over the phone to download something to fix they won't make a mistake. Funny you'll be using probably 4gig just on my spyware apps and service packs because somehow its document heritage to New Zealand...of programs made mostly in the states :)
Ah well, in 30 years I guess someone will be interested to see what the internet looked like back in 2008. It's also probably a cheaper option than our government spending $100 million to hire people to decide what should and shouldn't be kept
Philip
Cheaper? only for NatLib. We who host are paying the bill for this. And no, they are not doing any smart filtering for duplication. They managed to download 260GB+ of international WHOIS and spam archives from an 80GB disk drive here before the harvest IPs got firewalled. I'm not pleased. PS. natlib: robots.txt is often expressly setup to prevent this type of 'accident'. AYJ
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? If not.. why not?

If I were to do this sort of trolling.. personally, I would resolve addresses to IPs, sort out which are local NZ IPs and which are offshore, and troll from the appropriate source... as much as I feel the ignoring of robots.txt is against the goodwill and security of the big bad world :\

Is there an ombudsman in the NZ govt that has this sort of thing in their portfolio?

Just my tuppence worth

Russell Sharpe

_____

From: Philip Seccombe [mailto:philip(a)turnstone.co.nz]
Sent: Monday, 27 October 2008 10:14
To: TreeNet Admin
Cc: nznog(a)list.waikato.ac.nz
Subject: Re: [nznog] NLNZHarvester2008

I'm guessing multiple domains pointing at the same data meant the 80gig was replicated to 260gig? At what point can you say that international standards (ie robots.txt) were not observed thus they should reimburse you for bandwidth costs? 260gig of spam is going to be a fantastic thing for someone to look through in the future, it really does show the true culture of New Zealand....</sarcasm>

Philip Seccombe

-----Original Message-----
From: TreeNet Admin [mailto:admin(a)treenetnz.com]
Sent: Sun 10/26/2008 7:08 PM
To: Philip Seccombe
Cc: nznog(a)list.waikato.ac.nz
Subject: Re: [nznog] NLNZHarvester2008

Philip Seccombe wrote:
I would be very interested in how much data was collected in this.
I'd also be interested if it was a basic harvest or there was some smart archiving done for duplicate files etc Eg for a couple of customers they have between 4 and 8 or so sites all pointing at the same place
Also as a computer tech I'll put up a download directory to grab programs from for cleaning customers pc (eg spyware utils, general apps, service packs), a quick look is showing 1 gig of data there, and we have that as .co.nz .net.nz and as different domains just so if I tell a customer over the phone to download something to fix they won't make a mistake. Funny you'll be using probably 4gig just on my spyware apps and service packs because somehow its document heritage to New Zealand...of programs made mostly in the states :)
Ah well, in 30 years I guess someone will be interested to see what the internet looked like back in 2008. It's also probably a cheaper option than our government spending $100 million to hire people to decide what should and shouldn't be kept
Philip
Cheaper? only for NatLib. We who host are paying the bill for this. And no, they are not doing any smart filtering for duplication. They managed to download 260GB+ of international WHOIS and spam archives from an 80GB disk drive here before the harvest IPs got firewalled. I'm not pleased. PS. natlib: robots.txt is often expressly setup to prevent this type of 'accident'. AYJ
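For what it's worth, the "resolve addresses and sort local from offshore" step suggested above is easy enough to sketch. The prefixes below are RFC 5737 documentation placeholders, not real New Zealand allocations; a real split would be driven by the APNIC delegation data for NZ.

import ipaddress
import socket

# Placeholder prefixes standing in for the real list of NZ-delegated address ranges.
NZ_PREFIXES = [ipaddress.ip_network(p) for p in ("203.0.113.0/24", "198.51.100.0/24")]

def is_nz_hosted(hostname):
    """Resolve a hostname and report whether its address falls inside a 'NZ' prefix."""
    try:
        address = ipaddress.ip_address(socket.gethostbyname(hostname))
    except (socket.gaierror, ValueError):
        return False  # treat unresolvable hosts as offshore for this sketch
    return any(address in prefix for prefix in NZ_PREFIXES)

# Crawl NZ-hosted sites from a local harvester and everything else from offshore.
for host in ("example.co.nz", "example.net.nz"):
    print(host, "->", "nz-harvester" if is_nz_hosted(host) else "offshore-harvester")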
On Mon, 27 Oct 2008, Russell Sharpe wrote:
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not?
If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\
If you were serious about trolling you would of course be looking at sections 250 (2) and 252 (1) of the Crimes Act [1] and trying to work out if certain people at the library are liable for imprisonment for a term "not exceeding 2 years" or "not exceeding 7 years". This email does not constitute legal advice :) [1] - http://www.legislation.govt.nz/ -- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
Part 10 (comprising sections 217 to 305) was substituted by a new Part 10 (comprising sections 217 to 272), as from 1 October 2003, by section 15 Crimes Amendment Act 2003 (2003 No 39). I haven't looked at these.. Yet... Russell Sharpe -----Original Message----- From: Simon Lyall [mailto:simon(a)darkmere.gen.nz] Sent: Monday, 27 October 2008 21:25 To: nznog Subject: Re: [nznog] NLNZHarvester2008 On Mon, 27 Oct 2008, Russell Sharpe wrote:
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not?
If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\
If you were serious about trolling you would of course be looking at sections 250 (2) and 252 (1) of the Crimes Act [1] and trying to work out if certain people at the library are liable for imprisonment for a term "not exceeding 2 years" or "not exceeding 7 years". This email does not constitute legal advice :) [1] - http://www.legislation.govt.nz/ -- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system. "(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access. "(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency- "(a) under the execution of an interception warrantor search warrant; or "(b) under the authority of any Act or rule of the common law. That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law. That's what we need to look for.. Anyone??? Russell Sharpe -----Original Message----- From: Simon Lyall [mailto:simon(a)darkmere.gen.nz] Sent: Monday, 27 October 2008 21:25 To: nznog Subject: Re: [nznog] NLNZHarvester2008 On Mon, 27 Oct 2008, Russell Sharpe wrote:
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not?
If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\
If you were serious about trolling you would of course be looking at sections 250 (2) and 252 (1) of the Crimes Act [1] and trying to work out if certain people at the library are liable for imprisonment for a term "not exceeding 2 years" or "not exceeding 7 years". This email does not constitute legal advice :) [1] - http://www.legislation.govt.nz/ -- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
Section 3 seems pretty limited in its application and specifies a law enforcement agency - as Russell states, NatLib is not a law enforcement agency, therefore it is not covered by this clause nor its subsections.

Ian Cousins (I'm not a lawyer either!)

Russell Sharpe wrote:
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system.
"(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access.
"(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency-
"(a) under the execution of an interception warrantor search warrant; or
"(b) under the authority of any Act or rule of the common law.
That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law.
That's what we need to look for.. Anyone???
Russell Sharpe
-----Original Message----- From: Simon Lyall [mailto:simon(a)darkmere.gen.nz] Sent: Monday, 27 October 2008 21:25 To: nznog Subject: Re: [nznog] NLNZHarvester2008
On Mon, 27 Oct 2008, Russell Sharpe wrote:
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not?
If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\
If you were serious about trolling you would of course be looking at sections 250 (2) and 252 (1) of the Crimes Act [1] and trying to work out if certain people at the library are liable for imprisonment for a term "not exceeding 2 years" or "not exceeding 7 years".
This email does not constitute legal advice :)
[1] - http://www.legislation.govt.nz/
-- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
On Mon, 2008-10-27 at 22:10 +1300, Russell Sharpe wrote:
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system.
"(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access.
"(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency-
"(a) under the execution of an interception warrantor search warrant; or
"(b) under the authority of any Act or rule of the common law.
That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law.
That's what we need to look for.. Anyone???
Doesn't (2) apply? If it's a public website, everyone has access (is authorised), so they're merely accessing it for a different purpose. IANAL also :-) Richard
Hi, I suggest you boys join some sort of law list, this has no operational value, please take it offlist. Thanks, Patrick On 27/10/2008, at 10:28 PM, Richard Hector wrote:
On Mon, 2008-10-27 at 22:10 +1300, Russell Sharpe wrote:
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system.
"(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access.
"(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency-
"(a) under the execution of an interception warrantor search warrant; or
"(b) under the authority of any Act or rule of the common law.
That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law.
That's what we need to look for.. Anyone???
Doesn't (2) apply? If it's a public website, everyone has access (is authorised), so they're merely accessing it for a different purpose.
IANAL also :-)
Richard
I haven't seen any mention of beer or networking in the last few posts. Stick a fork in it, I think this thread is done.

Dean

Patrick Jordan-Smith wrote:
Hi,
I suggest you boys join some sort of law list, this has no operational value, please take it offlist.
Thanks, Patrick
On 27/10/2008, at 10:28 PM, Richard Hector wrote:
On Mon, 2008-10-27 at 22:10 +1300, Russell Sharpe wrote:
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system.
"(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access.
"(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency-
"(a) under the execution of an interception warrantor search warrant; or
"(b) under the authority of any Act or rule of the common law.
That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law.
That's what we need to look for.. Anyone???
Doesn't (2) apply? If it's a public website, everyone has access (is authorised), so they're merely accessing it for a different purpose.
IANAL also :-)
Richard
One would argue that by sticking something online / making something available to the public, you implicitly give permission to all and sundry to access the information.

Hate to play devil's advocate, but if you're "charged by the meg" and you don't have alerts set up over certain thresholds, then if you get a bill that slams you, it's your own fault. robots.txt has (AFAIK) always been "optional"; if you don't want someone to access something, filters and restrictions are available to you.

Posturing and blowing wind isn't going to change anything; chalk it up to a lesson. If you got hammered with a nasty bill, reconsider your hosting options.

Russell Sharpe wrote:
"252 Accessing computer system without authorisation"(1) Every one is liable to imprisonment for a term not exceeding 2 years who intentionally accesses, directly or indirectly, any computer system without authorisation, knowing that he or she is not authorised to access that computer system, or being reckless as to whether or not he or she is authorised to access that computer system.
"(2) To avoid doubt, subsection (1) does not apply if a person who is authorised to access a computer system accesses that computer system for a purpose other than the one for which that person was given access.
"(3) To avoid doubt, subsection (1) does not apply if access to a computer system is gained by a law enforcement agency-
"(a) under the execution of an interception warrantor search warrant; or
"(b) under the authority of any Act or rule of the common law.
That is it.. Did any get any request for authourisation??? Natlib is not a law enforcement agency... But under "(b) under the authority of any Act or rule of the common law.
That's what we need to look for.. Anyone???
Russell Sharpe
-----Original Message----- From: Simon Lyall [mailto:simon(a)darkmere.gen.nz] Sent: Monday, 27 October 2008 21:25 To: nznog Subject: Re: [nznog] NLNZHarvester2008
On Mon, 27 Oct 2008, Russell Sharpe wrote:
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not?
If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\
If you were serious about trolling you would of course be looking at sections 250 (2) and 252 (1) of the Crimes Act [1] and trying to work out if certain people at the library are liable for imprisonment for a term "not exceeding 2 years" or "not exceeding 7 years".
This email does not constitute legal advice :)
[1] - http://www.legislation.govt.nz/
-- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
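The point above that robots.txt has "always been optional" is worth spelling out: the file only has any effect when the client chooses to consult it. A minimal sketch using Python's standard urllib.robotparser, with a hypothetical site and the harvest's name from the subject line standing in for a user-agent string:

import urllib.robotparser

# robots.txt is advisory: it only takes effect when the client decides to check it.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.co.nz/robots.txt")  # hypothetical site
rp.read()

url = "http://www.example.co.nz/mirror/big-file.iso"  # hypothetical large mirrored file
if rp.can_fetch("NLNZHarvester2008", url):
    print("a crawler that opts in would fetch", url)
else:
    print("a crawler that opts in would skip", url)

A crawler that never calls can_fetch (or deliberately ignores its answer) sees no difference at all, which is exactly the situation being complained about in this thread.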
[ my own personal opinion, and quite probably not that of my employers ]
Leon Strong wrote:
One would argue that by sticking something online / making something available to the public, you implicitly give permission to all and sundry to access the information.
That's what the lawyers would say, the judge would agree, and the case would be thrown out. A website is an implicit invitation to view. If a person with a browser and no login credentials can access the content, you've made it available to the world. People suggesting that maybe NLNZ has breached the Crimes Act by crawling while ignoring robots.txt (as opposed to finding ways around authorisation restrictions) have a *VERY* tenuous grip on reality.

--
Matthew Poole
"Don't use force. Get a bigger hammer."
Subscribers to this list should please note paragraph seven of the Acceptable Use Policy:

7. Postings of a political, philosophical or legal nature are discouraged.

Discussion of the role of robots.txt is appropriate to this list. Would-be legal opinions are not, especially from people who are no more lawyers than I am.

- Donald Neal
NZNOG List Administrator

--
Donald Neal | "Go to the bloody lectures! ... if you
Research Officer | don't turn up, you will get further behind
WAND | ... why lay out the cash and throw away the
The University of Waikato | product?" - Prof. Steve Jones, UCL
Apologies Donald... Should have included that. IANMALTDN is my new default... ;-)

But surely this is an issue that will have operational impact on network operators?

Cheers
Paul

-----Original Message-----
From: Donald Neal [mailto:dmneal(a)wand.net.nz]
Sent: Tuesday, 28 October 2008 9:55 a.m.
To: 'nznog'
Subject: Re: [nznog] NLNZHarvester2008

Subscribers to this list should please note paragraph seven of the Acceptable Use Policy:

7. Postings of a political, philosophical or legal nature are discouraged.

Discussion of the role of robots.txt is appropriate to this list. Would-be legal opinions are not, especially from people who are no more lawyers than I am.

- Donald Neal
NZNOG List Administrator

Unless otherwise stated, any views or opinions expressed are solely those of the author and do not represent those of Vodafone New Zealand Limited.
I would be curious who has actually tracked this traffic (themselves or via upstream ISP) and come up with a cost to their organisation? Are they invoicing natlib? if not.. why not? If I were to do this sort of trolling.. personally, I would resolve addresses to ip's and sort out what is local NZ IP's and off shore and troll from the appropriate source... as much as I feel the ignorance of robots.txt is against the goodwill and security of the big bad world :\

________________________________

I do find it quite amusing how many people in this country seem to have such a simplistic view of contracts that they think they can randomly invoice anyone who may have caused them trouble for their time/expenses. You can invoice people with whom you have a service/supply contract; for anyone else your only recourse is court action. While the actions of NATLIB in this case may or may not be questionable given the ignoring of the robots.txt file, this does not give anyone the right to think that they can invoice them for accessing their publicly available website.

If YOU as a web hoster have decided to serve websites, the basis on which you have contracted your bandwidth is YOUR problem. If you want to be protected against unforeseen spikes in traffic, get flat-rate hosting, not data-charged hosting. Frankly, I have always thought hosting websites on a data-charged basis is a very risky and short-sighted option; anyone in the world is quite entitled to drag whatever traffic they want off your site as much as they wish and cost you money. It makes no difference who or what is sucking traffic off your website: if you have chosen to host websites on a data-charged basis, that's your choice and you need to live with the consequences. If you want peace of mind, get a flat-rate option (yes, they are available in NZ); otherwise stop whinging.

My 2c
On 28/10/2008, at 8:02 AM, Tony Wicks wrote:
[Words] ... stop winging.
Best bit of advice yet in this thread. Natlib could obviously have handled this in a smoover manner[1], but seriously, people are getting a bit carried away here.

Robots.txt is not an authentication mechanism, and threatening to sue/invoice people who view information from your publicly accessible website (regardless of whether they did so over a domestic or international link), when the whole point of putting that information on said public website and then advertising it domestically and internationally is to present that very information for people to view... complaining about it when someone does that seems a bit precious.

I think your reaction should maybe be to roll your eyes, possibly even mutter something like "Dammit!", and maybe then reconsider your hosting options if someone doing one complete scan of your site can influence your charges so much that it hurts.

JSR

[1] They could have said things like "Damn, girl!" and "Aw-w Yeah-h-h!" and maybe worn a cool hat.

--
John S Russell
Big Geek. Doing Geek Stuff.
John Russell wrote:
On 28/10/2008, at 8:02 AM, Tony Wicks wrote:
[Words] ... stop winging.
Best bit of advice yet in this thread.
http://www.youtube.com/watch?v=mh6pZQX22CQ Warning: may contain language that non-beer drinkers may find offensive. Also as it's a downloaded video some people may find the cost of viewing this excessive.
participants (21)
- Andy Linton
- Brislen, Paul, VF-NZ
- Dean Pemberton
- Don Gould
- Donald Neal
- Gordon Paynter
- John Russell
- Juha Saarinen
- Leon Strong
- Mark Harris
- Matthew Poole
- Nathan Ward
- Nicholas Lee
- Patrick Jordan-Smith
- Philip Seccombe
- Phonenet
- Richard Hector
- Russell Sharpe
- Simon Lyall
- Tony Wicks
- TreeNet Admin