There seem to be a number of issues here. To summarise the ones I can identify so far (48 posts from 20/10/08 4:31 PM to 21/10/08 1:47 PM NZDT):

International Bandwidth Usage: The harvest was initiated from international sites, and as such content owners were forced to pay for international bandwidth to return content. National Library is working with its contractors to discuss the possibility of a New Zealand based harvest site for any future use.

robots.txt: The matter of crawling sites which contain robots.txt files was a hot topic. Subtopics included the fact that robots.txt files would only be honored if the webmaster requested this; many felt such a request should not be necessary, as the files were there for a reason in the first place. It was also noted that robots.txt files are used for a variety of valid purposes.

Scan IPs: There was discussion around the IPs used to mount the scan. These appear to have been unknown to the group until logs were checked, despite the fact that they were provided by National Library on their website.

Lack of notification: There was a general feeling that more of an effort should have been made to notify industry about this harvest. The issues around international bandwidth and robots.txt were cited as reasons an extraordinary effort should have been made in this case. It was also noted that administrators were able to contact the National Library and request that their robots.txt files be honored; this only makes sense if they were aware of the harvest before it began. It was also noted that some smaller content providers 'lurk' on the NZNOG list to receive updates such as this. National Library have undertaken to increase notification on mailing lists (such as NZNOG) for any future harvests.

Missing NZ-only content: Since de-peering, a large amount of New Zealand content is only available from within New Zealand. This content will be missing from the current harvest, as it was conducted from an international source.
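For anyone unsure how robots.txt directives behave in practice: a compliant crawler checks the file before fetching each URL and skips anything disallowed for its user-agent. A minimal sketch using Python's standard-library parser (the user-agent name and site paths are made up for illustration, not taken from the actual harvest):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical site layout, for illustration only).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler that honors robots.txt would skip disallowed paths:
print(parser.can_fetch("ExampleHarvester", "/private/report.html"))  # False
print(parser.can_fetch("ExampleHarvester", "/public/index.html"))    # True
```

This is the behaviour administrators were asking the National Library to respect by default, rather than on request.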
Internet harvest vs real world books: Some discussion occurred around the comparison between collecting Internet content and the obligation of publishers to send copies of works to the National Library. The point was made that although publishers are required by law to deposit works with the Library, they are not required to do so at considerable personal expense (paying international traffic charges rather than local ones).

Ways to combat additional harvests: There was discussion around possible ways to avoid being harvested in the future. These centered around blocking IPs and blocking certain HTTP strings. It was mentioned that the National Library would rather people did not do this, and that contacting them to have a robots.txt file registered would be the preferable option.

Speed of Harvest: It was noted that although the majority of websites are indexed by Google on a fairly regular basis, Google takes a "slow, over time" approach to indexing, whereas the harvest took an "as fast as possible" approach. It was felt that this contributed to an unnecessary impact on some content providers' Internet links.

.nz Domain Names: A question was asked as to how the National Library was able to obtain a list of sites to harvest. The Domain Name Commissioner responded: "I can confirm that the .nz zone file has not been released to the National Library."

Please let me know if I've forgotten anything.

Regards,
Dean
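The blocking approaches discussed (refusing requests by source IP or by a string in the HTTP headers) can be sketched as a simple request filter. The IP address and user-agent substring below are placeholders for illustration, not the harvester's actual details:

```python
# Hypothetical request filter: refuse a request (e.g. with HTTP 403) if it
# comes from a blocked source IP or carries a blocked User-Agent substring.
# The values below are placeholders, not real harvester details.
BLOCKED_IPS = {"192.0.2.10"}  # 192.0.2.0/24 is a documentation-only range
BLOCKED_AGENT_SUBSTRINGS = ("exampleharvester",)

def should_block(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request should be refused."""
    if remote_ip in BLOCKED_IPS:
        return True
    agent = user_agent.lower()
    return any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS)

print(should_block("192.0.2.10", "Mozilla/5.0"))             # True (IP match)
print(should_block("198.51.100.7", "ExampleHarvester/1.0"))  # True (agent match)
print(should_block("198.51.100.7", "Mozilla/5.0"))           # False
```

As the summary notes, the National Library's stated preference was that operators contact them to have a robots.txt file registered rather than block the crawl this way.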