Hi All, This was news to me ('till international traffic on some of my sites began to climb); The National Library is this month ripping the entire NZ Internet ..... from an international host ..... blatantly ignoring robots.txt http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008 International traffic in NZ often costs end customers real money (especially in colo situations). A site I'm involved with, entirely NZ oriented content, served up 2GB to these guys in one day (that's half a normal month of international). Nevermind the sovereignty question; presumably the goal is that by the end of Oct the entire NZ Internet will be mirrored at nlnzdata000.us.archive.org :-/ Anyone else raging about this? Ignoring the robots.txt issue, why couldn't this have been done over national links? Or damn, at the very least they could have pinged off an email to webmaster(a)target.co.nz ahead of ripping the site to let admins get prepared? M.
On Mon, Oct 20, 2008 at 4:31 PM, Murray Fox wrote:
This was news to me ('till international traffic on some of my sites began to climb); The National Library is this month ripping the entire NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
Fail. I think that covers it. Cheers - N
Murray Fox wrote:
Anyone else raging about this? Ignoring the robots.txt issue, why couldn't this have been done over national links? Or damn, at the very least they could have pinged off an email to webmaster(a)target.co.nz ahead of ripping the site to let admins get prepared?
Raging isn't the word - the people behind it were so thoughtful, I can only assume they didn't know that vhosting is rather common. This has caused us some lovely late night/early morning alerts. I also question the legality of archiving a work that may be copyright without seeking the owner's permission. Geraint
Geraint Jones wrote: | I also question the legality of archiving a work that may be copyright | with out seeking the owners permission. | Oh, that's very legal. The Copyright Act 1994 provides exemption for copying pursuant to statutory functions, and the National Library is statutorily required to make archives such as this. There's no grey area here, it's all thoroughly within the law. -- Matthew Poole "Don't use force. Get a bigger hammer."
On Mon, 20 Oct 2008, Matthew Poole wrote:
Oh, that's very legal. The Copyright Act 1994 provides exemption for copying pursuant to statutory functions, and the National Library is statutorily required to make archives such as this. There's no grey area here, it's all thoroughly within the law.
Even better they can probably use Section 25 of the copyright act to claim copyright over their "edition" of your content when they put it online. -- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
The question I have is whether Section 36 of the National Library Act was satisfied:
"Before the Minister notifies a requirement, the Minister must consult the publishers or representatives of the publishers likely to be affected by the proposed requirement about the terms and conditions referred to in section 31(2)(b) or (3)."
Section 31(3) being the one under which the Minister issued a requirement in the Gazette (11 May 2006).
What actions did the Minister take to consult with the publishers likely to be affected?
The Act defines:
"publisher means,-[...] (c) in relation to an internet document, the person who has control over the content of the website, or part of the website, on which the document is located"
My viewpoint would be that a requirement to download all NZ internet documents affects many publishers (even if just because their traffic bill went up); otherwise nobody here would be complaining...
Ricard
On 20/10/2008, at 4:31 PM, Murray Fox wrote:
Hi All,
This was news to me ('till international traffic on some of my sites began to climb); The National Library is this month ripping the entire NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
http://www.natlib.govt.nz/about-us/current-initiatives/web- harvest-2008
International traffic in NZ often costs end customers real money (especially in colo situations). A site I'm involved with, entirely NZ oriented content, served up 2GB to these guys in one day (that's half a normal month of international).
Nevermind the sovereignty question; presumably the goal is that by the end of Oct the entire NZ Internet will be mirrored at nlnzdata000.us.archive.org :-/ apologies
[quote] The harvested web pages will be stored at the Library, and will eventually be made publicly accessible. [/quote] Presumably it's cheaper for them to access a large amount of NZ content from a server in the US, and/or the archive.org guys are doing the hard yards for them.
Anyone else raging about this? Ignoring the robots.txt issue, why couldn't this have been done over national links? Or damn, at the very least they could have pinged off an email to webmaster(a)target.co.nz ahead of ripping the site to let admins get prepared?
I haven't seen it hit any of my content yet.. perhaps because it's on .com. It'll be interesting, I've got many many GB of data (forums, etc.) But, Google etc. walk my content fairly regularly, so I'm not that concerned. -- Nathan Ward
On 20/10/2008, at 4:52 PM, Criggie wrote:
Nathan Ward wrote:
But, Google etc. walk my content fairly regularly, so I'm not that concerned.
Google obeys robots.txt
It would be very nice if natlib's crawler at least obeyed robots.txt entries that were specifically directed at them. Even nicer if they crawled from an NZ system, seeing as the international traffic will cost some small customers with lots of content real money. I left a message for a contact at natlib, maybe they can at least obey specific robots.txt entries. -- Jasper Bryant-Greene Network Engineer, Unleash ddi: +64 3 978 1222 mob: +64 21 129 9458
Presumably it's cheaper for them to access a large amount of NZ content from a server in the US, and/or the archive.org guys are doing the hard yards for them.
Indeed it probably is; the disregard for the target sites, however, I find reprehensible - the NL want to archive the Internet? Let them foot the bill.
I haven't seen it hit any of my content yet.. perhaps because it's on .com.
It'll be interesting, I've got many many GB of data (forums, etc.)
But, Google etc. walk my content fairly regularly, so I'm not that concerned.
Sure, but at least Google walk where you tell them to. M.
-- Nathan Ward
Murray Fox wrote:
I haven't seen it hit any of my content yet.. perhaps because it's on .com.
It'll be interesting, I've got many many GB of data (forums, etc.)
But, Google etc. walk my content fairly regularly, so I'm not that concerned.
Sure, but at least Google walk where you tell them to.
M.
And Google do it slowly over a few days/weeks - this bot does it as fast as it can.
On 20/10/08 4:31 PM, "Murray Fox" wrote:
Hi All,
This was news to me ('till international traffic on some of my sites began to climb); The National Library is this month ripping the entire NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
Somewhat risky for them surely? One thing robots.txt does is protect crawlers from infinite dynamic content. I'd have thought that at least *some* NZ sites have referer tarpits or recursive redirect blackholes that any sensible crawler would be best to avoid. -- Michael Newbery IP Architect TelstraClear Limited
Somewhat risky for them surely? One thing robots.txt does is protect crawlers from infinite dynamic content. I'd have thought that at least *some* NZ sites have referer tarpits or recursive redirect blackholes that any sensible crawler would be best to avoid.
And if they didn't before, they sure as heck will now. Or at least, they will before NLNZHarvester2009 starts crawling. -- Spiro Harvey Knossos Networks Ltd 021-295-1923 www.knossos.net.nz
On Mon, 2008-10-20 at 16:31 +1300, Murray Fox wrote:
NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
I don't suppose you or anyone else could share any info re: the subnet the requests are coming from or point to this info if it's published somewhere? I presume it'll be something inside one of The Internet Archive's allocations but am curious for specifics if they're available. -- -Michael Fincham System Administrator, Unleash www.unleash.co.nz Phone: 0800 750 250 DDI: 03 978 1223 Mobile: 027 666 4482
Michael Fincham wrote:
On Mon, 2008-10-20 at 16:31 +1300, Murray Fox wrote:
NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
I don't suppose you or anyone else could share any info re: the subnet the requests are coming from or point to this info if it's published somewhere?
I presume it'll be something inside one of The Internet Archive's allocations but am curious for specifics if they're available.
cat vhost-access_log.* | grep web-harvest-2008 | awk '{print $2}' | sort | uniq

149.20.55.4
207.241.232.188

Those are the two we have seen to date
On Mon, 2008-10-20 at 17:23 +1300, Michael Fincham wrote:
On Mon, 2008-10-20 at 16:31 +1300, Murray Fox wrote:
NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
I don't suppose you or anyone else could share any info re: the subnet the requests are coming from or point to this info if it's published somewhere?
I've seen connections so far from these hosts: 207.241.232.188 149.20.55.4 Cheers, M.
UoA had notice that connections would be from:
All of their requests should be originating from either of these two IPs:
149.20.55.4 207.241.232.188
-- =================================================== Sean Davidson Student IT Services Faculty of Science University of Auckland Mobile: +27 021 668 984 Office: +27 09 3737 599 ext 85602 ===================================================
FYI: Message sent to us @ auckland.ac.nz
==============================================================================
This is a heads up that the National Library of NZ has embarked on a web harvest of the .nz domain as part of their legal mandate to collect and preserve NZ’s online documentary heritage. They are outsourcing the work to the Internet Archive (Way Back Machine etc). You can see the latest information here: http://www.natlib.govt.nz/about-us/news/15-october-2008-update-on-web-harves... with more background at: http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how... http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008

Things to note are:
- the crawlers are working in bursts of 500 URLs at a time
- they will not honour the robots.txt protocol
- they won’t harvest password protected content

If you are checking logs, the user agent string that appears should look like this: Mozilla/5.0 (compatible; NLNZHarvester2008 +http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008)

All of their requests should be originating from either of these two IPs: 149.20.55.4 207.241.232.188

The harvest is happening between 7-24 October with subsequent patch crawls until 7 November 2008. If you have any concerns let me know. If there is any impact on service (unlikely, but you never know) please contact the National Library directly at web-harvest-2008(a)natlib.govt.nz and they will stop or modify the harvest as quickly as possible.
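As a rough illustration of sizing the impact, a one-liner along these lines can total the harvester's hits and bytes from a standard Apache "combined" format log (a sketch only: the log path is illustrative, and the byte count sits in field 10 of that format, so adjust for custom LogFormats):

grep 'NLNZHarvester2008' /var/log/apache2/access.log | awk '{n++; b += $10} END {printf "%d requests, %.1f MB\n", n, b/1048576}'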
On Mon, 2008-10-20 at 17:53 +1300, Clark Mills wrote:
FYI: Message sent to us @ auckland.ac.nz
==============================================================================
This is a heads up that the National Library of NZ has embarked on a web harvest of the .nz domain as part of their legal mandate to collect and preserve NZ’s online documentary heritage. They are outsourcing the work to the Internet Archive (Way Back Machine etc).
They are not checking IPv6 sites :-( Some of my websites are not going to be archived for the rest of eternity :-( and they are missing out on my great content and NZ's history. Also, if they don't harvest a website, how do I get harvested? What about subdomains? Or only if they are linked off another website the spider knows about? Also, how do they get the list of websites to harvest? I am pretty sure the DNC don't have permission to give a list of all .nz websites to someone else to harvest their websites? Or do they? Thanks Craig
On 20/10/2008, at 6:22 PM, Craig Whitmore wrote:
They are not checking ipv6 sites :-( Some of my websites are not going to be archived for the rest of eternity :-( and they are missing out on my great content and NZ's history..
Worth suggesting that to them, they probably just haven't thought of it :-)
Also if they don't harvest a website.. how do I get harvested? What about subdomains? or only if they are linked off another website the spider knows about them?
Also how do they get the list of websites to harvest? I am pretty sure the DNC don't have permission to give a list of all .nz websites to someone else to harvest their websites? or do they?
http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs Cheers Jasper
I can confirm that the .nz zone file has not been released to the National Library Regards Debbie Monahan Domain Name Commissioner
On Mon, Oct 20, 2008 at 04:31:43PM +1300, Murray Fox wrote:
Hi All,
This was news to me ('till international traffic on some of my sites began to climb); The National Library is this month ripping the entire NZ Internet ..... from an international host ..... blatantly ignoring robots.txt
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
so - is this a snapshot or is the intent expected to be on-going? --bill
I find it interesting the number of people on this list who've put up content on their web sites and then get upset that someone has downloaded it. If you really don't want it disseminated on the net then don't publish it. Or is the problem that "big brother" is watching you? (:-)
On Tue, 2008-10-21 at 09:51 +1300, Andy Linton wrote:
I find it interesting the number of people on this list who've put up content on their web sites and then get upset that someone has downloaded it.
Probably more the 'flash load' aspects and a general lack of awareness of it happening without time to plan for or deal with it.
If you really don't want it disseminated on the net then don't publish it.
Or is the problem that "big brother" is watching you? (:-)
Now, where is that tinfoil hat? jamie
On Tue, 2008-10-21 at 09:51 +1300, Andy Linton wrote:
I find it interesting the number of people on this list who've put up content on their web sites and then get upset that someone has downloaded it.
I think you're choosing to miss the point(s) :-) This is the National Library of NEW ZEALAND, ripping sites over international links, with blatant disregard for those sites' wishes with regard to robots, and with no direct communication to the site owner / maintainer. The practical upshot of this is that real people running sites in NZ face increased international bandwidth bills this month thanks to the actions of the National Library [1]. Obviously the content is public, and anyone can access it. No one is complaining that their site is being accessed. I'm personally aggrieved that:
a) This wasn't kept within the country and performed over national connectivity (for both ideological and financial reasons)
b) A heads up to the site wasn't sent ahead of time
If you really don't want it disseminated on the net then don't publish it.
Facetious comment, not the point.
Or is the problem that "big brother" is watching you? (:-)
I actually have no issue at all with this initiative, only the manner in which it has been executed. [1] - One thing to remember is that this IS the National Library; an organisation I would have expected to have acted more responsibly. M.
On 21/10/08 10:16, Murray Fox wrote:
I think you're choosing to miss the point(s) :-) This is the National Library of NEW ZEALAND, ripping sites over international links, with blatant disregard for those sites wishes with regard to robots, and with no direct communication to the site owner / maintainer. The practical upshot of this is that real people running sites in NZ face increased international bandwidth bills this month thanks to the actions of the National Library [1]
Obviously the content is public, and anyone can access it. No one is complaining that their site is being accessed. I'm personally aggrieved that:
a) This wasn't kept within the country and performed over national connectivity (for both ideological and financial reasons)
b) A heads up to the site wasn't sent ahead of time
If you really don't want it disseminated on the net then don't publish it.
Facetious comment, not the point.
Or is the problem that "big brother" is watching you? (:-)
I actually have no issue at all with this initiative, only the manner in which it has been executed.
[1] - One thing to remember is that this IS the National Library; an organisation I would have expected to have acted more responsibly.
At the risk of "me too"ing, I find it somewhat amusing that the National Library, as one of the peers at WIX before peering was cool, would arrange for this content to be mirrored by machines that are not only not reachable via the relevant IXPs, but are located at the other end of international transit links. Presumably this mirrored content is going to be hosted by the National Library (domestically) at some point, which will result in all that data being hauled back across the Pacific? As a reasonably large provider of NZ-based hosting services, we've certainly noticed NatLib's mirroring activity as a non-negligible increase on our international bandwidth utilisation over the last few weeks. Yes, there are a million and one things that'll have an impact on your bandwidth consumption. However, I echo Murray's frustration that a New Zealand organisation either wouldn't think about this, or have completely ignored it. -Mike
As a reasonably large provider of NZ-based hosting services, we've certainly noticed NatLib's mirroring activity as a non-negligible increase on our international bandwidth utilisation over the last few weeks. Yes, there are a million and one things that'll have an impact on your bandwidth consumption. However, I echo Murray's frustration that a New Zealand organisation either wouldn't think about this, or have completely ignored it.
Also there is the problem of "NZ Only Content" which the international spiders will not see/get, such as the tvnzondemand and the citylink NZ Anycast content (+ others I am sure). They also note that they will "stop" at a certain point so some large websites will not get mirrored 100%. So what's the point of that? You either want all or nothing. I am 100% sure there is some content people don't want to be spidered. A quick .htaccess to deny their spider (I think this will work):

RewriteEngine On
Options +FollowSymLinks
RewriteCond %{HTTP_USER_AGENT} NLNZHarvester2008
RewriteRule ^.* - [F,L]

(And the spider will get a 403 - hopefully. Note the announced user agent string starts with "Mozilla/5.0 (compatible; NLNZHarvester2008 ...", so the match is left unanchored.) Thanks Craig
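For those who would rather not rely on mod_rewrite, a minimal alternative sketch is to tag the harvester by user-agent and deny it, assuming Apache 2.x with mod_setenvif loaded (the directory path is illustrative, and the same Deny-from-env approach works in a .htaccess file where overrides permit it):

SetEnvIfNoCase User-Agent "NLNZHarvester2008" block_harvester
<Directory "/var/www">
    Order Allow,Deny
    Allow from all
    Deny from env=block_harvester
</Directory>

As with the rewrite rules above, the harvester would see a 403 rather than the content.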
Quote from their site: " If you ignore robots.txt, what's to stop me blocking your crawler's IP address? Nothing. Some webmasters have taken this action, and we're sorry they felt they had to go to these lengths. We are running this harvest with good intentions, and ask that if you have blocked us, you reconsider - for example by allowing the harvester to access your site on the condition that it honours robots.txt. We'd much prefer this outcome to getting nothing from your websites at all. Please remember that this project is about trying to ensure that as much as possible of the social history being enacted on the web today is available to researchers and all New Zealanders in the future. If we don't capture it now, we may not have the chance later. " If you wanted to, you could simply ask them to obey your robots.txt. Brad Pearpoint
-----Original Message----- From: Brad Pearpoint [mailto:Brad(a)advantage.co.nz] Sent: Tuesday, 21 October 2008 11:03 a.m. To: Craig Whitmore Cc: nznog(a)list.waikato.ac.nz Subject: Re: [nznog] NLNZHarvester2008
If you wanted to, you could simply ask them to obey your robots.txt.
We already have, by having a robots.txt file. Shouldn't have to ask twice. Craig Miskell
On Tue, 21 Oct 2008, Miskell, Craig wrote:
We already have, by having a robots.txt file. Shouldn't have to ask twice.
User-agent: *
Disallow: /recruitment

Which I think highlights the problem. Many people have robots.txt files because they have some content they don't want archived by others, other people have load and bandwidth issues. The National Library really has to ignore the first group but at the cost of hitting the second group. -- Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/ "To stay awake all night adds a day to your life" - Stilgar | eMT.
On Tue, 21 Oct 2008 11:33:41 +1300 (NZDT), Simon Lyall wrote:
On Tue, 21 Oct 2008, Miskell, Craig wrote:
We already have, by having a robots.txt file. Shouldn't have to ask twice.
User-agent: * Disallow: /recruitment
Which I think highlights the problem. Many people have robots.txt files because they have some content they don't want archived by others, other people have load and bandwidth issues.
The National Library really has to ignore the first group but at the cost
of hitting the second group.
But surely they could obey robots.txt entries that specifically target them?

User-agent: NLNZHarvester2008
Disallow: /massive-collection-of-high-res-images

-- Jasper Bryant-Greene Network Engineer, Unleash ddi: +64 3 978 1222 mob: +64 21 129 9458
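A sketch of what such a file might look like, assuming the harvester honoured a group aimed at its user-agent (paths are illustrative). Note that under the robots exclusion convention a crawler follows only the most specific matching group, so any general rules have to be repeated in the harvester-specific one:

User-agent: NLNZHarvester2008
Disallow: /massive-collection-of-high-res-images
Disallow: /recruitment

User-agent: *
Disallow: /recruitment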
On 21/10/2008, at 11:38 AM, Jasper Bryant-Greene wrote:
On Tue, 21 Oct 2008 11:33:41 +1300 (NZDT), Simon Lyall wrote:
On Tue, 21 Oct 2008, Miskell, Craig wrote:
We already have, by having a robots.txt file. Shouldn't have to ask twice.
User-agent: * Disallow: /recruitment
Which I think highlights the problem. Many people have robots.txt files because they have some content they don't want archived by others, other people have load and bandwidth issues.
The National Library really has to ignore the first group but at the cost
of hitting the second group.
But surely they could obey robots.txt entries that specifically target them?
User-agent: NLNZHarvester2008 Disallow: /massive-collection-of-high-res-images
I believe one complaint was a lack of forward notice. -- Nathan Ward
But surely they could obey robots.txt entries that specifically target them?
User-agent: NLNZHarvester2008 Disallow: /massive-collection-of-high-res-images
User-agent: NLNZHarvester2008 Disallow: /my-porn-collection Beer barry
If you wanted to, you could simply ask them to obey your robots.txt.
Which would be entirely in vain after the fact :-/
I was somewhat surprised that despite being on various mailing lists, I didn't hear about NatLib's harvest until well after it had kicked off. I have asked NatLib to honour robots.txt in the case of the sites I personally host, and received a prompt affirmative response from them on the request... I was also advised that they "remain committed to our strategy of ignoring robots.txt unless requested otherwise." I in turn suggested that perhaps NatLib needed to pay attention to NZNOG... to me this discussion should've happened several weeks ago, so that webhosts were better able to plan/respond to the situation. Mark.
On 21/10/2008 11:02 a.m., Brad Pearpoint wrote:
If you wanted to, you could simply ask them to obey your robots.txt.
There is a perfectly functional mechanism in existence which can be used to ask folks with automated crawlers to leave some parts of your site alone. It's called a robots.txt file. What mechanism is in place to ask people to check if they really, really, absolutely, yes even natlib, truly should check the robots.txt? A robots.robots.txt file? I assume they don't actually want 150,000 emails from webmasters detailing what to crawl and what to leave... Some thought has been put into these things by Clever People (TM) - why does the National Library believe they can ignore that body of thought? Gerard
Some thought has been put into these things by Clever People (TM) - why does the National Library believe they can ignore that body of thought?
Because if you publish it on the world wide web, they appear to have a legal mandate to take a copy of it at their pleasure. Discussion on how to stop them downloading the data is not relevant. You are required to provide a copy of your publication without fee, although to be fair a book publisher does not have to courier 2 copies to the Nat Library office in Redwood City, CA. regards Peter Mott Swizzle | Wholesale Hosted Servers +64 21 279 4995 -/-
On 21/10/2008 12:19 p.m., Peter Mott wrote:
Some thought has been put into these things by Clever People (TM) - why does the National Library believe they can ignore that body of thought?
Because if you publish it on the world wide web, they appear to have a legal mandate to take a copy of it at their pleasure.
Discussion on how to stop them downloading the data is not relevant.
You are required to provide a copy of your publication without fee, although to be fair a book publisher does not have to courier 2 copies to the Nat Library office in Redwood City, CA.
I guess that's the real point - if they're committed to using the services of an offshore provider you'd think they'd proxy it through their own international - that way we'd only be providing the content locally. Gerard
Murray Fox wrote:
I think you're choosing to miss the point(s) :-) This is the National Library of NEW ZEALAND, ripping sites over international links, with blatant disregard for those sites wishes with regard to robots, and with no direct communication to the site owner / maintainer. The practical upshot of this is that real people running sites in NZ face increased international bandwidth bills this month thanks to the actions of the National Library [1]
Well I'm not sure I am missing the point. If you published a book in NZ you'd need to provide two copies of it at your cost to them. And that's not unique to NZ. You don't get the option to say it was a bit expensive for us to provide those two copies so we've sent you the cover and the title page. I agree that they could have handled this better, but perhaps the issue of the bandwidth costs is one where some of the anomalies in our bandwidth pricing structure, lack of peering by the controlling duopoly, and availability of skills to do this work mean that doing it this way constitutes the best cost structure for Natlib to get this information that they are now required to attempt to collect. As a taxpayer I'd hope they are taking that approach. Looks like this activity will continue - the question really is how does the community best engage with the library to make it as painless as possible.
On Tue, Oct 21, 2008 at 12:06 PM, Andy Linton wrote:
Well I'm not sure I am missing the point. If you published a book in NZ you'd need to provide two copies of it at your cost to them. And that's not unique to NZ. You don't get the option to say it was a bit expensive for us to provide those two copies so we've sent you the cover and the title page.
But you do get the option to ship the 2 copies across town via the cheapest courier and not using UPS via San Francisco ;) -- cheers, Sid
I'm trying to get some dialog with National Library arranged to discuss this policy with them. ISPANZ have already expressed an interest to be present. I'm talking to InternetNZ as well. Should I manage to get in front of them, I'll present a summary of the points of view expressed on this mailing list. So make sure your views have been expressed on the list by someone. Dean
Looks like this activity will continue - the question really is how does the community best engage with the library to make it as painless as possible.
On Tue, Oct 21, 2008 at 1:39 PM, Dean Pemberton wrote:
So make sure your views have been expressed on the list by someone.
I'm from a small (but growing) Kiwi website that has community generated content etc. I think that there is an additional point that has been missed so far: the web is no longer static. The uncertainty principle begins to apply - by them crawling entire sites they may begin to interact with the content on the sites inadvertently. For example there can be links to flag content as inappropriate. We use robots.txt to prevent crawlers from hitting this kind of link as well as indexing our APIs (which return XML | JSON) and are no use to a crawler (but which they seem to love indexing). The issue for us isn't that they are indexing our site, it is that they are disobeying robots.txt. Even this in itself wouldn't be a problem if they provided a heads up in future and compromised by following robots.txt entries that targeted their user agent. We lurk on this list specifically to be aware of this kind of activity, it is incredibly arrogant of them to undertake such a massive project knowing that they will cause headaches (by not following robots.txt) without consulting such an obvious place as NZNOG. Cheers, Alex
On Tue, 21 Oct 2008 14:29:44 +1400, "Alex Hague" wrote:
I'm from a small (but growing) Kiwi website that has community generated content etc. I think that there is an additional point that has been missed so far: the web is no longer static. The uncertainty principal begins to apply - by them crawling entire sites they may begin to interact with the content on the sites inadvertently.
As far as I'm aware, the big crawlers don't perform POST, PUT, DELETE queries. Seeing as HTTP requires GET to be idempotent, and not take any action other than retrieval, crawlers won't "interact" with well-designed websites if by "interact" you mean "change stuff".
For example there can be links to flag content as inappropriate. We use robots.txt to prevent crawlers from hitting this kind of link as well as indexing our APIs (which return XML | JSON) and are no use to a crawler (but which they seem to love indexing).
If the APIs return an appropriate Content-Type and the crawlers still retrieve them, then the crawlers are either genuinely interested in indexing the content retrieved by those APIs, or they're buggy and you should report the issue. Cheers, -- Jasper Bryant-Greene Network Engineer, Unleash ddi: +64 3 978 1222 mob: +64 21 129 9458
Seeing as HTTP requires GET to be idempotent, and not take any action other than retrieval, crawlers won't "interact" with well-designed websites if by "interact" you mean "change stuff".
The RFC uses SHOULD NOT rather than MUST NOT; the consequences of flagging content as inappropriate are *safe* which is the gist of section 9.1.1, but can be annoying if something comes along and flags all your content as inappropriate. This annoyance is an acceptable outcome when the risk has been mitigated by implementing robots.txt (which I recognize is not a standard, but is so widely adopted I wouldn't expect trouble from a place like NLNZ). As for the GET requests to links such as flagging content being idempotent, no one has said that they aren't - in the context of section 9.1.2 of the RFC, idempotent means that multiple identical requests have no greater side effect than the original request.
If the APIs return an appropriate Content-Type and the crawlers still retrieve them, then the crawlers are either genuinely interested in indexing the content retrieved by those APIs, or they're buggy and you should report the issue.
Just because a crawler wants API content doesn't mean that site owners want the crawler to have it. Our APIs return the correct content-type header, yet we started finding XML used to create graphs appearing in search results for us in Google. Cheers, Alex
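For indexers that do honour such signals (Google documents support for the X-Robots-Tag response header), one hedged option is to mark API responses as non-indexable without blocking retrieval; a sketch assuming Apache with mod_headers and an illustrative /api/ path:

<LocationMatch "^/api/">
    Header set X-Robots-Tag "noindex, nofollow"
</LocationMatch>

This would not deter a crawler that ignores robots conventions entirely, but it keeps the XML/JSON out of results for the crawlers that do.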
Jasper Bryant-Greene wrote:
On Tue, 21 Oct 2008 14:29:44 +1400, "Alex Hague" wrote:
I'm from a small (but growing) Kiwi website that has community generated content etc. I think that there is an additional point that has been missed so far: the web is no longer static. The uncertainty principal begins to apply - by them crawling entire sites they may begin to interact with the content on the sites inadvertently.
As far as I'm aware, the big crawlers don't perform POST, PUT, DELETE queries.
Seeing as HTTP requires GET to be idempotent, and not take any action other than retrieval, crawlers won't "interact" with well-designed websites if by "interact" you mean "change stuff".
Web crawlers can wander around grabbing difficult to generate dynamic content repeatedly. This content may be generated from queries from a database and require quite a bit of CPU and/or memory to convert to a form that's usable in a browser. While a user grabs a few pages of this slow dynamic content to answer whatever question they may have, because there may be potentially infinite ways of presenting this data, the crawler may inadvertently start using up a lot of resources in the form of RAM and CPU. Often caching is applied to queries, since multiple people generally end up making similar, or even the exact same query. However a bot generating millions of these queries can quickly fill up caches forcing expiry of "old" (but not yet stale) entries from the cache. (Or if the code wasn't written sufficiently well, filling up the disk the cache is stored on). Usually you protect such content with robots.txt, and robots meta headers telling robots to stay away. While my sites as currently set up can, I believe, handle this kind of load, in the past this has been a problem. Keeping robots out of an area may be for the robots' protection, not to try and "hide" the content. (Although I agree that people making sites that can change state with a GET are asking for trouble.)
On 21/10/2008, at 4:30 PM, Perry Lorier wrote:
Often caching is applied to queries, since multiple people generally end up making similar, or even the exact same query. However a bot generating millions of these queries can quickly fill up caches forcing expiry of "old" (but not yet stale) entries from the cache. (Or if the code wasn't written sufficiently well, filling up the disk the cache is stored on).
If you read the FAQ from the National Library, it does say: "In practical terms, this means webmasters can expect the harvester to work in bursts, taking 100 URLs from each website before moving on the next. Eventually the harvester will cycle back around to collect the next 100 URLs from the site. The exceptions to this are Government, Research, and Maori sites (.govt.nz, .ac.nz, .cri.nz and .maori.nz) where we harvest 500 URLs at a time." Which means you can expect to only see 100 pages requested at a time then some time for your 286 to recover before the next 100 requests comes along. This should resolve any worries about the crawler crapping all over the performance of your site(s). Cheers, Patrick
If you read the FAQ from the National Library, it does say:
"In practical terms, this means webmasters can expect the harvester to work in bursts, taking 100 URLs from each website before moving on the next. Eventually the harvester will cycle back around to collect the next 100 URLs from the site. The exceptions to this are Government, Research, and Maori sites (.govt.nz, .ac.nz, .cri.nz and .maori.nz) where we harvest 500 URLs at a time."
Which means you can expect to only see 100 pages requested at a time then some time for your 286 to recover before the next 100 requests comes along.
This should resolve any worries about the crawler crapping all over the performance of your site(s).
Cheers, Patrick
That ignores the fact that you may have 1000s of domains on one box - as we do, where 80% of our domains are accessed once a year and we have 4 or 5 really active domains on a box. When the crawler comes and grabs 30 domains on one box it's suddenly doing 3000 requests - oh, and when it says pages, that means a page; the bot also grabs any images for that page. Hopefully you can see how this would cause an impact on shared hosting providers. G
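For hosts worried about burst load rather than the archiving itself, one rough mitigation sketch is to rate-limit new HTTP connections from the two harvester IPs given in the NatLib notice, assuming iptables with the state and limit modules (the limits, and an existing rule accepting ESTABLISHED traffic, are illustrative):

iptables -A INPUT -p tcp --dport 80 -s 207.241.232.188 -m state --state NEW -m limit --limit 2/second --limit-burst 10 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -s 207.241.232.188 -m state --state NEW -j DROP
# repeat the same pair of rules for 149.20.55.4

This slows the crawler down without blocking it outright, at the cost of some dropped connection attempts during bursts.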
one wonders... does the right hand know what the left hand is doing? how does one reconcile off-shoring all NZ content with this? http://www.stuff.co.nz/4732714a28.html --bill
Alex Hague wrote:
The issue for us isn't that they are indexing our site, it is that they are disobeying robots.txt. Even this in itself wouldn't be a problem if they provided a heads up in future and compromised by following robots.txt entries that targeted their user agent.
And that is entirely the crux. It's more the sheer arrogance and lack of thought to implementation and the community that is causing people grief. Some respect towards the people involved shouldn't be too much to ask for when the avenues are so readily available. p.s. Long time, no see, Mr. Hague. -- Drew Calcott Science IT University of Auckland (p) +64 9 373 7599 x84269
Looks like this activity will continue - the question really is how does the community best engage with the library to make it as painless as possible.
i've asked archive.org (the contractor who is doing this scan for nlnz) to contact me next time they are planning a run of this kind, so that we can arrange for a local (auckland) web proxy with good local peering. they've agreed in principle. i'm presuming that the "controlling duopoly" will also agree.
Paul Vixie wrote:
arrange for a local (auckland) web proxy with good local peering. they've agreed in principle. i'm presuming that the "controlling duopoly" will also agree.
That's presuming a lot. Telecom might, since they seem to be getting back on the peering bandwagon, but TCL are still distinctly reluctant to peer[1] with anyone. There's that whole tension between selling international links for a whopping profit and being good internet citizens and supporting domestic peering. [1] For values of peering that equal "you pay for your circuit to the exchange and I'll pay for mine." -- Matthew Poole "Don't use force. Get a bigger hammer."
arrange for a local (auckland) web proxy with good local peering. they've agreed in principle. i'm presuming that the "controlling duopoly" will also agree.
That's presuming a lot. Telecom might, since they seem to be getting back on the peering bandwagon, but TCL are still distinctly reluctant to peer[1] with anyone.
well then web sites inside non-peering ISP's will have to get harvested from california. a situation where peering more would save them money may be what's best for everyone concerned.
That's presuming a lot. Telecom might, since they seem to be getting back on the peering bandwagon, but TCL are still distinctly reluctant to peer[1] with anyone.
well then web sites inside non-peering ISP's will have to get harvested from california. a situation where peering more would save them money may be what's best for everyone concerned.
I don't see a lot wrong with that. A lot of heavy traffic stuff is done at the major IXPs and from an offshore network. Those who peer get the performance benefits and savings. That's _normal_. Erin Salmon
There seem to be a number of issues here. To summarise the ones I can identify so far (48 posts from 20/10/08 4:31 PM to 21/10/08 1:47 PM NZDT):

International Bandwidth Usage: The harvest was initiated from international sites and as such content owners were forced to pay for international bandwidth to return content. National Library is working with its contractors to discuss the possibility of a New Zealand based harvest site for any future use.

robots.txt: The matter of crawling sites which contain robots.txt files was a hot topic. Subtopics around this were the fact that they would be honored if the webmaster requested this. It was not seen that this was necessary, as they were there for a reason in the first place. It was also noted that robots.txt files were used for different valid purposes.

Scan IPs: There was discussion around the IPs used to mount the scan. These seem to have been unknown to the group until logs were checked. This is in spite of the fact that they were provided by National Library on their website.

Lack of notification: There seems to be a general feeling that more of an effort should have been made to notify industry about this harvest. The issues around International Bandwidth and robots.txt were cited as reasons an extraordinary effort should have been made in this case. It was also noted that administrators were able to contact the National Library and request that their robots.txt files be honored. This only makes sense if they were aware of the harvest before it began. It was also noted that some smaller content providers 'lurk' on the NZNOG list to receive updates such as this. National Library have undertaken to increase notification in mailing lists (such as NZNOG) in any future harvests.

Missing NZ only content: Since de-peering, a large amount of New Zealand content is only available from within New Zealand. This content will be missing from the current harvest as it was conducted from an international source.

Internet harvest vs real world books: Some discussion occurred around the comparison between collecting Internet content and the obligation of publishers to send copies of works to the National Library. The point was made that even though publishers are required by law to deposit works in the Library, they are not required to do this at considerable personal expense (paying international traffic charges rather than local).

Ways to combat additional harvests: There was discussion around possible ways to avoid being harvested in the future. These centered around blocking IPs and blocking certain HTTP strings. It was mentioned that the National Library would rather people did not do this and that contacting them to have a robots.txt file registered would be a preferable option.

Speed of Harvest: It was noted that although the majority of website owners are indexed by Google on a fairly regular basis, Google takes a "slow, over time" approach to indexing. The Harvest took an "as fast as possible" approach. It was felt that this contributed to an unnecessary impact on some content providers' internet links.

.nz Domain Names: A question was asked as to how the National Library was able to obtain a list of sites to harvest. The Domain Name Commissioner responded with "I can confirm that the .nz zone file has not been released to the National Library"

Please let me know if I've forgotten anything. Regards, Dean
participants (35):
- Alex Hague
- Andy Linton
- Barry Murphy
- bmanning@vacation.karoshi.com
- Brad Pearpoint
- Clark Mills
- Craig Whitmore
- Criggie
- Dean Pemberton
- Debbie Monahan
- Drew Calcott
- Erin Salmon
- Geraint Jones
- Gerard Creamer
- Jamie Baddeley
- Jasper Bryant-Greene
- Jasper Bryant-Greene
- Mark Foster
- Matthew Poole
- Michael Fincham
- Michael Jager
- Michael Newbery
- Miskell, Craig
- Murray Fox
- Nathan Ward
- Neil Gardner
- Patrick Jordan-Smith
- Paul Vixie
- Perry Lorier
- Peter Mott
- Ricard Kelly
- Sean Davidson
- Sid Jones
- Simon Lyall
- Spiro Harvey