Fwd: Notification of web harvest & consultation report
Hi all: Many of you were affected by the 2008 web harvest, or have expressed an interest in the 2010 harvest, so I am sending you the message below about the outcome of our recent consultation on the Options paper on the 2010 whole-of-domain web harvest.

The decisions on the key issues raised are:

Notification
- The harvest is scheduled to begin on 12 May 2010. There will be a five-week notification period.
- The Library will use several channels to communicate about the harvest, including its corporate website, the LibraryTechNZ blog, a Twitter account, various mailing lists and forums, and media releases.

Robots.txt
- In 2008 the Library made the decision to ignore the robots.txt convention.
- For the 2010 harvest, where a robots.txt file exists the harvester will honour it, except when downloading images and other elements that are embedded in other web pages.
- Website owners can set specific rules for the Library's harvester, which will have the user agent string: NLNZHarvester2010
- If you have a very restrictive robots.txt file in place already, we would appreciate it if you could provide a more permissive rule for NLNZHarvester2010 to help us capture a complete copy of your website.

Location of harvester
- After consultation with New Zealand telecom vendors we have decided to run the harvest from the United States using the Internet Archive's hardware and network infrastructure, as we did in 2008.

More information about these decisions is available from our website:
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2010
http://www.natlib.govt.nz/catalogues/library-documents/web-harvest-consultat...

Thanks to all of you who have offered us advice in various fora over the last few months (and years).

Gordon

......................
New Zealand web harvest 2010
More information at http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2010
This account is run by Courtney Johnston (Web Manager) and Gordon Paynter (Programme Manager Digitisation)
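[For illustration: a site owner whose robots.txt currently locks out all crawlers can add a specific, more permissive group for the published user-agent string, as the message above requests. A sketch only; the blanket block on other robots is an assumed starting point, not part of the Library's instructions.]

    # robots.txt at the site root
    # Let the National Library harvester take a complete copy
    # (an empty Disallow means "nothing is disallowed")
    User-agent: NLNZHarvester2010
    Disallow:

    # ...while continuing to exclude all other robots
    User-agent: *
    Disallow: /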
On 8/04/2010, at 4:30 PM, Gordon Paynter wrote:
- After consultation with New Zealand telecom vendors we have decided to run the harvest from the United States using the Internet Archive’s hardware and network infrastructure, as we did in 2008.
Can you elaborate on this? -- Nathan Ward
On 8/04/2010, at 4:30 PM, Gordon Paynter wrote:
- After consultation with New Zealand telecom vendors we have decided to run the harvest from the United States using the Internet Archive's hardware and network infrastructure, as we did in 2008.
Can you elaborate on this?
I think you will find it elaborated on in the document at the URL linked in the original post: http://www.natlib.govt.nz/catalogues/library-documents/web-harvest-consultat...

"The next best option, hosting with a telecommunications provider in New Zealand, was not viable for the 2010 harvest on grounds of value-for-money and increased technical complexity and risk. Similar issues arise with the possibility of hosting at the Library itself, or routing from the USA via New Zealand."
On 8/04/2010, at 9:42 PM, Regan Murphy wrote:
On 8/04/2010, at 4:30 PM, Gordon Paynter wrote:
- After consultation with New Zealand telecom vendors we have decided to run the harvest from the United States using the Internet Archive's hardware and network infrastructure, as we did in 2008.
Can you elaborate on this?
I think you will find it elaborated on in the document at the URL linked in the original post: http://www.natlib.govt.nz/catalogues/library-documents/web-harvest-consultat...
"The next best option, hosting with a telecommunications provider in New Zealand, was not viable for the 2010 harvest on grounds of value-for-money and increased technical complexity and risk. Similar issues arise with the possibility of hosting at the Library itself, or routin from the USA via New Zealand."
Yeah, someone linked me to that.

So essentially the argument is: we don't want to pay a small amount for it, so we'll push that (larger) cost on to NZ businesses instead? Was there even any research done into finding out what the cost would be to NZ businesses? Should a govt. thing like natlib care about that sort of thing?

Last I looked at such things, the public rate card for colo is 1000/mbit for international capacity. Let's assume most colo customers don't really know how to negotiate that down and are paying that. Most of my customers whose kit I look after are (or were before I came along :-). Domestic is what, 100/mbit? Plus I'm sure they'd just build into APE to reduce their costs a bunch more.

If they don't want to pay for hardware or have to administer it, why don't they do a deal with a certain provider that gives away outbound international transit (and also builds into as many .nz exchanges as possible) and build a tunnel or HTTP proxy so they reach hosts in NZ over that? It really isn't hard to make this stuff cheaper for everyone.

I haven't seen any appealing to the NZNOG list for ideas[1] on how to do this stuff better, and I'm *sure* we'd all have a load[2]. Sure, some might be rubbish, but I'm sure you'd get at least a few clever ideas and likely even some people offering to donate things.

-- Nathan Ward

[1] Not that there hasn't been any, but I can't see any after a quick search of my local archive either.
[2] Mail.app thinks I mis-spelled this word. Onya, Mail.app.
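[For rough scale, a sketch using the rate-card figures Nathan quotes above, which are his estimates rather than published prices. If transit is billed per Mbit/s, the marginal cost of crawler traffic depends on sustained rate, not total bytes; the 1 Mbit/s crawl figure below is a purely hypothetical assumption.]

    # Sketch: what a sustained crawl moves, and its marginal cost, under the
    # quoted rate card. All figures are illustrative assumptions.
    INTL_NZD_PER_MBPS = 1000    # international transit, NZD per Mbit/s per month
    DOM_NZD_PER_MBPS = 100      # domestic transit, NZD per Mbit/s per month

    SECONDS_PER_MONTH = 30 * 24 * 3600
    crawl_rate_mbps = 1.0       # hypothetical sustained crawler draw

    # Mbit/s * seconds = Mbit; /8 = MByte; /1000 = GByte
    gigabytes_moved = crawl_rate_mbps * SECONDS_PER_MONTH / 8 / 1000
    print(f"~{gigabytes_moved:.0f} GB transferred at {crawl_rate_mbps} Mbit/s sustained")
    print(f"marginal cost: ~${crawl_rate_mbps * INTL_NZD_PER_MBPS:.0f} international "
          f"vs ~${crawl_rate_mbps * DOM_NZD_PER_MBPS:.0f} domestic per month")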
So essentially the argument is: we don't want to pay a small amount for it, so we'll push that (larger) cost on to NZ businesses instead? Was there even any research done into finding out what the cost would be to NZ businesses? Should a govt. thing like natlib care about that sort of thing?
Last I looked at such things, the public rate card for colo is 1000/mbit for international capacity. Let's assume most colo customers don't really know how to negotiate that down and are paying that. Most of my customers whose kit I look after are (or were before I came along :-). Domestic is what, 100/mbit?
Is the data cost to site owners really such a big issue? More than 97.5% of the harvested sites had less than 100MB of data downloaded, and only 477 sites had more than 1GB downloaded. Of the larger sites, I wonder how many are paying per MByte transferred instead of per Mbit/s of capacity.

Refer to the table from the original options paper linked at http://bit.ly/nlnzwebharvest :

Data downloaded    Number of hosts    Percent of hosts
< 1MB              322,951            81.3%
1 to 10MB          43,226             10.9%
10 to 100MB        22,082             5.6%
100 to 1000MB      8,365              2.1%
1 to 10 GB         455                0.1%
10 to 100 GB       22                 0.006%
Total              397,101            100%

-- Regan
Regan Murphy wrote:
So essentially the argument is: we don't want to pay a small amount for it, so we'll push that (larger) cost on to NZ businesses instead? Was there even any research done into finding out what the cost would be to NZ businesses? Should a govt. thing like natlib care about that sort of thing?
Last I looked at such things, the public rate card for colo is 1000/mbit for international capacity. Let's assume most colo customers don't really know how to negotiate that down and are paying that. Most of my customers whose kit I look after are (or were before I came along :-). Domestic is what, 100/mbit?
Is the data cost to site owners really such a big issue? More than 97.5% of the harvested sites had less than 100MB of data downloaded, and only 477 sites had more than 1GB downloaded. Of the larger sites, I wonder how many are paying per MByte transferred instead of per Mbit/s of capacity.

Refer to the table from the original options paper linked at http://bit.ly/nlnzwebharvest :

Data downloaded    Number of hosts    Percent of hosts
< 1MB              322,951            81.3%
1 to 10MB          43,226             10.9%
10 to 100MB        22,082             5.6%
100 to 1000MB      8,365              2.1%
1 to 10 GB         455                0.1%
10 to 100 GB       22                 0.006%
Total              397,101            100%
"really such a big issue?" Well, this was last time... http://treenet.co.nz/natlib.png This graph is taken on traffic to the back-end shard server *behind* a CDN buffer cloud/cluster. It is just one of those 92.1% of servers on a <10MB link. NP: for comparison, the Sept spike is a site replication. I image most web hosts have similar piles of deadweight site data that nobody but robots ever visit. I have crossed fingers that the new harvest will at least do If-Modified-Since on the old URLs with last harvests date on stuff like images? AYJ
Um, if it's not supposed to be publicly reachable, or it's not supposed to be trawled, turn it off / firewall it / use robots.txt, etc.?

It seems, again, there's plenty of warning it's coming; it looks like a silly excuse to bluster, from what I can see. The only thing I'd request from natlib is the IPs / user-agents that will be doing the actual querying, ahead of time.

On 14/04/10 18:28, TreeNet Admin wrote:
"really such a big issue?" Well, this was last time... http://treenet.co.nz/natlib.png
This graph is taken on traffic to the back-end shard server *behind* a CDN buffer cloud/cluster. It is just one of those 92.1% of servers on a <10MB link.
NP: for comparison, the Sept spike is a site replication.
I image most web hosts have similar piles of deadweight site data that nobody but robots ever visit. I have crossed fingers that the new harvest will at least do If-Modified-Since on the old URLs with last harvests date on stuff like images?
AYJ

--
Leon Strong | Technical Engineer
DDI: +64 9 950 2203 | Mobile: +64 21 0202 8870 | Freephone: 0800 SMX SMX (769 769)
SMX Ltd | Business Email Specialists | Level 15, 19 Victoria Street, Auckland, New Zealand | smx.co.nz
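[On Leon's "firewall / robots" point: a host that wants no part of the harvest can also refuse the crawler at the web server rather than relying on robots.txt. A sketch for nginx, keyed to the published user-agent string; it belongs inside a server block, and you would adjust it for your own setup.]

    # Refuse the National Library harvester by user agent
    if ($http_user_agent ~* "NLNZHarvester2010") {
        return 403;
    }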
Hi all: Information about the user agent is available from http://bit.ly/nlnzwebharvest and we don't yet know what IP addresses will be used. We have no plans to use If-Modified-Since (or ETag, or similar approaches) for comparison with the 2008 harvest.

If you have concerns about how the crawler may behave on specific websites, feel free to email us directly at web-harvest-2010@natlib.govt.nz or get in touch via our feedback form.

Thanks, Gordon
>>> Leon Strong 15/04/10 11:51 a.m. >>>
Um, if it's not supposed to be publicly reachable, or it's not supposed to be trawled, turn it off / firewall it / use robots.txt, etc.?

It seems, again, there's plenty of warning it's coming; it looks like a silly excuse to bluster, from what I can see. The only thing I'd request from natlib is the IPs / user-agents that will be doing the actual querying, ahead of time.
participants (5)
- Gordon Paynter
- Leon Strong
- Nathan Ward
- Regan Murphy
- TreeNet Admin