Hi all: Information about the user agent is available from http://bit.ly/nlnzwebharvest and we don't yet know what IP addresses will be used. We have no plans to use If-Modified-Since (or ETag, or similar approaches) for comparison with the 2008 harvest. If you have concerns about how the crawler may behave on specific websites, feel free to email us directly at web-harvest-2010(a)natlib.govt.nz or get in touch via our feedback form. Thanks, Gordon
Leon Strong
15/04/10 11:51 a.m. >>> Um, if it's not supposed to be publicly reachable or it's not supposed to be trawled, turn it off / firewall / robots etc?
It seems, again, there's plenty of warning it's coming, so this seems like a silly excuse to bluster from what I can see. The only thing I'd request from natlib is the IPs / user-agents that will be doing the actual querying, ahead of time.

On 14/04/10 18:28, TreeNet Admin wrote:
Regan Murphy wrote:
So essentially the argument is, we don't want to pay a small amount for it, so we'll push that (larger) cost on to NZ businesses instead? Was there even any research done into finding out what the cost would be to NZ businesses? Should a govt. thing like natlib care about that sort of thing?
Last I looked at such things, public rate card for colo is 1000/mbit for international capacity. Let's assume most colo customers don't really know how to negotiate that down and are paying that. Most of my customers whose kit I look after are (or were before I came along :-). Domestic is what, 100/mbit?
Is the data cost to site owners really such a big issue? More than 97.5% of the harvested sites had less than 100MB of data downloaded and only 477 sites had more than 1GB of data downloaded. Of the larger sites, I wonder how many are paying per MByte instead of per MBps?
Refer to the table from the original options paper linked at http://bit.ly/nlnzwebharvest :
Data downloaded    Number of hosts    Percent of hosts
< 1MB                      322,951               81.3%
1 to 10MB                   43,226               10.9%
10 to 100MB                 22,082                5.6%
100 to 1000MB                8,365                2.1%
1 to 10 GB                     455                0.1%
10 to 100 GB                    22              0.006%
Total                      397,101                100%
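For anyone wanting to check the headline figures against the table, here is a rough Python sketch that recomputes the cumulative shares from the host counts above. The bucket counts are copied from the table; the names `BUCKETS` and `cumulative_shares` are only illustrative and not from the options paper.

```python
# Host counts per download bucket, copied from the options-paper table above.
BUCKETS = [
    ("< 1MB", 322_951),
    ("1 to 10MB", 43_226),
    ("10 to 100MB", 22_082),
    ("100 to 1000MB", 8_365),
    ("1 to 10 GB", 455),
    ("10 to 100 GB", 22),
]


def cumulative_shares(buckets):
    """Return (total_hosts, rows), where each row is
    (label, cumulative_host_count, cumulative_percent_of_hosts)."""
    total = sum(count for _, count in buckets)
    running = 0
    rows = []
    for label, count in buckets:
        running += count
        rows.append((label, running, 100.0 * running / total))
    return total, rows


if __name__ == "__main__":
    total, rows = cumulative_shares(BUCKETS)
    for label, running, pct in rows:
        print(f"up to and including '{label}': {running} hosts ({pct:.2f}%)")
    print(f"total hosts: {total}")
```

The counts add up to the stated total of 397,101, and the two largest buckets together give the 477 hosts over 1GB mentioned above.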
"really such a big issue?" Well, this was last time... http://treenet.co.nz/natlib.png
This graph is taken on traffic to the back-end shard server *behind* a CDN buffer cloud/cluster. It is just one of those 92.1% of servers on a <10MB link.
NP: for comparison, the Sept spike is a site replication.
I imagine most web hosts have similar piles of deadweight site data that nobody but robots ever visits. I have crossed fingers that the new harvest will at least do If-Modified-Since on the old URLs, with the last harvest's date, on stuff like images?
AYJ
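For reference, the If-Modified-Since request being hoped for above would look roughly like the Python sketch below. It is only an illustration of a standard HTTP conditional GET using the standard library, not how the natlib crawler works. The `LAST_HARVEST` date and the function names are made up for the example; the real date of the previous harvest would go there.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical stand-in for the date of the previous harvest (an assumption,
# not a date taken from this thread).
LAST_HARVEST = datetime(2008, 10, 1, tzinfo=timezone.utc)


def build_conditional_request(url, since=LAST_HARVEST):
    """Build a GET request that asks for the resource only if it has
    changed since `since` (which must be a UTC-aware datetime)."""
    http_date = format_datetime(since, usegmt=True)  # HTTP-date format, GMT
    return Request(url, headers={"If-Modified-Since": http_date})


def fetch_if_changed(url, since=LAST_HARVEST):
    """Return the body, or None if the server answers 304 Not Modified."""
    try:
        with urlopen(build_conditional_request(url, since)) as resp:
            return resp.read()
    except HTTPError as err:
        if err.code == 304:  # unchanged since last harvest, no body sent
            return None
        raise
```

A server that honours the header answers 304 with no body for unchanged files, so old images and similar deadweight cost only a request/response header pair rather than a full download.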
--
Leon Strong | Technical Engineer
SMX Ltd | smx.co.nz