Um, if it's not supposed to be publicly reachable, or it's not supposed to be trawled, turn it off / firewall it / use robots.txt etc?
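For the robots.txt option, here is a rough sketch of how a site owner could check what a given rule set would tell a crawler. It uses Python's standard urllib.robotparser. The agent name "ExampleHarvester" and the example.org URLs are placeholders I made up, not the harvester's real User-agent string, and this only shows what a well-behaved robot would conclude, nothing more.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleHarvester" is a placeholder agent name,
# not the real User-agent of any particular crawler.
ROBOTS_TXT = """\
User-agent: ExampleHarvester
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleHarvester", "http://example.org/index.html"),
        ("SomeOtherBot", "http://example.org/index.html"),
        ("SomeOtherBot", "http://example.org/private/report.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Anything that genuinely must not be reachable still wants a firewall rule or authentication, since robots.txt is only a request.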

Again, there's been plenty of warning that it's coming, so this seems like a silly excuse to bluster, from what I can see.

The only thing I'd request from natlib is the IPs / user-agents that will be doing the actual querying, ahead of time.

On 14/04/10 18:28, TreeNet Admin wrote:
Regan Murphy wrote:
So essentially the argument is, we don't want to pay a small amount for it, so we'll push that (larger) cost on to NZ businesses instead?
Was there even any research done into finding out what the cost would be to NZ businesses? Should a govt. thing like natlib care about that sort of thing?

Last I looked at such things, the public rate card for colo is 1000/mbit for international capacity. Let's assume most colo customers don't really know how to negotiate that down and are paying that. Most of my customers whose kit I look after are (or were before I came along :-).
Domestic is what, 100/mbit?

Is the data cost to site owners really such a big issue? More than 97.5% of the harvested sites had less than 100MB of data downloaded, and only 477 sites had more than 1GB of data downloaded. Of the larger sites, I wonder how many are paying per MByte instead of per Mbps.

Refer to the table from the original options paper, linked at http://bit.ly/nlnzwebharvest :

Data downloaded    Number of hosts    Percent of hosts
< 1MB                      322,951               81.3%
1 to 10MB                   43,226               10.9%
10 to 100MB                 22,082                5.6%
100 to 1000MB                8,365                2.1%
1 to 10 GB                     455                0.1%
10 to 100 GB                    22              0.006%
Total                      397,101                100%
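As a quick cross-check of the "more than 97.5%" and "477 sites" figures, the sums can be redone from the host counts in the table. This is only a sketch: the counts are copied from the table as given, and the helper name share() is my own.

```python
# Host counts copied from the options-paper table (data downloaded per host).
BUCKETS = [
    ("< 1MB", 322_951),
    ("1 to 10MB", 43_226),
    ("10 to 100MB", 22_082),
    ("100 to 1000MB", 8_365),
    ("1 to 10 GB", 455),
    ("10 to 100 GB", 22),
]

TOTAL = sum(count for _, count in BUCKETS)


def share(first: int, last: int) -> tuple[int, float]:
    """Hosts and percentage of all hosts for rows first..last (0-based, inclusive)."""
    hosts = sum(count for _, count in BUCKETS[first:last + 1])
    return hosts, 100.0 * hosts / TOTAL


if __name__ == "__main__":
    under_100mb, under_pct = share(0, 2)
    over_1gb, over_pct = share(4, 5)
    print(f"total hosts:  {TOTAL:,}")
    print(f"under 100MB:  {under_100mb:,} ({under_pct:.2f}%)")
    print(f"over 1GB:     {over_1gb:,} ({over_pct:.2f}%)")
```

The rows do add up to the stated total of 397,101, with 477 hosts in the two largest buckets.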

"really such a big issue?"
Well, this was last time...
  http://treenet.co.nz/natlib.png

This graph is taken on traffic to the back-end shard server *behind* a CDN buffer cloud/cluster. It is just one of those 92.2% of servers on a <10MB link.

NP: for comparison, the Sept spike is a site replication.

I imagine most web hosts have similar piles of deadweight site data that nobody but robots ever visits. I have crossed fingers that the new harvest will at least do If-Modified-Since on the old URLs, with the last harvest's date, on stuff like images?
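For anyone unfamiliar with what that would look like on the wire, here is a minimal sketch of a conditional GET using Python's standard library. I have no idea how the harvester is actually built, so this is illustration only: the function names, the example.org URL and the 2010-01-01 date are all placeholders, not the real date of the last harvest.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request


def conditional_request(url: str, last_harvest: datetime) -> urllib.request.Request:
    """Build a GET request carrying If-Modified-Since for the previous fetch time.

    last_harvest must be a timezone-aware datetime.
    """
    # HTTP dates are always expressed in GMT, e.g. "Fri, 01 Jan 2010 00:00:00 GMT".
    stamp = format_datetime(last_harvest.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url: str, last_harvest: datetime):
    """Return the body if it changed since last_harvest, or None on 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_harvest)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: headers only, no body transferred
            return None
        raise


if __name__ == "__main__":
    # Placeholder date and URL, purely for illustration.
    previous = datetime(2010, 1, 1, tzinfo=timezone.utc)
    req = conditional_request("http://example.org/logo.png", previous)
    print(req.header_items())
```

When the server honours the header, an unchanged image costs a few hundred bytes of headers instead of the whole file, which is where the saving on deadweight content would come from.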

AYJ
_______________________________________________
NZNOG mailing list
NZNOG@list.waikato.ac.nz
http://list.waikato.ac.nz/mailman/listinfo/nznog


--
Leon Strong | Technical Engineer
DDI: +64 9 950 2203 Fax: +64 9 302 0518
Mobile: +64 21 0202 8870 Freephone: 0800 SMX SMX (769 769)
Level 15, 19 Victoria Street, Auckland, New Zealand | SMX Ltd | smx.co.nz
SMX | Business Email Specialists