On 23/04/15 2:54 pm, Nathan Ward wrote:
On 23 April 2015 at 13:48:09, Glen Eustace (geustace(a)godzone.net.nz) wrote:
The problem is also intermittent. I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. Given the intermittent nature of the issue, I wonder whether one server in the farm might be broken or misconfigured. Just a thought.
I’d be surprised if there was an actual load balancer in the way, though it is entirely possible that there are some ECMP routes to the servers or something. It would surprise me if they had enough DNS traffic to require such a thing, but what do I know.
We have solid evidence that there is a load balancer in front of the IP Spark uses for its resolver. NZRS actually has an article that explores that: https://nzrs.net.nz/content/dnssec-validation-spark-nz
There are a couple of easy ways to check whether you’re hitting different servers. It’s difficult to prove the negative, but it’s easy to prove the positive (with very good confidence).
1) Look at the TTL the server offers; it’ll jump around between queries.
2) Ask it for names it has to recurse, and watch on your own name server where the queries come from. The source will likely change between queries, though some providers pass recursive queries to a higher-level caching server, which would mask that.
We followed this methodology and we saw those jumps in the TTL, as well as in the validation status of responses. We came to the conclusion that there is a set of servers behind the "service address", and that some of them validate and others don't.
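For anyone wanting to repeat the TTL check, here is a minimal sketch in Python. It is not the tooling we used; the function name and the tolerance value are made up for illustration. The idea is that a single cache counts a TTL down in step with the clock, so query time plus TTL (the expiry instant) stays constant for one cached copy. It assumes all samples are taken within a window shorter than the remaining TTL, otherwise a legitimate refresh would look like a second server.

```python
def distinct_caches(samples, tolerance=2):
    """Return the distinct expiry instants implied by (unix_time, ttl) samples.

    Samples from one cached copy share the same time + ttl, give or take
    rounding. More than one expiry instant suggests more than one cache
    behind the address. `tolerance` (seconds) is an arbitrary allowance
    for rounding and network delay.
    """
    expiries = []
    for when, ttl in sorted(samples):
        expiry = when + ttl
        # Treat expiries within `tolerance` seconds as the same cached copy.
        if not any(abs(expiry - seen) <= tolerance for seen in expiries):
            expiries.append(expiry)
    return expiries


# Hypothetical samples: (seconds since first query, TTL in the answer).
samples = [(0, 300), (5, 295), (10, 120), (15, 285), (20, 110)]
print(len(distinct_caches(samples)), "distinct cached copies seen")
```

The sample values above are invented; in practice the pairs would come from repeated queries for the same name against the resolver address.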
3) Ask for the hostname with "dig chaos txt hostname.bind @<server>" and see if it changes (assuming they offer it).
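If you want to repeat that query many times and tally the answers, something like the following Python sketch would do it using only the standard library. The function names are made up, it only handles IPv4 and plain UDP, and the parser is deliberately minimal (it returns the first string of the first TXT answer and nothing else). A server that doesn't answer hostname.bind will simply show up as None.

```python
import socket
import struct

QTYPE_TXT = 16   # TXT record type
QCLASS_CH = 3    # CHAOS class


def build_query(name="hostname.bind", query_id=0x1234):
    """Build a minimal DNS query for `name`, type TXT, class CH."""
    # Flags 0x0100 sets only the RD bit, roughly what dig sends by default.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPE_TXT, QCLASS_CH)


def skip_name(msg, offset):
    """Return the offset just past the (possibly compressed) name at `offset`."""
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:   # compression pointer is two bytes long
            return offset + 2
        if length == 0:             # root label ends the name
            return offset + 1
        offset += 1 + length


def first_txt(msg):
    """Return the first string of the first TXT answer, or None."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", msg[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(msg, offset) + 4      # skip QTYPE and QCLASS
    for _ in range(ancount):
        offset = skip_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", msg[offset:offset + 10])
        offset += 10
        if rtype == QTYPE_TXT and rdlength > 0:
            strlen = msg[offset]
            return msg[offset + 1:offset + 1 + strlen].decode("ascii", "replace")
        offset += rdlength
    return None


def survey(server, attempts=20, timeout=2.0):
    """Ask `server` for hostname.bind repeatedly and count each answer."""
    seen = {}
    for i in range(attempts):
        # A fresh socket per attempt gives a fresh source port each time.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(query_id=i), (server, 53))
                reply, _addr = sock.recvfrom(4096)
                answer = first_txt(reply)
            except OSError:          # timeouts and unreachable errors
                answer = None
        seen[answer] = seen.get(answer, 0) + 1
    return seen
```

Calling survey() with the resolver's address returns a dict of answer to count; more than one distinct non-None key would be good evidence of multiple servers.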
If the customer is on a dynamic IP, get them to reconnect to get a different IP; that might be when you see the change happen, assuming the load-sharing function, whatever it is, uses an L3 hash. If it’s L4 you’d see it changing between queries, which I suspect isn’t happening in your case given how you describe the problem.
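To illustrate the L3 versus L4 distinction, here is a toy sketch in Python. It is not how any particular router or load balancer hashes (real ECMP implementations typically hash more header fields with a vendor-specific function); the backend names, the function name and the use of SHA-256 are all assumptions for the sake of the example.

```python
import hashlib


def pick_backend(backends, src_ip, src_port=None):
    """Pick a backend by hashing the source address, and the port if given.

    With src_port=None this mimics an L3 hash: one client IP always maps
    to the same backend. With a port it mimics an L4 hash: each new source
    port can land on a different backend.
    """
    key = src_ip if src_port is None else f"{src_ip}:{src_port}"
    digest = hashlib.sha256(key.encode("ascii")).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


backends = ["ns-a", "ns-b", "ns-c", "ns-d"]   # hypothetical server names

# L3 hash: the same client IP hits one backend no matter the source port.
print(pick_backend(backends, "192.0.2.10"))

# L4 hash: different source ports from the same IP spread across backends.
print({pick_backend(backends, "192.0.2.10", port) for port in range(1024, 1124)})
```

This is why reconnecting for a new IP matters under an L3 hash, while under an L4 hash successive queries from the same client would already be spread around.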
If any of the above things is true, then there’s a strong chance you’re hitting different servers. If you can isolate it to a specific server (or set of servers), I imagine when you do get in touch with someone about the issue you’ll be able to resolve it much faster.
Assuming it's a hiccup in one of the servers in the pool, it won't be possible to identify it positively, only by elimination. I'm not completely sure the servers respond to hostname.bind queries. At least they don't provide information using NSID.
Cheers,
-- Nathan Ward
_______________________________________________ NZNOG mailing list NZNOG(a)list.waikato.ac.nz http://list.waikato.ac.nz/mailman/listinfo/nznog
-- Sebastian Castro Technical Research Manager NZRS Ltd. desk: +64 4 495 2337 mobile: +64 21 400535