On 23 April 2015 at 13:48:09, Glen Eustace (geustace(a)godzone.net.nz) wrote:
> The problem is also intermittent. I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. Given the intermittent nature of the issue, I wonder whether one server in the farm might be broken/misconfigured. Just a thought.

I’d be surprised if there was an actual load balancer in the way, though it is entirely possible that there are some ECMP routes to the servers or something similar. It would surprise me if they had enough DNS traffic to require such a thing, but what do I know.

There are a few easy ways to check whether you’re hitting different servers. It’s difficult to prove the negative, but it’s easy to prove the positive (with very good confidence).

1) Look at the TTL the server returns. A single cache counts a record’s TTL down steadily between queries; if you’re hitting different caches, it’ll jump around.
2) Ask it for names it has to recurse for, and on your own name server see where the queries come from. The source will likely change between queries, though some providers pass recursive queries to a higher-level caching server, which would mask that.
3) Ask for the hostname with "dig chaos txt hostname.bind @<server>" and see if it changes (assuming they offer it).

If the customer is on a dynamic IP, get them to reconnect so they get a different IP; that might be when you see the change happen, assuming the load-sharing function works on an L3 hash. If it’s an L4 hash, you’d see it change between queries, which I suspect isn’t happening in your case given how you describe the problem.

If any of the above is true, then there’s a strong chance you’re hitting different servers. If you can isolate it to a specific server (or set of servers), I imagine that when you do get in touch with someone about the issue, you’ll be able to resolve it much faster.

--
Nathan Ward
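For anyone who wants to automate checks 1 and 3, here is a minimal Python sketch, assuming you have already collected the answers yourself (e.g. by running dig in a loop). It only interprets the data. The function names, the `slack` tolerance and the sample server names are hypothetical, chosen for illustration. The TTL check relies on the fact that one cache counts a record down, so query time plus returned TTL (the implied expiry) stays constant until the record expires and is refetched; a different implied expiry while the earlier copy should still be cached points to a different cache. Like the checks above, it can only prove the positive.

```python
from typing import Sequence, Set, Tuple

# (seconds since first query, TTL in seconds) for repeated queries of ONE record name.
# Collecting these (e.g. from dig output) is assumed and not shown here.
Observation = Tuple[float, int]


def ttl_suggests_multiple_caches(obs: Sequence[Observation], slack: float = 2.0) -> bool:
    """Return True if the TTL sequence is inconsistent with a single cache.

    'slack' (hypothetical default) absorbs whole-second TTL rounding and query timing.
    """
    for (t_prev, ttl_prev), (t_cur, ttl_cur) in zip(obs, obs[1:]):
        expiry_prev = t_prev + ttl_prev  # when the earlier cached copy should expire
        expiry_cur = t_cur + ttl_cur     # implied expiry of the newer answer
        still_cached = t_cur < expiry_prev
        # A changed expiry before the old copy expired means another cache answered.
        if still_cached and abs(expiry_cur - expiry_prev) > slack:
            return True
    return False


def distinct_servers(hostname_answers: Sequence[str]) -> Set[str]:
    """Distinct 'hostname.bind' CHAOS TXT answers; dig +short prints TXT quoted."""
    return {a.strip().strip('"') for a in hostname_answers if a.strip()}


if __name__ == "__main__":
    # Made-up sample data, not real Xtra measurements.
    print(ttl_suggests_multiple_caches([(0, 300), (10, 290), (20, 280)]))  # False
    print(ttl_suggests_multiple_caches([(0, 300), (10, 290), (20, 150)]))  # True
    print(len(distinct_servers(['"ns-a.example"', '"ns-b.example"', '"ns-a.example"'])))  # 2
```

A record that expires and is refetched between queries legitimately gets a fresh TTL, which is why the check only compares answers while the earlier copy should still be cached.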