On 23 April 2015 at 13:48:09, Glen Eustace (geustace@godzone.net.nz) wrote:
The problem is also intermittent, I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. With the intermittent nature of the issue I wonder whether one server in the farm might be broken/misconfigured, just a thought.

I'd be surprised if there was an actual load balancer in the way, though it's entirely possible that there are some ECMP routes to the servers or something. It would surprise me if they had enough DNS traffic to require such a thing, but what do I know.

There are a couple of easy ways to validate whether you're hitting different servers. It's difficult to prove the negative, but it's easy to prove the positive (with very good confidence).

1) Look at the TTL the server offers. If there are several caches behind the address, it'll jump around between queries rather than counting down steadily.
2) Ask it for names it has to recurse for, and watch on your own name server where the queries come from. The source will likely change between queries - though some providers pass recursive queries to a higher-level caching server, which would mask that.
3) Ask for the hostname with "dig chaos txt hostname.bind @<server>" and see if it changes (assuming they offer it).
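
As a rough sketch of option 3, something like the following repeats the hostname.bind query and tallies the distinct answers. It assumes dig is installed and that the server answers hostname.bind at all; the function names (ask_hostname_bind, tally_backends, probe) are made up for illustration, and the +time/+tries values are arbitrary.

```python
import subprocess
import sys
from collections import Counter


def ask_hostname_bind(server, timeout=5):
    """Run the dig query from option 3 once; return the answer, or '' if none."""
    cmd = ["dig", "+short", "+time=2", "+tries=1",
           "chaos", "txt", "hostname.bind", f"@{server}"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return ""  # dig missing, or it hung longer than we allow
    if result.returncode != 0:
        return ""  # dig reported a failure (e.g. no reply)
    # Ignore dig's own ';;' diagnostic lines and keep the first real answer.
    for line in result.stdout.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):
            return line.strip('"')
    return ""


def tally_backends(answers):
    """Count how often each non-empty identity string was seen."""
    return Counter(a for a in answers if a)


def probe(server, count=20, ask=ask_hostname_bind):
    """Ask `count` times and tally the identities that came back."""
    return tally_backends(ask(server) for _ in range(count))


if __name__ == "__main__":
    seen = probe(sys.argv[1])
    for name, hits in seen.most_common():
        print(f"{hits:3d}  {name}")
    if len(seen) > 1:
        print("More than one identity seen: likely several servers behind this address.")
```

Run it as "python probe.py <server>". More than one name in the output is the positive proof; a single name (or no answer) doesn't prove there's only one server, for the reasons above.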

If the customer is on a dynamic IP, get them to reconnect to get a different IP; that might be when you see the change happen, assuming whatever is doing the load sharing uses an L3 hash. If it's L4 you'd see it changing between queries, which I suspect isn't happening in your case given how you describe the problem.
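
To illustrate the L3 versus L4 difference, here's a toy model. It is not how any particular router hashes (real ECMP hashes are vendor-specific); SHA-256 is just a stand-in, and the addresses, port range, and server count are invented for the example.

```python
import hashlib


def pick_server(key_fields, n_servers):
    """Map a tuple of header fields to a server index with a stable hash."""
    key = "|".join(str(field) for field in key_fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_servers


def l3_pick(src_ip, dst_ip, n_servers):
    """L3 hash: only the addresses matter, so one client IP sticks to one server."""
    return pick_server((src_ip, dst_ip), n_servers)


def l4_pick(src_ip, src_port, dst_ip, dst_port, n_servers):
    """L4 hash: ports are included, so a new source port can land elsewhere."""
    return pick_server((src_ip, src_port, dst_ip, dst_port), n_servers)


if __name__ == "__main__":
    client, vip, servers = "198.51.100.7", "192.0.2.53", 4
    ports = range(40000, 40010)  # each query typically uses a fresh source port
    print("L3:", [l3_pick(client, vip, servers) for _ in ports])
    print("L4:", [l4_pick(client, port, vip, 53, servers) for port in ports])
```

With the L3 function the choice can only change when the client's address changes (hence the reconnect), while with the L4 function it can change from one query to the next as the source port changes.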

If any of the above is true, then there's a strong chance you're hitting different servers. If you can isolate it to a specific server (or set of servers), I imagine that when you do get in touch with someone about the issue you'll be able to resolve it much faster.

-- 
Nathan Ward