Need assistance troubleshooting a DNS issue
From about October/November last year we have been getting the odd call from some of our customers to report that our servers are ‘not found’. So far, each report has been from an Xtra broadband user. When investigating these reports we have found that the servers were fine and DNS lookups from various sources (except Xtra) were as well.
Last night the issue happened to a member of the family, so I was able to jump onto their computer using TeamViewer and do some more thorough diagnostics. My results were:

Name servers in use: ns1.xtra.co.nz and ns2.xtra.co.nz
Name servers pingable from PC: Yes
Does smtp.godzone.net.nz resolve: No, times out
Do other resources in our DNS resolve: No, time out
Do resources in other DNS servers resolve: Yes
Are all 4 of our name servers pingable from PC: Yes, 2 in NZ and 2 offshore
Use nslookup to query our name servers directly, does smtp.godzone.net.nz resolve: Yes
Use nslookup to query our name servers using 8.8.8.8, does smtp.godzone.net.nz resolve: Yes

To fix the issue on the PC, I manually set the name servers to 8.8.8.8 and 8.8.4.4; all our services were then resolvable and usable.

So, what's going on? How are our servers different? Well, we are one of the few ISPs that are using DNSSEC to sign zones. Is DNSSEC broken? Not according to http://dnsviz.net/

The problem is also intermittent. I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. With the intermittent nature of the issue I wonder whether one server in the farm might be broken/misconfigured, just a thought.

I have tried, without success, to contact appropriate people at Xtra to either comment or assist, and have failed to get past the level one helpdesk people. Their only response was "Sorry we can't help you".

I am not saying this is an issue with Xtra's internal recursive DNS servers, but so far I have been unable to replicate the issue and have had no reports from any of our other customers using alternative broadband suppliers.

I have run out of ideas now on how to continue to investigate this. Just changing to Google's DNS servers might work but isn't a great solution.
If anyone has any suggestions I'd appreciate hearing from you.

Glen.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Glen and Rosanne Eustace,
GodZone Internet Services, a division of AGRE Enterprises Ltd.,
P.O. Box 8020, Palmerston North, New Zealand 4446
Ph: +64 6 357 8168, Fax: +64 6 357 8165, Mob: +64 27 542 4015
"Specialising in providing low-cost professional Internet Services since 1997"
On 23 April 2015 at 13:48:09, Glen Eustace (geustace(a)godzone.net.nz) wrote:
> The problem is also intermittent, I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. With the intermittent nature of the issue I wonder whether one server in the farm might be broken/misconfigured, just a thought.

I'd be surprised if there was an actual load balancer in the way, though it is entirely possible that there's some ECMP routes to the servers or something. It would surprise me if they had enough DNS traffic to require such a thing, but, what do I know.

There's a couple of ways to easily validate whether you're hitting different servers. It's difficult to prove the negative, but it's easy to prove the positive (with very good confidence).

1) Look at the TTL the server offers, it'll jump around between queries.
2) Ask it for names it has to recurse, and on your name server see where the queries come from, it'll likely change between queries - though some providers pass recursive queries to a higher level caching server which would mask that.
3) Ask for the hostname "dig chaos txt hostname.bind @<server>" and see if it changes (assuming they offer it).

If the customer is on a dynamic IP, get them to reconnect to get a different IP, that might be when you see the change happen - assuming whatever the load sharing function is does it by an L3 hash. If it's L4 you'd see it changing between queries, which I suspect isn't happening in your case given how you describe the problem.

If any of the above things is true, then there's a strong chance you're hitting different servers. If you can isolate it to a specific server (or set of servers), I imagine when you do get in touch with someone about the issue you'll be able to resolve it much faster.

-- Nathan Ward
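The L3-versus-L4 distinction above can be illustrated with a small sketch. It is not how any particular load balancer or ECMP router works: the pool names, the client address and the use of SHA-256 as the hash are all assumptions made for illustration. It only shows why a client behind an L3 hash keeps landing on the same backend until its IP changes, while an L4 hash spreads consecutive queries (which use different source ports) across the pool.

```python
import hashlib

# Hypothetical pool; the real farm layout is not known from this thread.
BACKENDS = [f"resolver-{i}" for i in range(1, 9)]

def pick_backend(key, backends=BACKENDS):
    """Map a flow key to a backend with a stable hash (a stand-in for the balancer's own hash)."""
    digest = hashlib.sha256(key.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]

def l3_pick(src_ip):
    # L3 hash: only the source address feeds the hash.
    return pick_backend(src_ip)

def l4_pick(src_ip, src_port):
    # L4 hash: the source port is included, so each query may land elsewhere.
    return pick_backend(f"{src_ip}:{src_port}")

client = "192.0.2.10"          # documentation address standing in for the customer's IP
ports = range(40000, 40050)    # 50 queries, each from a different source port

l3_seen = {l3_pick(client) for _ in ports}
l4_seen = {l4_pick(client, port) for port in ports}

print("backends reached with L3 hash:", len(l3_seen))
print("backends reached with L4 hash:", len(l4_seen))
```

With an L3 hash the customer would see consistent success or consistent failure until they reconnect with a new address; with an L4 hash the result would vary from query to query.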
On 23/04/15 2:54 pm, Nathan Ward wrote:
> On 23 April 2015 at 13:48:09, Glen Eustace (geustace(a)godzone.net.nz) wrote:
>> The problem is also intermittent, I have heard that the two Xtra servers are actually LB VIPs in front of a farm of name servers. With the intermittent nature of the issue I wonder whether one server in the farm might be broken/misconfigured, just a thought.
>
> I'd be surprised if there was an actual load balancer in the way, though it is entirely possible that there's some ECMP routes to the servers or something. It would surprise me if they had enough DNS traffic to require such a thing, but, what do I know.

We have solid evidence to show that there is a load balancer in front of the IP Spark uses for its resolver. NZRS actually has an article that explores that: https://nzrs.net.nz/content/dnssec-validation-spark-nz

> There's a couple of ways to easily validate whether you're hitting different servers. It's difficult to prove the negative, but it's easy to prove the positive (with very good confidence).
>
> 1) Look at the TTL the server offers, it'll jump around between queries.
> 2) Ask it for names it has to recurse, and on your name server see where the queries come from, it'll likely change between queries - though some providers pass recursive queries to a higher level caching server which would mask that.

We followed this methodology and we saw those jumps in the TTL, as well as in the validation status of responses. We came to the conclusion that there is a set of servers behind the "service address", and some of them validate and some others don't.

> 3) Ask for the hostname "dig chaos txt hostname.bind @<server>" and see if it changes (assuming they offer it).
>
> If the customer is on a dynamic IP, get them to reconnect to get a different IP, that might be when you see the change happen - assuming whatever the load sharing function is does it by an L3 hash. If it's L4 you'd see it changing between queries, which I suspect isn't happening in your case given how you describe the problem.
>
> If any of the above things is true, then there's a strong chance you're hitting different servers. If you can isolate it to a specific server (or set of servers), I imagine when you do get in touch with someone about the issue you'll be able to resolve it much faster.

Assuming it's a hiccup in one of the servers in the pool, it won't be possible to identify it positively, only by elimination. I'm not completely sure the servers respond to hostname.bind queries. At least they don't provide information using NSID.

Cheers,

> -- Nathan Ward
_______________________________________________ NZNOG mailing list NZNOG(a)list.waikato.ac.nz http://list.waikato.ac.nz/mailman/listinfo/nznog
-- Sebastian Castro Technical Research Manager NZRS Ltd. desk: +64 4 495 2337 mobile: +64 21 400535
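The TTL check discussed in this exchange can be sketched as follows. A single cache counts a record's TTL down by one per second, so "query time + returned TTL" should stay roughly constant across repeated queries; several distinct values suggest several caches behind the one address. The function name, the sample values and the two-second tolerance are assumptions for illustration, and the samples should all be taken within one TTL period, since a legitimate cache refresh also produces a new expiry time.

```python
def distinct_caches(samples, tolerance=2):
    """Group observations by implied cache-expiry time.

    samples: list of (unix_time, ttl_seconds) pairs for the same name,
             all obtained from the same resolver address.
    Returns a list of groups; each group holds expiry times that lie
    within `tolerance` seconds of their neighbour.
    """
    expiries = sorted(t + ttl for t, ttl in samples)
    groups = []
    for expiry in expiries:
        if groups and expiry - groups[-1][-1] <= tolerance:
            groups[-1].append(expiry)   # consistent with the previous cache
        else:
            groups.append([expiry])     # a different countdown, likely another cache
    return groups

# Made-up observations: (seconds since first query, TTL in the answer).
single = [(0, 300), (5, 295), (10, 290)]
mixed = [(0, 300), (1, 299), (2, 120), (3, 297), (4, 118)]

print("single cache groups:", len(distinct_caches(single)))
print("mixed pool groups:", len(distinct_caches(mixed)))
```

More than one group is evidence of multiple servers; a single group proves nothing, in line with the point that the negative is hard to prove.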
On 23/04/2015, at 4:41 pm, Sebastian Castro wrote:
> We have solid evidence to show that there is a load balancer in front of the IP Spark uses for resolver.

Having now got access to a resolver in an Xtra network, there definitely seems to be an issue. I can attempt 10 queries of the same host and get (randomly) 7 successes and 3 failures.

The recursive farm would appear to have at least 8 members, rns1-8.xtra.co.nz; I have the IP addresses of 4 of them (tcpdump :)

So at this stage my guess is one of them is not so healthy. But I still can't find a communication channel to discuss the situation with anyone at Xtra :-(

-- Glen Eustace, GodZone Internet Services
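As a rough sanity check on the "one unhealthy member" guess, the observed 3 failures in 10 queries can be compared with what a farm of 8 would produce. This sketch assumes each query is sent independently and uniformly to one of exactly 8 members and that a bad member always fails; none of that is confirmed in the thread, and the function name is made up for illustration.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial tail: probability of at least k failures in n queries,
    when each query fails independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

FARM_SIZE = 8      # assumed from the rns1-8 naming
QUERIES = 10
FAILURES = 3

for broken in (1, 2, 3):
    p_fail = broken / FARM_SIZE
    chance = prob_at_least(FAILURES, QUERIES, p_fail)
    print(f"{broken} bad of {FARM_SIZE}: P(at least {FAILURES} failures in {QUERIES}) = {chance:.3f}")
```

Under these assumptions a single bad member gives roughly a 12% chance of seeing three or more failures in ten queries, so a run of ten is too small a sample to settle how many members are affected; a few hundred queries would narrow it down.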
Hi Glen,

I have reached out to my Spark contacts to find someone for you to connect with. Will provide offline when it comes through.

Kind regards
David

David Morrison
Chief Marketing Officer
NZRS Ltd
P +64 49316973 M +64 274366182 F +64 49316979
E david(a)nzrs.net.nz W www.nzrs.net.nz S david.morrisonnz T @dotnz
PGP 7A38 2F84 C7DF 8FF2 34F8 B4F2 BC54 10AE 2501 6600
Read our latest research into the .nz domain name space. http://www.getyourselfonline.nz/domain-name-perceptions-2014
Hi There,
I can get someone to take a look into this, or at least contact you.
They may very well ask you to do some traces using dig to see what is going
on.
Sebastian is correct there are load balancers in the mix.
I would suspect that you are correct, one or more of the servers may have
an issue.
Do you get any other failures for signed zones?
Paul
On 24 April 2015 at 13:34:51, paul tinson (paul.tinson(a)gmail.com) wrote:
> Sebastian is correct there are load balancers in the mix.

As in an F5 or something?

Can I ask why? This is widely accepted as a really bad way to build a recursive DNS solution (or an authoritative DNS solution).

(Cue Roland Dobbins.)

-- Nathan Ward
On Fri, 24 Apr 2015, Nathan Ward wrote:
> (paul.tinson(a)gmail.com) wrote:
>> Sebastian is correct there are load balancers in the mix.
>
> As in an F5 or something?
>
> Can I ask why? This is widely accepted as a really bad way to build a recursive DNS solution (or an authoritative DNS solution).

It was a step up from having a single physical server[1] completely handle all requests for that IP.

Also remember this was 10 years ago, when anycast was weird science-fiction technology and network and server guys never met in the same room together[2], let alone had servers talking OSPF to routing devices[3].

[1] There was a machine called "alien" and a machine called "terminator".
[2] Not everything has changed
[3] https://www.nanog.org/meetings/nanog34/presentations/abley.nameservers.pdf

-- Simon Lyall | Very Busy | Web: http://www.simonlyall.com/
"To stay awake all night adds a day to your life" - Stilgar
I have made contact with the appropriate Spark internal resource, my thanks to Peter Lambrechtsen.

We believe we have identified the issue, am waiting for Spark to determine how/if they can provide a resolution.

Until I hear back I will leave it at that.

Glen.
Very good. Peter is well able to help get the issue to the right people.
I wouldn't comment on what it is in that detail, but I take your point,
Nathan; it's certainly something we are aware of.
Paul
Out of interest, was the issue related to anything in your zone (AAAA or it
being signed) or was it with Spark?
Thanks,
Anand
On 26/04/2015, at 8:00 am, Anand Kumria wrote:
> Out of interest, was the issue related to anything in your zone (AAAA or it being signed) or was it with Spark?

As far as we are able to determine it is definitely Spark. They have told me that they are working on a solution, and at this time I am waiting to be advised of the outcome.

Our current workaround for our customers (using Xtra's recursive name servers) having issues is to get them to temporarily change their DNS servers to Google's, i.e. 8.8.8.8 and 8.8.4.4; doing so enables our servers to be 'found' again.

-- Glen Eustace, GodZone Internet Services
participants (8)
- Anand Kumria
- Brian Gibbons
- David Morrison
- Glen Eustace
- Nathan Ward
- paul tinson
- Sebastian Castro
- Simon Lyall