Hi all,

Beware, operational content.

A couple of hours ago I started noticing weird things getting to stuff overseas. Looks like GGIS have some sort of path flapping that doesn’t care about IP flows. I see RTT to sites in the US jump from ~125ms to ~160ms several times per second, which to me is indicative of going the short and long way around the SXC network, respectively.

# hping3 -S -p 80 216.146.46.10
HPING 216.146.46.10 (eth0 216.146.46.10): S set, 40 headers + 0 data bytes
len=46 ip=216.146.46.10 ttl=55 DF id=33739 sport=80 flags=SA seq=0 win=65535 rtt=129.1 ms
len=46 ip=216.146.46.10 ttl=55 DF id=47972 sport=80 flags=SA seq=1 win=65535 rtt=132.3 ms
len=46 ip=216.146.46.10 ttl=53 DF id=48561 sport=80 flags=SA seq=2 win=65535 rtt=159.6 ms
len=46 ip=216.146.46.10 ttl=55 DF id=35481 sport=80 flags=SA seq=3 win=65535 rtt=128.1 ms
len=46 ip=216.146.46.10 ttl=54 DF id=49791 sport=80 flags=SA seq=4 win=65535 rtt=158.7 ms
len=46 ip=216.146.46.10 ttl=55 DF id=36729 sport=80 flags=SA seq=5 win=65535 rtt=132.2 ms
len=46 ip=216.146.46.10 ttl=53 DF id=50960 sport=80 flags=SA seq=6 win=65535 rtt=158.8 ms
len=46 ip=216.146.46.10 ttl=55 DF id=37839 sport=80 flags=SA seq=7 win=65535 rtt=129.1 ms
len=46 ip=216.146.46.10 ttl=55 DF id=38397 sport=80 flags=SA seq=8 win=65535 rtt=129.2 ms
len=46 ip=216.146.46.10 ttl=54 DF id=39075 sport=80 flags=SA seq=9 win=65535 rtt=158.8 ms

The example above is a bunch of SYN packets, each of which is a different L4 flow, so it’d be *sort of* reasonable for them to go different ways. However, you see plenty of out-of-order TCP packets if you do a long-running flow (like, say, trying to load a web page with large objects, such as speedtest.net). Out-of-order packets mean TCP cries a bunch and goes slow. This is the effect I see.

Note also the TTL differences and their correlation with RTT.
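
If you want to check the TTL/RTT correlation on your own captures, a rough Python sketch along these lines groups hping3 reply lines by TTL and averages the RTT per group. The sample lines are the ones from the output above. The names `SAMPLE`, `rtt_by_ttl` and `summarize` are made up for illustration, and the regex assumes hping3's usual `ttl=... rtt=... ms` reply format.

```python
import re
from collections import defaultdict
from statistics import mean

# Reply lines copied from the hping3 run above.
SAMPLE = """\
len=46 ip=216.146.46.10 ttl=55 DF id=33739 sport=80 flags=SA seq=0 win=65535 rtt=129.1 ms
len=46 ip=216.146.46.10 ttl=55 DF id=47972 sport=80 flags=SA seq=1 win=65535 rtt=132.3 ms
len=46 ip=216.146.46.10 ttl=53 DF id=48561 sport=80 flags=SA seq=2 win=65535 rtt=159.6 ms
len=46 ip=216.146.46.10 ttl=55 DF id=35481 sport=80 flags=SA seq=3 win=65535 rtt=128.1 ms
len=46 ip=216.146.46.10 ttl=54 DF id=49791 sport=80 flags=SA seq=4 win=65535 rtt=158.7 ms
len=46 ip=216.146.46.10 ttl=55 DF id=36729 sport=80 flags=SA seq=5 win=65535 rtt=132.2 ms
len=46 ip=216.146.46.10 ttl=53 DF id=50960 sport=80 flags=SA seq=6 win=65535 rtt=158.8 ms
len=46 ip=216.146.46.10 ttl=55 DF id=37839 sport=80 flags=SA seq=7 win=65535 rtt=129.1 ms
len=46 ip=216.146.46.10 ttl=55 DF id=38397 sport=80 flags=SA seq=8 win=65535 rtt=129.2 ms
len=46 ip=216.146.46.10 ttl=54 DF id=39075 sport=80 flags=SA seq=9 win=65535 rtt=158.8 ms
"""

# Assumed line shape: "... ttl=<int> ... rtt=<float> ms"
LINE_RE = re.compile(r"ttl=(\d+).*?rtt=([\d.]+) ms")


def rtt_by_ttl(text):
    """Return {ttl: [rtt_ms, ...]} for every reply line found in text."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            groups[int(match.group(1))].append(float(match.group(2)))
    return dict(groups)


def summarize(groups):
    """Return {ttl: (sample_count, mean_rtt_ms)} ordered by TTL."""
    return {ttl: (len(rtts), mean(rtts)) for ttl, rtts in sorted(groups.items())}


if __name__ == "__main__":
    for ttl, (count, avg) in summarize(rtt_by_ttl(SAMPLE)).items():
        print(f"ttl={ttl}  samples={count}  mean_rtt={avg:.1f} ms")
```

On this sample, every ttl=55 reply lands in the low group (under 140ms) and every ttl=53/54 reply lands in the high group. Ten packets is a tiny sample, so treat that as a hint rather than proof.
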
I do not see the TTL hopping around when looking at a single TCP stream, though I do see different TTLs on two concurrent streams, so I suspect that’s some other ECMP happening that isn’t a problem. Who knows where that might be, but I don’t believe it’s important.

The host I tested against is the main speedtest.net web server. I see the out-of-order packets when doing a regular speedtest.net test to the Sonic.net endpoint in California (I think San Jose?).

Here is what mtr sees when tracing to the same host as above. It is unclear to me why hop 9 doesn’t see the same weird latency. It sees it sometimes, but not all the time. Perhaps there’s something about the way mtr sends packets that means the hashing is different over some sort of aggregate interface/ECMP thing. Maybe that’s where our TTL difference from above happens. I *always* see the frequent latency changes (jitter, I guess) on hop 8.

 7.|-- ae1-10.tkbr12.global-gate   0.0%  10    3.3   3.5   2.2   5.3   0.7
 8.|-- xe-0-0-7-3.sjbr3.global-g   0.0%  10  130.5 136.1 126.3 163.1  14.2
 9.|-- ae0.pabr5.global-gateway.   0.0%  10  154.8 154.7 154.1 155.9   0.5

Here are some more samples that show differences in the later hops, but with hop 8 always having jitter:

 7.|-- 202.50.232.37   0.0%  10    3.5   3.2   1.5   4.1   0.7
 8.|-- 122.56.127.22   0.0%  10  160.2 148.0 127.7 160.2  15.0
 9.|-- 122.56.127.25   0.0%  10  158.2 148.6 132.2 166.9  14.3
10.|-- 203.96.120.74  10.0%  10  159.0 159.0 158.2 160.0   0.4
11.|-- 4.28.172.109    0.0%  10  167.6 171.3 166.3 207.7  12.8

 7.|-- 202.50.232.37   0.0%  10    2.9   3.3   2.7   4.2   0.3
 8.|-- 122.56.127.22  10.0%  10  159.3 147.1 126.6 168.7  17.1
 9.|-- 122.56.127.25  10.0%  10  158.7 146.9 132.5 166.3  13.8
10.|-- 203.96.120.74  10.0%  10  159.7 158.8 158.0 159.7   0.0
11.|-- 4.28.172.109   20.0%  10  167.0 167.4 166.8 168.1   0.0

At least two providers have logged (or are in the process of logging) faults, so I figured it’s appropriate to bring it on to the list.

Is anyone else seeing this behaviour?
Do you have any feedback from GGIS yet?
Have you had any customers complain about poor speeds/user experience?

--
Nathan Ward