We've had a lot of rain in the past few days, so my first suspect was a problem with the DSL circuit that Alpha and Beta share for their Internet access. Pings from home to the servers were dropping 50% of the time. But as I dug deeper, that hypothesis started losing ground. The servers weren't responding to SSH, and there was an odd period of high traffic on the MRTG traces for both of them. I began to suspect that the two servers had been compromised. (In other words, hacked! cracked! pwned!)
Thankfully that turned out to be wrong as well. I finally managed to SSH into Alpha and Beta from another host. A quick check of logins, active processes, and open network ports showed that both servers were exactly as they should be.
I turned my attention back to the possibility of a network problem, but things weren't adding up. Normally, when you are having connection problems, the problem is in what telco types call the last mile or the local loop: the circuit that runs from the very edge of the carrier's network to you. Yet both my cable service and the DSL service at Alpha's and Beta's undisclosed secure location appeared to be fine. The problem only manifested itself when one site tried to talk to the other. A traceroute from here to there brought the real problem to light:
Microsoft Windows 2000 [Version 5.00.2195]
(C) Copyright 1985-2000 Microsoft Corp.

C:\Documents and Settings\netcurmudgeon>tracert alpha.houseofhum.com

Tracing route to gpip.org [184.108.40.206]
over a maximum of 30 hops:

  1     7 ms     6 ms     8 ms  10.4.40.1
  2     7 ms     7 ms     7 ms  glstsysc01-gex0102000.ct.ri.cox.net [220.127.116.11]
  3    10 ms    11 ms    11 ms  provsysj01-atm020311.rd.ri.cox.net [18.104.22.168]
  4    10 ms    10 ms    12 ms  provdsrj01-ge600.rd.ri.cox.net [22.214.171.124]
  5    12 ms    10 ms    12 ms  provbbrj01-ge020.rd.ri.cox.net [126.96.36.199]
  6    17 ms    15 ms    15 ms  NYRKBBRJ01-so000.R2.ny.cox.net [188.8.131.52]
  7    17 ms    16 ms    16 ms  184.108.40.206
  8    22 ms    19 ms    20 ms  220.127.116.11
  9    28 ms    35 ms    24 ms  mrfdbbrj02-ge030.rd.dc.cox.net [18.104.22.168]
 10    24 ms    23 ms    23 ms  ashbbbrj01-pos020100.r2.as.cox.net [22.214.171.124]
 11    21 ms     *        *     126.96.36.199
 12    23 ms     *        *     sp0-4-ASBNVAAS.broadwing.com [188.8.131.52]
 13     *        *       25 ms  serial2-0-0.e1.nwrk.broadwing.net [184.108.40.206]
 14     *        *        *     Request timed out.
 15    28 ms    27 ms     *     p6-0.c0.nwyk.broadwing.net [220.127.116.11]
 16     *        *        *     Request timed out.
 17    25 ms    27 ms     *     18.104.22.168
 18     *       26 ms     *     hartford.atm.ntplx.net [22.214.171.124]
 19     *        *        *     Request timed out.
 20    43 ms     *        *     ip-65-75-17-31.ct.dsl.ntplx.com [126.96.36.199]
 21     *       40 ms     *     ip-65-75-17-31.ct.dsl.ntplx.com [188.8.131.52]
 22     *        *        *     Request timed out.
 23    39 ms    40 ms     *     ip-65-75-17-31.ct.dsl.ntplx.com [184.108.40.206]
 24     *        *       38 ms  ip-65-75-17-31.ct.dsl.ntplx.com [220.127.116.11]

Trace complete.

C:\Documents and Settings\netcurmudgeon>
There, right in the middle of the trip, we start losing packets (the asterisks). Further poking showed that whatever router has the IP address 18.104.22.168 was (and still is) dropping half of the traffic that gets to it. The address doesn't resolve to a name, so I can't tell if it's Cox's or Broadwing's problem, but right at the border of their networks something is amiss. Hopefully some groggy geek or geekette has been paged in and is looking at it.
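For anyone who would rather not eyeball the asterisks, here is a rough Python sketch of the same reading of the trace: it counts lost probes per hop and reports the first hop that lost any. The names parse_hops and first_lossy_hop are made up for illustration, the regular expression assumes the three-probe layout that Windows tracert prints, and the sample rows are copied as-is from the trace above.

```python
import re

# One hop line of Windows tracert output: hop number, three probe results
# (each either "<n> ms" or "*"), then the host text or a timeout note.
# Assumption: the default three probes per hop.
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+ ms|\*)\s+){3})(.*)$")

# Hops 10 through 12, copied from the trace above.
SAMPLE = """\
 10    24 ms    23 ms    23 ms  ashbbbrj01-pos020100.r2.as.cox.net [22.214.171.124]
 11    21 ms     *        *     126.96.36.199
 12    23 ms     *        *     sp0-4-ASBNVAAS.broadwing.com [188.8.131.52]
"""


def parse_hops(text):
    """Return a list of (hop_number, lost_probes, host_text) tuples."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # banner, blank, and "Trace complete." lines
        hop, probes, host = match.groups()
        hops.append((int(hop), probes.count("*"), host.strip()))
    return hops


def first_lossy_hop(hops):
    """Return the first hop that lost at least one probe, or None."""
    for hop in hops:
        if hop[1] > 0:
            return hop
    return None


if __name__ == "__main__":
    print(first_lossy_hop(parse_hops(SAMPLE)))
```

One caution on interpreting the result: a single hop showing asterisks can just be a router that is slow to answer traceroute probes. What makes this trace convincing is that the loss starts at one hop and then persists at every hop after it.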
Hey, from my perspective at least I'm not owned!