One of the techs at the Library main branch -- let's call him Neo -- thought that his Internet access was going way too slow today. He left both Pat and me voicemail; both of us filed his messages under "not a real outage, will get to you later today". So, deciding to take matters into his own hands, around 1:25 this afternoon Neo went rummaging around in the main wiring closet. Now, in the libraries we provide the network and the phones, but they take care of their own PCs and servers. Neo's not a bad guy, but he's a PC tech who's struggling to grow into a server tech. I know network engineers. I've served with network engineers. Network engineers have been my friends. Neo is no network engineer.
Neo, apparently not understanding that unplugged cables are off and not slow, spotted a random cable and plugged it into the ATM switch thinking that, I don't know, more plugged in cables are more faster. Neo created a bridging loop. A bridging loop is a cross between a roach motel and a swirling vortex of doom. Packets go in, and not only do they not come out, they keep going around and around and around. Indefinitely. Each new packet entering the loop (coming from a PC or server) only adds to the infinitely repeated maelstrom of data. (If you're thinking "those idiots, why don't they have Spanning Tree turned on?", there's a reason, though I'm revisiting it in the library's case.)
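For anyone who wants to see the vortex in miniature, here is a toy sketch of why a loop never drains. It is not a model of our actual gear: the switch names, the link lists, and the flood_history function are all made up for illustration. The only rule it borrows from real bridging is that a switch floods a broadcast out of every port except the one it arrived on, and that a layer 2 frame carries no hop counter to kill it off.

```python
from collections import Counter


def flood_history(links, start, hops):
    """Count broadcast copies in flight after each hop (toy model).

    links: list of (switch_a, switch_b) pairs; names are hypothetical.
    start: switch where a host injects a single broadcast frame.
    hops:  number of forwarding rounds to simulate.
    """
    # Build a port table: switch -> list of (link_id, neighbor).
    ports = {}
    for link_id, (a, b) in enumerate(links):
        ports.setdefault(a, []).append((link_id, b))
        ports.setdefault(b, []).append((link_id, a))

    # A frame in flight is (switch it sits at, link it arrived on).
    in_flight = Counter({(start, None): 1})
    history = []
    for _ in range(hops):
        nxt = Counter()
        for (switch, arrived_on), copies in in_flight.items():
            for link_id, neighbor in ports[switch]:
                # Flood out every port except the arrival port.
                if link_id != arrived_on:
                    nxt[(neighbor, link_id)] += copies
        in_flight = nxt
        history.append(sum(in_flight.values()))
    return history


# Assumed example topologies: a loop-free chain, then the same chain
# with one extra cable plugged in to close the loop.
chain = [("A", "B"), ("B", "C")]
loop = chain + [("C", "A")]

print("chain:", flood_history(chain, "A", 5))
print("loop: ", flood_history(loop, "A", 5))
```

On the chain, the single broadcast reaches the far end and dies out. With the extra cable, two copies chase each other around the ring for as long as you care to simulate, and every additional broadcast from a PC or server adds more copies on top of those. Spanning Tree avoids this by blocking one of the links so the topology behaves like the chain.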
The router that connects the library subnet to the rest of the world sits over at the Board of Ed office. It also connects the data side of the network to the IP telephony side of the network, and to the subnet that has our primary Internet firewall and primary DNS servers. This normally very reliable trooper of a box got shit hammered by Neo's bridging loop. It was getting 20,000 packets per second of just broadcast traffic, let alone repeats of the traffic it was supposed to route. When I looked in its event log I saw messages the likes of which I had never seen before. This, to me, is saying something -- I've been working with Nortel routers since they were originally Wellfleet routers back in '95. Every few minutes the poor beast would hit resource exhaustion and restart. Imagine for a moment trying to drive while 20,000 people shouted continuously into your left ear. You'd hit a tree too.
What this meant was 1) the Library couldn't get out to the outside world, 2) no one anywhere on the network could access the Internet because the router overload/crashes prevented them from reaching our DNS servers, and 3) Windows 2000 / Windows XP users couldn't sign on to the domain, also because the DNS servers were unreachable.
At 1:35 this afternoon my pager started vibrating. My NMS server had noticed that things weren't right with the voice networks and our Internet servers, and it managed to get a few emailed pages out before the router went into cardiac arrest. Shortly thereafter my cell phone started vibrating. Two or three people got the "I just found out, I'm just starting to look at it, I'll call you back" response. Seeing where the trouble was centered, and having Neo's voicemail message in the back of my mind, I headed across the street to have a look at that router.
In the past few years of dealing with this network, my troubleshooting technique has become one where I will take increasingly drastic actions to stabilize as much of the network for as many people as rapidly as possible. Once things are stable for the majority of people, we work on isolating the cause and restoring service to the people who have been cut off. Thankfully, my first action this afternoon -- isolating the Library from the rest of the network -- was the right action to get the router stabilized and restore service for everybody else. This came at about fifty minutes after the first sign of trouble. From there it was a process of elimination (using a traffic sniffer to gauge traffic levels) to find the switch port at the Library that was causing all of the trouble. Pat and I got the problem port nailed down in another forty minutes.
Once we had that port administratively disabled, we un-isolated the Library and let them back on to the network. Neo called us just as we had turned the Library back on. That was one of those fun calls where, knowing you have them dead to rights, you get to ask the other person "so, what exactly were you doing?" While doing some related cleanup work later at the Library, Pat, Neo and I bumped into Neo's boss. I gave Neo the job of explaining to his boss why the Library had been down for most of the afternoon. His boss was none too pleased with his underling's actions and promised that "appropriate actions" would be taken.
Long term we need to take a look at where our DNS servers are located. It makes sense to relocate one of them to a different subnet -- one that does not rely on this particular router to reach the rest of the network. We will also have to think about being more like true Network Nazis to the Library staff (e.g. no open ports on the ATM switch left enabled for them to mess with). Given the mess that they've made of their wiring closet, I also think that I'll be setting a date for them to join us in a little house cleaning. Yep, I think that it's past due for the sheriff to crack down on this part of town.