Of course, not really, or this wouldn't be much of a story. There was some random telephonic weirdness over the past couple of days, and then last night, *WHUMPH*, the phones all across the building started acting up: couldn't see the Call Manager servers, couldn't call outside, dropped calls, that sort of stuff. Remembering last week's bit of trouble, I started by cutting off the basement phones from the rest of the network. That shored things right up for everyone else. Then, with Sean (one of the phone guys) working with me, we started turning things back on and watching for the problem to resurface. By lunch we had narrowed it down to the six devices plugged into the third switch. After lunch I narrowed that further to a single phone.
Except that it wasn't one phone. It was two. On the desk of one of the PC techs there was a Cisco 7960G IP phone plugged into the jack on the wall. Next to it was another 7960G piggy-backed off the first. The first phone gets its power from the switch in the network closet; a second phone has to be powered by an external "power brick" that plugs into the wall. This one wasn't: the power brick was plugged into the phone, but the brick's cord was just lying on the floor. *That's damn odd*, I thought, and I scooped up the whole assembly and took it into my office.
Sean and I grabbed a spare switch and started doing some testing in my office. We plugged a phone into the switch, my laptop into the switch, and the switch into the wall. All was good. The phone worked fine, a continuous ping running on my laptop to one of the servers elsewhere on the network was fine, and the switch itself was showing normal messages on the console session I had running. Then we plugged the second (still unpowered) phone into the back of the first phone. ZOT! Pings started dropping. The first phone reset itself. The switch started throwing error messages to the console ("%RTD-1-ADDR_FLAP: FastEthernet0/2 relearning...").
Disconnecting the second phone from the first cleared the problem. Powering up the second phone also cleared the problem. We tried this several times to make sure we could reproduce the problem on demand. Sure enough, we could. Then we tried different permutations (an original 7960 and a 7960G, two original 7960s, etc.) and found that only the combination of two 7960G phones produced the problem.
Our best explanation is that the underlying cause has to do with how Cisco handles sending power from the switch to the phone. The Cisco switches send a constant (and presumably low-power) "tone" down the line. Cisco IP phones have relays that loop the transmit and receive wires when the phone is off. Plug the phone in and the switch gets back the "tone" it's sending. From that, it knows that there's a device in need of power at the other end, so it turns on the juice. The power clicks the relays over and the phone stops looping-back the line and starts acting like a "normal" network device. Presumably, when you piggyback a second phone off of a phone, the power from the power brick kicks those relays over, and the second phone operates normally.
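If you want to picture that handshake, here's a toy Python sketch of my mental model. To be clear, all the class and method names here are my own invention, not anything out of Cisco's firmware; it's just the tone-loopback-relay logic described above, reduced to a state machine:

```python
# Toy model of prestandard inline-power detection, as described above.
# Not Cisco's actual implementation; names and structure are invented.

class Phone:
    def __init__(self, has_external_power=False):
        self.powered = has_external_power
        # The relay loops transmit back to receive until the phone has power.
        self.relay_looped = not self.powered

    def receive_tone(self, tone):
        # An unpowered phone reflects the tone; a powered one absorbs it.
        return tone if self.relay_looped else None

    def apply_power(self):
        self.powered = True
        self.relay_looped = False  # relay clicks over, loopback ends


class Switch:
    TONE = "detection-tone"

    def detect_and_power(self, phone):
        echoed = phone.receive_tone(self.TONE)
        if echoed == self.TONE:
            # Tone came back: there's an unpowered device at the far end.
            phone.apply_power()
            return True   # inline power turned on
        return False      # no loopback heard, so no power supplied


switch = Switch()
phone = Phone()                           # line-powered phone, no brick
print(switch.detect_and_power(phone))     # True: tone echoed, power applied
print(phone.relay_looped)                 # False: loopback has ended

bricked = Phone(has_external_power=True)  # phone running off a power brick
print(switch.detect_and_power(bricked))   # False: no echo, no inline power
```

The key point for what comes next is that last state: a phone with no power from anywhere just sits there with its relay looped.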
But what if there's no power coming from the brick? The first phone doesn't supply power, so the connection stays looped. With the connection looped, traffic coming to the first phone can effectively get repeated back into the network, resulting in chaotic behavior as the switches see traffic that their internal tables say should come from here appearing to come from there. The result is not so much an instant KABOOM, but more of a series of sizzling pops that get progressively bigger. Things work on and off, things fail on and off. I was cackling like a mad scientist when we nailed this one down in testing.
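To see why a looped port makes the switches so unhappy, here's another toy sketch (again my own model, not real switch firmware) of a switch's MAC learning table when one port starts reflecting frames. A frame sent toward the looped port comes straight back, so the switch re-learns the sender's address on the wrong port, which is exactly the kind of relearning the ADDR_FLAP console message was complaining about:

```python
# Toy model of MAC-address learning on a switch with one looped port.
# A reflected frame makes a known source MAC appear on the wrong port,
# so the switch keeps "relearning" the address back and forth.

class LearningSwitch:
    def __init__(self):
        self.mac_table = {}   # source MAC -> port it was last seen on
        self.flaps = []       # log of relearning events

    def learn(self, mac, port):
        old = self.mac_table.get(mac)
        if old is not None and old != port:
            self.flaps.append(f"{mac} relearned: port {old} -> port {port}")
        self.mac_table[mac] = port


sw = LearningSwitch()
sw.learn("aa:aa", 1)   # a server's MAC, seen on its real port
# The server sends a frame toward port 2; the looped phone reflects it,
# so the same source MAC now appears to arrive *from* port 2.
sw.learn("aa:aa", 2)
print(sw.flaps[0])     # aa:aa relearned: port 1 -> port 2
```

Every reflected frame triggers another flap, and traffic for that address gets forwarded toward the loop instead of the real host, which fits the on-again, off-again failures we were seeing.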
It wasn't how I planned to spend my day, but it certainly was a lot more interesting than pushing paper.