Our chat gave me the seed of an idea for our own DR/BC planning that I have been pushing around since then. A little background is probably helpful.
The most common method for businesses to handle their IT disaster recovery needs is the traditional "hot site" recovery service (a la Sungard), where you pay a monthly fee for access to their nearest recovery facility (Philadelphia, for us) in an emergency. The contract gives you the equipment you specify and two opportunities a year to go into the hot site and conduct recovery tests. This is very expensive. The contract gives the suits a nice feeling of security and it makes the auditors happy. But, IMO, it doesn't actually work all that well. Systems change. Equipment changes. And communication always seems to get short shrift. By communication I mean the technical capability for the enterprise to connect to the systems running at the hot site. It's all well and good to take a bunch of tapes to Philly twice a year, do restores and run a dummy payroll, but how are you really going to make this work?
Another common option is the "in house" plan. If you are large enough to have sites spread all over the country (or all over the world), you can replicate facilities and systems across your own sites. If your data center in Yonkers gets flooded, work moves to the data center in San Jose. This works if you have the sites, and a national/global network robust enough to survive losing a major site and handle the re-routed traffic. We're a single municipality, so that doesn't work for us. (But that does hint at something else I have been thinking of.)
I started thinking about co-location. Co-lo is how most major web sites handle housing their servers and obtaining access to the Internet. LiveJournal, for example, does not run their servers out of their own offices. They run at a hosting facility that leases space, power, cooling, and bandwidth to many different tenants (that's where the "co" in co-lo comes from). Think of it as an apartment building for servers.
First, some differentiation. Small-time web sites will lease space on a server owned by their hosting provider. Many other small web sites will share the same box. As the site gets bigger, the next step up is to lease an entire dedicated server from the hosting company. Beyond that, sites get into co-location; they rent space and put their own servers in that space. (Here is a pic with britgeekgrrl showing off her then-employer's web servers at their co-lo site. (original is here))
Now, what if we did that? We could rent co-lo space (maybe no bigger than the one in the image above). The first thing to go in would be a beefy VPN box. That would give us the ability to set up a site-to-site VPN between Hartford and our space at the co-lo facility -- effectively putting it on our network. Then we install several servers. We're already doing near-real-time replication between several of our production servers and hot standby servers in another computer room in the city. Those servers would simply move to wherever the co-lo facility is.
During normal operations, the VPN box runs the Hartford -- Co-Lo tunnel so that replication can run. During a disaster we use the VPN to enable individual workers to VPN in from their homes (teleworking) and to terminate site-to-site VPN connections from wherever we're running the City from. Picture a construction trailer with a DSL line, a little VPN box and some PCs perched on a knoll above swirling flood waters and you'll get what I'm driving at. (Hint #2 at that other thing I'm thinking about.)
I see several advantages here. Leasing co-lo space (and the Internet bandwidth needed to make it work) is probably in the same price range as a traditional hot site contract. The servers will be ours, and they'd be "hot" all the time. In a disaster we wouldn't be trying to restore systems onto a hot site provider's machines. We'd be up faster and with current data, not data from whenever the last tapes were run. Because the communications path and the servers have to be working all the time for the data replication to work, we'd know that the systems are there and working. It's a greater ongoing administrative burden, but it brings with it assurance that things will work when the chips are down. It's something we can test whenever we want to, not on the twice-a-year schedule you get with a traditional hot site contract. And it forces us to deal with the communication question right up front. It's nearly all upside, as far as I can see.
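To make the "we'd know that the systems are there and working" point a bit more concrete, here is a rough sketch of the kind of check I have in mind. It is not how our replication product actually reports anything; the function names, the 15-minute threshold, and the idea that each standby can tell us when it last applied a change are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how far behind production a standby may fall
# before we treat it as a problem. The real number would be a policy decision.
MAX_LAG = timedelta(minutes=15)


def replication_status(last_applied, now, max_lag=MAX_LAG):
    """Classify one standby server by how stale its replicated data is.

    last_applied: when the standby last applied a change from production
                  (assumed to be something we can query; hypothetical).
    now:          the time of the check.
    """
    lag = now - last_applied
    if lag < timedelta(0):
        # The standby claims a timestamp in the future: clocks disagree.
        return "CLOCK_SKEW"
    if lag <= max_lag:
        return "OK"
    return "STALE"


def check_all(last_applied_by_server, now, max_lag=MAX_LAG):
    """Run the check for every standby and return {server_name: status}."""
    return {
        name: replication_status(ts, now, max_lag)
        for name, ts in last_applied_by_server.items()
    }
```

If the tunnel or a standby box dies, the timestamps stop advancing and the check starts returning STALE within minutes. That is the whole appeal: the normal workload doubles as a continuous test, instead of finding out during one of two scheduled visits to Philly.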
I'd love to hear from anyone with real DR/BC experience. Does this make sense? Am I nuts?
Oh, and the other thing I'm noodling on. Location. A national business has customers all over the nation. An international business, all over the world. We don't. We're a municipality. Our customers are all in one place. They're all here. A corporation doesn't care where its operations are, as long as its customers can place orders and receive products. A city does care. In a disaster we (city government) need to be in Hartford, helping Hartford's people. Katrina and NOLA made that abundantly clear. So, the question of "how do displaced workers reach the systems at the recovery site" is extra important, because many of those workers will be trying to work in the disaster zone.