Disastrous Thoughts

We had IBM in yesterday afternoon to pitch their disaster recovery / business continuity consulting services. In addition to the usual IBM team, they brought a DR/BC specialist who has been working with the State for the past 3-4 years. Mark-my-unindicted-co-conspirator and I had an interesting talk with Agnes -- especially after the formal meeting broke up and the sales-types were out of the room.

Our chat gave me the seed of an idea for our own DR/BC planning that I have been pushing around since then. A little background is probably helpful.

The most common method for businesses to handle their IT disaster recovery needs is the traditional "hot site" recovery service (a la Sungard), where you pay a monthly fee for access to their nearest recovery facility (Philadelphia, for us) in an emergency. The contract gives you the equipment you specify and two opportunities a year to go into the hot site and conduct recovery tests. This is very expensive. The contract gives the suits a nice feeling of security and it makes the auditors happy. But, IMO, it doesn't actually work all that well. Systems change. Equipment changes. And communication always seems to get short shrift. By communication I mean the technical capability for the enterprise to connect to the systems running at the hot site. It's all well and good to take a bunch of tapes to Philly twice a year, do restores and run a dummy payroll, but how are you really going to make this work?

Another common option is the "in house" plan. If you are large enough to have business sites spread all over the country (or all over the world), you can replicate facilities and systems across your own sites. If your data center in Yonkers gets flooded, work moves to the data center in San Jose. This works if you have the sites, and a robust enough national/global network to survive losing a major site and handle the re-routed traffic. Since we are a single municipality, that doesn't work for us. (But that does hint at something else I have been thinking of.)

I started thinking about co-location. Co-lo is how most major web sites handle housing their servers and obtaining access to the Internet. LiveJournal, for example, does not run their servers out of their own offices. They run at a hosting facility that leases space, power, cooling, and bandwidth to many different tenants (that's where the "co" in co-lo comes from). Think of it as an apartment building for servers.

First, some differentiation. Small-time web sites will lease space on a server owned by their hosting provider, sharing the same box with many other small web sites. As the site gets bigger, the next step up is to lease an entire dedicated server from the hosting company. Beyond that, sites get into co-location; they rent space and put their own servers in that space. (Here is a pic with britgeekgrrl showing off her then employer's web servers at their co-lo site. (original is here))

Now, what if we did that? We could rent co-lo space (maybe no bigger than the one in the image above). The first thing to go in would be a beefy VPN box. That would give us the ability to set up a site-to-site VPN between Hartford and our space at the co-lo facility -- effectively putting it on our network. Then we install several servers. We're already doing near-real-time replication between several of our production servers and hot standby servers in another computer room in the city. Those standby servers would simply move to wherever the co-lo facility is.

During normal operations, the VPN box runs the Hartford -- Co-Lo tunnel so that replication can run. During a disaster we use the VPN to enable individual workers to VPN in from their homes (teleworking) and to terminate site-to-site VPN connections from wherever we're running the City from. Picture a construction trailer with a DSL line, a little VPN box and some PCs perched on a knoll above swirling flood waters and you'll get what I'm driving at. (Hint #2 at that other thing I'm thinking about.)
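To make the two jobs of the VPN box concrete, here is a minimal Python sketch. Everything in it (the mode names, the tunnel labels, and the `expected_tunnels` function) is made up for illustration; it is not the configuration of any real VPN product, just a way of writing down which connections the box should be serving in each situation.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DISASTER = "disaster"


# Hypothetical labels for the kinds of connections the co-lo VPN box terminates.
REPLICATION_TUNNEL = "site-to-site: Hartford <-> co-lo"
TELEWORKER_ACCESS = "remote-access: workers at home -> co-lo"
FIELD_SITE_TUNNEL = "site-to-site: temporary field site -> co-lo"


def expected_tunnels(mode: Mode) -> list[str]:
    """Return the connection types the VPN box should be serving in each mode."""
    if mode is Mode.NORMAL:
        # Day to day, the only job is carrying replication traffic.
        return [REPLICATION_TUNNEL]
    # In a disaster, Hartford may be gone; workers and field sites connect instead.
    return [TELEWORKER_ACCESS, FIELD_SITE_TUNNEL]


for mode in Mode:
    print(mode.value, "->", expected_tunnels(mode))
```

The point of writing it this way is that the disaster-mode list is something we could check during a drill: if either connection type can't be brought up on demand, the plan has a hole.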

I see several advantages here. Leasing co-lo space (and the Internet bandwidth needed to make it work) is probably in the same price range as a traditional hot site contract. The servers will be ours, and they'd be "hot" all the time. In a disaster we wouldn't be trying to restore systems onto a hot site provider's machines. We'd be up faster and with current data, not data from whenever the last tapes were run. Because the communications path and the servers have to be working all the time for the data replication to work, we'd know that the systems are there and working. It's a greater ongoing administrative burden, but it brings with it assurance that things will work when the chips are down. It's something we can test whenever we want to, not just on the twice-a-year schedule you get with a traditional hot site contract. And it forces us to deal with the communication question right up front. It's nearly all up-side, as far as I can see.
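The "current data" argument is easy to put numbers on. Below is a rough Python sketch; the timestamps are invented, and the nightly tape run and two-minute replication lag are my assumptions for the sake of the example, not measurements from our systems.

```python
from datetime import datetime, timedelta


def data_loss_window(disaster_time: datetime, last_good_copy: datetime) -> timedelta:
    """How much work is lost: the gap between the last good copy and the disaster."""
    if last_good_copy > disaster_time:
        raise ValueError("last good copy cannot be later than the disaster")
    return disaster_time - last_good_copy


# Made-up scenario: the disaster strikes at 3:00 pm.
disaster = datetime(2007, 4, 25, 15, 0)
last_tape_run = datetime(2007, 4, 24, 23, 0)     # assumed nightly tape backup
last_replicated = datetime(2007, 4, 25, 14, 58)  # assumed two-minute replication lag

tape_loss = data_loss_window(disaster, last_tape_run)
replica_loss = data_loss_window(disaster, last_replicated)
print(f"Tape restore loses {tape_loss}; replica loses {replica_loss}")
```

With these assumed numbers the tape-based recovery starts out most of a business day behind, before counting the time to haul the tapes to the hot site and restore them, while the replicated servers are only minutes behind.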

I'd love to hear from anyone with real DR/BC experience. Does this make sense? Am I nuts?

Oh, and the other thing I'm noodling on. Location. A national business has customers all over the nation. An international business, all over the world. We don't. We're a municipality. Our customers are all in one place. They're all here. A corporation doesn't care where its operations are, as long as its customers can place orders and receive products. A city does care. In a disaster we (city government) need to be in Hartford, helping Hartford's people. Katrina and NOLA made that abundantly clear. So, the question of "how do displaced workers reach the systems at the recovery site" is extra important, because many of those workers will be trying to work in the disaster zone.

Comments

also_huey
Apr. 26th, 2007 02:16 am (UTC)
Picture a construction trailer with a DSL line, a little VPN box and some PCs perched on a knoll above swirling flood waters and you'll get what I'm driving at... ...the question of "how do displaced workers reach the systems at the recovery site" is extra important, because many of those workers will be trying to work in the disaster zone.

The colo solves half of the problem - how do you maintain your core functionality and data in a disaster. Problem #2 is, okay, so a disaster has happened, and you haven't lost anything, but you can't use any of it either, because your infrastructure has no power and no intarnets.

Your mobile command center needs 1) a generator, 2) some EVDO cards or a microwave horn or a small satellite uplink or something, and something to route out of whatever that is, and 3) enough other hardware to provide some kind of useful service - some computers, maybe a dialin concentrator or another router or both, for when the phone company can run temporary lines to you.

If I was doing this on the cheap, I'd get one of those old national guard surplus contact trucks (a 4x4 pickup with a gas-engine-powered welder and some toolboxes on the back of it), a rev A sprint EVDO card or two, a cheap EVDO router or two, a couple of laptops, an old indestructible printer like an HP4 (in a crisis, everybody wants to fucking print things. No idea why.) and some power conditioning so the generator doesn't piss off any of the doodads. Oh, and one of those big silver-bullet 40-cup coffee machines, can't forget that. Add all that stuff up, and you're in business talking to your mission critical data that's safe back at your secure colo.

Instant mobile IT command center: $20,000.
Internet hotspot, anywhere in the mid-atlantic states, in the time it takes you to drive there? Priceless.
mapmakr
Apr. 27th, 2007 02:02 am (UTC)
End of the world Issues
So what about us very little tiny folks (not Government or Big Business)?
paper?
netcurmudgeon
May. 15th, 2007 09:25 pm (UTC)
Re: End of the world Issues
Weeeeeel, you're in a very different position from a large enterprise. Most of your business assets are either digital data (which you already have in portable forms) or the gray matter between your and kriz1818's ears.

So, the question "can I work" for you boils down to: access to your data, access to a PC with the software you need, and access to power and phone/internet. The last three (power, phone, and internet) you can find in a hotel. The first two depend on whether you have time to grab a laptop and your external drives before the disaster hits, or on whether you have backup copies of your data and programs at a secure second location that you can reach. A new machine can come from Staples or CompUSA.

Ready.gov purports to have good planning info for putting together a family / business disaster plan. This FEMA site: "Prepare for a Disaster" appears to be a web version of a FEMA self-study course I have in PDF form.
half_elf_lost
Apr. 27th, 2007 03:48 am (UTC)
I have absolutely nothing good to say about IBM. We've sold our souls to them to run more and more pieces of our business and NONE of it has been to the good. I could make your hair curl over the sophisticated learning management system they sold us that's the biggest piece of crap - ever. They run all our HR systems and support from Costa Rica, with almost no emergency backup (they CLAIM it's in Colorado), and regularly get wiped out by hurricanes.
netcurmudgeon
May. 15th, 2007 09:14 pm (UTC)
Yeah, our experience with them last year on the 'Wireless Hartford' project was 80% negative. The 20% positive was due entirely to one man.

The New Boss seems intent on pressing ahead with IBM ... which means that 'us line managers' are going to a) have to do most of the work, and b) watch IBM like a phalanx of hawks.