Yesterday Asha and I cruised Barnes & Noble for a few minutes, killing time between appointments. I was trolling for something pointless to read, and picked up Whole Wide World by the British author Paul McAuley. The story is a near-future cyber-crime police procedural, set in London in the 2000-somethings. The camera network that already riddles real-world London has achieved complete saturation; every communication is watched and recorded. I'm on page 101 of 376, and it's gotten me thinking again about the Gummint and information surveillance.
Some people believe that the government is tracking every phone call, fax, and email. Surveillance projects like ECHELON do appear to exist in some form, purportedly poring over millions of calls and transmissions sent via satellite or terrestrial microwave in Europe, Asia, and Africa. The revelation of the FBI's Carnivore system sparked a lot of concern a few years back. These things make a lot of people nervous, some of them nervous enough to do things like encrypt all of their email with PGP. But I have never really bought it.
I don't buy it in the same way that I couldn't make myself accept all of the Y2K doomsaying. You remember Y2K. Being the cautious type, I did do some pre-1-1-2000 prep, but nothing more than what FEMA and the Red Cross say people in northern climates should do before any winter (e.g., have batteries, flashlights, a battery-powered radio, several days of non-perishable food, water, and emergency cash). At the root I kept wondering, 'how is going from 99 to 100 really going to break things?' I mean, even if you only put in two digits for the year, the binary representation has to be at least seven bits (0-127) if it's holding '99', and it's far more likely to be a whole byte (0-255) if it's anything. So why would computers go *boom* when they add 1 to 99 and get 100?
The answer is, they didn't. A lot of programs showed little cosmetic bugs (à la 'year 100'), but the math kept working and the computers kept running. The problem, it seemed, was grossly overstated and not well understood by the people who were calling out the alarm from the rooftops. All sorts of reputable people believed that there would be huge problems. In '98 and '99 we received quarterly Y2K status papers from the Gartner Group (with whom my employer at the time had a consulting contract). Gartner is one of the top four IT analyst outfits, and they and their peers were projecting global economic disruption of at least moderate scale. Others were preparing for The End Times.
I don't buy the surveillance fear for the same reason: some things just don't add up. For instance, Carnivore needs to sit astride the feed from a target's ISP to its upstream ISP, a choke point where all of the suspect's email traffic must pass in order to do its thing. Bear with me now; I'm going to make some serious side trips to pull in all of the pieces I need to make my case.
The Internet is highly distributed in its design. That was the original intent of the ARPAnet: to see whether you could achieve survivability through redundant paths and flexible protocols. If you look at the protocol stack, all of the middle layers between your application and the physical network connection are designed with the assumption that the physical network is unreliable. There's a lot of overhead built in to handle packets that get lost, get fragmented and put back together, or take different paths to their destination and wind up arriving out of order. There are many paths from one place on the Internet to another, and precious few choke points. That's one of the things the big Tier 1 ISPs have spent the past decade specifically working to eliminate, by building lots of links and interconnections with their peers.
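That reassembly idea is easy to picture in a few lines of Python. This is a toy, not real TCP; the only concept it borrows is the sequence number that lets a receiver reorder late arrivals:

```python
# Toy illustration of TCP-style reassembly (not real TCP): each segment
# carries a sequence number, so the receiver can put segments that took
# different paths, and arrived out of order, back in order.
arrived = [(2, b"world"), (0, b"whole "), (1, b"wide ")]

# Sorting on the sequence number restores the original byte stream.
message = b"".join(data for _, data in sorted(arrived))
print(message)  # b'whole wide world'
```

Real TCP also handles retransmitting lost segments and reassembling fragments, but the principle is the same: the network is assumed unreliable, and the endpoints compensate.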
The software that runs on top of the network is also highly distributed. Even centralized services like DNS are really hierarchies of servers that distribute the workload, breaking the information into chunks of a size the servers can handle. This distribution of work and information is what makes the illegal file sharing networks so hard for the RIAA and MPAA to take down; Napster was successfully targeted because it maintained a centralized information store. There is no central email routing service (the basic lie behind that perpetual "congressional email tax" hoax). When you send an email to a friend, your mail client sends the email to your ISP's mail server, which relays it directly to your friend's ISP's mail server. It sits there until your friend accesses it, either with a mail client or a web interface. Web surfing is even more direct: once your PC resolves the server's numeric address through DNS, it's just you and the server having a direct conversation.
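A toy model makes the point about there being no central switchboard. The domains and mail hosts below are hypothetical, and a real mail server would look up the recipient's mail exchanger via a DNS MX query rather than a hard-coded table, but the shape of the decision is the same: the next hop is chosen per recipient domain, by the sender's own server:

```python
# Toy model of SMTP routing (hypothetical domains/hosts, not a real client).
# In reality this table is the DNS MX hierarchy, itself distributed.
MX = {
    "example.com": "mail.example.com",
    "example.org": "mx.example.org",
}

def next_hop(recipient):
    """Each message's route is decided locally, from the recipient's domain."""
    domain = recipient.split("@", 1)[1]
    return MX[domain]

print(next_hop("friend@example.org"))  # mx.example.org
```

There is no box in the middle that every message must traverse, which is exactly what a Carnivore-style tap would need.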
ECHELON works by eavesdropping on transmissions made through the open air; Carnivore, by sitting astride a choke point. I know that Carnivore can work, because I administer Hartford's two choke points. With my current firewalls and their logging capabilities, I can tell everywhere you go on the web if you're sitting at a PC on my network. My investigative reports have cleared teachers of bogus charges of classroom porno surfing, and landed a city employee a seventy-day unpaid suspension for kiddie porn (the State Police investigation is ongoing; he may well face criminal prosecution). I should be getting even more capable firewalls this spring.
With a network sniffer I can capture literally everything that goes in and out over a network link, whether it's the line from your PC to the wiring closet or one of our two Internet feeds. If I had the budget of a national spy agency, I could cache all this traffic on big storage arrays and pore over it with analytical programs to pull out 'suspicious' or 'interesting' streams of web traffic or emails.
But (and this is a big BUT) aside from doing searches based on lists of suspicious words, I would need some seed to start with in order to get anywhere. There has to be some kernel of suspicion; otherwise you're just trolling with a word list. I have several such lists. One of them, 'words.porno', contains 36 words, most of which you cannot say on broadcast TV. I have honed this list over the past five years into an effective screen that I pass web access logs through. It's now pretty reliable at telling me whether someone has been surfing porn or not. Part of the honing process was learning which words couldn't be used because they generated too many false positives ('ass', for instance, appears inside many harmless words like 'compass' or 'assignment'). I still get false positives, but they're down to a level where I can quickly identify them by eyeballing the report.
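The kind of screen I'm describing can be sketched in a few lines of Python. The word list here is a tame two-word stand-in, not my actual 'words.porno' file, and the log line is made up; the point is how word-boundary matching kills the 'compass'/'assignment' class of false positive that naive substring matching generates:

```python
import re

# Hypothetical screening words -- a stand-in for the real list.
WORDS = ["ass", "xxx"]

def naive_hits(line, words):
    """Naive substring matching: flags 'ass' inside 'compass', etc."""
    return [w for w in words if w in line.lower()]

# Word-boundary matching (\b) only fires on the whole word.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, WORDS)) + r")\b",
                     re.IGNORECASE)

def screened_hits(line):
    return pattern.findall(line)

log_line = "GET /maps/compass-assignment.html"
print(naive_hits(log_line, WORDS))  # ['ass'] -- a false positive
print(screened_hits(log_line))      # []      -- correctly clean
```

Even with boundaries you still get some junk through, which is why the last pass is always a human eyeballing the report.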
So, if I had the budget of a large federal agency and had placed network traffic capture systems here and there, I could sample a chunk of U.S. Internet traffic (I could more easily sample traffic as it enters and leaves the U.S., since there are a finite number of transoceanic fiber cables: again, choke points) and go over it with analytic tools. I would have to do my first wave of screening very fast, as in real time, because the volume of Internet traffic is freaking staggering.
For example, right now CEN is pushing 120 Mbps. That means a traffic capture would fill a CD every 48 seconds, and this on a day when most schools in the state are off for February break. A gig a minute on an off day, and that's just from some schools and libraries in punky little Connecticut. Scale that out to the millions of Internet-using Americans and you can see the size of the data collection and analysis problem.
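The back-of-envelope math is easy to check (assuming a ~700 MB CD-R, which lands within rounding distance of the "CD every 48 seconds" figure):

```python
# Numbers from the text: a 120 Mbps feed vs. a ~700 MB CD-R.
link_mbps = 120
bytes_per_sec = link_mbps * 1_000_000 / 8        # 15 MB/s

cd_bytes = 700 * 1_000_000                       # assumed ~700 MB CD-R
seconds_per_cd = cd_bytes / bytes_per_sec        # ~47 s, i.e. a CD a minute-ish

mb_per_minute = bytes_per_sec * 60 / 1_000_000   # 900 MB: roughly a gig a minute
print(round(seconds_per_cd), mb_per_minute)      # 47 900.0
```

And that's one mid-sized regional network on a slow day.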
Right now network equipment manufacturers are pushing most of the work of handling traffic into ASICs (Application-Specific Integrated Circuits), chips custom-'spun' to do one job and do it really, really fast. Traditional software-based routers and switches running on general-purpose CPUs can't deliver the throughput (or latency) needed to handle traffic at gigabit+ levels. A sophisticated traffic screening system could be built to run on ASICs, but the more you ask it to do, the slower it gets. ASIC-based switches can push line-rate gigabit and 10-gig Ethernet on multiple ports. Yet one of the fastest ASIC-based firewalls (in fact, the type I will be buying soon), which can switch traffic at those high rates, has anti-virus screening that runs at only ~160 Mbps. Much slower, and it only looks for a short list of currently active viruses (a 'wild' list, vs. a canonical 'zoo' list of every known virus).
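The arithmetic behind that gap is worth spelling out. The 1 GHz CPU below is a round hypothetical figure, not any vendor's spec, but it shows how thin the per-byte budget is for software inspection at line rate:

```python
# Rough cycle budget for software screening at gigabit line rate.
# The 1 GHz general-purpose CPU here is a hypothetical round number.
cpu_hz = 1_000_000_000
gig_e_bytes_per_sec = 1_000_000_000 / 8   # 125 MB/s at gigabit line rate

cycles_per_byte = cpu_hz / gig_e_bytes_per_sec
print(cycles_per_byte)  # 8.0 cycles per byte, before the OS takes its cut
```

Eight cycles per byte is barely enough to copy the data, let alone run pattern matching over it; that's why the heavy lifting moves into silicon, and why adding work to the silicon costs throughput.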
Picture all of the things and people that the government might want to know about. Now imagine writing a set of rules for matching the names of those people or things in emails or web pages. You must weed out 99.9% of the junk on the first pass, because storage space is finite and you have only a tiny number of human analysts to look at traffic. Oh, and if it's foreigners you're interested in, the system has to be polyglot. Now cram it all into ASICs and get it to run. And we can't forget the problems of decoding audio files, or pulling things out of video streams or image files. Heck, there's a whole discipline in computer science centered on hiding things in images; it's called steganography.
I just don't think it can be done. I think the Internet is too diversified physically, and the traffic rates are far too high, for anyone, even the U.S. government, to capture and scan a significant fraction of it. I might well be wrong. I could be making the dangerous move of extrapolating from my small base of experience to a much larger problem, where other people with resources I can only imagine may have developed ways of managing it. But I don't think so. In the same way that my gut made me a Y2K skeptic, my gut tells me that unless you have specifically attracted the government's attention for something, Uncle Sam is not routinely reading your email.