So, over the course of a week we spent a bunch of time testing and troubleshooting. The good news is that the problem was generally reproducible, but our testing results were just ... odd. And they sort of supported the idea of a network-level problem. However, my gut was telling me that it was a client or server issue.
The more I dug into the network, the stronger my hunch got. Thanks to many dinner-table conversations with taichigeek, I have some knowledge of how the inner workings of a database server, well, work. Yesterday afternoon I had my colleague and his minion perform the following test: go into ArcMAP and wait through the very long project open, then reboot and immediately dive right back into the same ArcMAP project. The result? Unlike previous tests, where the interval between rebooting and starting ArcMAP was a lot longer,** this time they were able to jump into ArcMAP and load their projects at "normal" speed. Ah ha!
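That result told me that the problem was not client-side: a freshly rebooted client has nothing cached in memory, so whatever made the second load fast had to be living on the server. The arrows were now pointing firmly at their ArcGIS database server.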
This box is a hefty machine -- twin dual-core 3GHz Xeon (Dempsey) processors with 8GB RAM and 1.4TB of usable disk -- running a 400GB+ database packed with geo-coded data and high-resolution 'pictometry' (very sharp aerial photos). I spent some time this morning staring at perfmon and rummaging through the event logs. I should have started here a week ago.
The event log was full of entries about SQL Server timing out while trying to get buffers and latches, and about very long waits (e.g. 680,000 ms, or roughly 11 minutes) while trying to grow the transaction log for the GIS database. The machine needed the services of a competent DBA.
After lunch I went to pay my coworker a visit. His minion greeted me with "[He] has something to fess up to." Oh? He told me that mid-morning he had decided to take a look at the SQL database, and discovered that it had been a wee bit longer than he had thought since the last time he purged the database transaction log. How big was the log? Three hundred eighty gigabytes. He rather circumspectly admitted that their applications had been running fine ever since he pruned the log.
He also told me that he had called ESRI support and found out how to configure the transaction log so that it would never grow beyond 2GB (old entries would just be pushed out). And that is probably why I didn't kill him.
* In the final analysis this probably has a lot more to do with our two GIS people starting their days at 7:30 AM and most other city workers starting at 8:30 or 9:00.
** Due to things like shifting subnets, moving connections from one switch to another, etc.