Westminster College does not currently have a backup generator protecting its datacentre and telephone system. That is about to change. Backup generators are no longer optional. That is one of several lessons learned from recent outages at the college.
A little over a couple of weeks ago, the Westminster College datacentre experienced two total failures. On Sunday evening, storms ripped through Fulton, Missouri and took electrical service with them. The batteries in our datacentre are good for only about 45 minutes of backup power before systems go down. That Sunday night's electrical outage lasted longer than 45 minutes, so the datacentre and all of its services, including networking, wireless networking, DNS, DHCP, ERP, email, internet gateway, etc., went down for the count. Although power glitches are not unusual here, it is unusual for an outage to last that long. I was three states away at the time, so my staff brought our datacentre back up and we went on our merry way.
I made the trek back from Louisville, Kentucky after having a fantastic visit with the TechRepublic staff (thanks, guys!). I got home late that day, so I checked my calendar and, lacking any meetings for the next morning, decided to sleep in and head into the office mid-morning.
The best laid plans...
Around 9am, my boss, the president of Westminster College, calls me on my mobile phone. He doesn't mind that I'm still groggy, but he does tell me, "You might want to come in. The hill behind Westminster Hall [our main admin building and home of our datacentre] gave way last night and took out electrical service to the building". Uh-oh.
I got ready as quickly as possible and went into the office to survey the damage. Indeed, the hill had begun to collapse. We've gotten a ton of rain this year and it finally caught up to us. During the collapse, the main electrical feed to Westminster Hall was torn, literally, from the transformer that powers the building as well as two other buildings and our datacentre. The transformer itself was damaged beyond repair. Our city electrical workers and the college's plant operations staff worked tirelessly that day to restore electrical service to, at a minimum, Westminster Hall, but they managed to get power back for all three buildings. The transformer was replaced and new, but temporary, service lines were run to the transformer so that the buildings could be energised. In all, we were down from about 11pm or so Monday evening until around 4pm Tuesday afternoon.
Without Westminster Hall, the college has no data network, no telephone system (the phone system batteries, after fighting valiantly for about 12 hours, finally succumbed to the inevitable), no internet and no servers. Worse, that day was the day that payroll had to be run. After fighting with a couple of inadequate backup generators, we finally simply moved the necessary hardware to another building and performed the tasks necessary to get payroll done. We also took advantage of the unplanned downtime to finish some work we've been wanting to do in our server room.
We learned a number of lessons that day:
|> A backup generator is no longer optional. We've actually already begun the planning to install a backup generator for our datacentre and phone system. An electrical engineer visited campus a couple of weeks ago to help us plan our efforts. Although I met no resistance from the executive team when I initially proposed this installation, that day's events sealed the deal in a way that I would never be able to articulate. Without our datacentre, no one could do their jobs. We sent people home and struggled to handle payroll. IT isn't a "side-by-side" operation anymore like it was in the old days. We can't just revert to paper and pencil to handle business operations.
|> You can't plan for everything. In our incident planning discussions, we never talked about the possibility of a landslide. This is Missouri. Flat country. Sure, we're on a hill, but this is a Missouri hill we're talking about, not some place from the Pacific shores! Our incident responses must be flexible enough to be applied to any incident, not just the ones we define as likely possibilities.
|> Focus on the critical things and consider the rest to be gravy. That day, payroll was job #1. Our last summer group had left campus the previous week and we had no students or faculty on campus. But people have to be paid on time and in an expected way. Early on, we decided to hold out for a generator that was being brought in by the city that would have been able to power our whole datacentre while the workers replaced the transformer. The generator was to be wired into one of our building panels that includes the datacentre. After three hours of work, we found that the generator was not putting out the right voltage and it was determined that the unit was bad. So, in hindsight, we blew three hours of payroll processing time hoping that the "big win" (getting the whole datacentre energised) would come to fruition. Instead, we should have focused on the critical element -- payroll -- and looked at anything beyond that as gravy. Instead of waiting to start payroll processing at 2pm after moving servers to another building at 1:30pm, we should have immediately moved the servers that morning so as not to risk the 4pm payroll deadline imposed on us by our bank.
|> Have good relationships with outside agencies. Our city crews really did amazing work that day. They went out of their way to make sure that power was restored as quickly as possible. We enjoy good relations with the city, though, and I'm sure that goodwill played into our restoration.
The good news: I wrote this blog posting from my work computer on Tuesday night after the power had been up for a few hours. Although the situation we encountered was serious, there are a lot of takeaways to be had that we can now apply to our next situation and to improve our systems.