For those who are curious: there was a power failure in the data center that houses the primary Pilots of America server. At approximately 11:45 am CST yesterday, all power was lost during a failed attempt to transfer from utility power to generator power. The transfer was necessary to accommodate scheduled maintenance.
The facility is equipped like most data centers. There are backup generators on-site along with a large UPS system. The UPS system provides temporary power if utility power is lost. The diesel generators then start automatically and feed the UPS system until utility power is restored.
The above system didn’t work properly yesterday. The server was only offline for a couple of minutes and was then brought back online. The outage was so short that my monitoring systems didn’t alert me that the server was offline. Almost all the services on the server start themselves automatically, with the exception of our LiveChat system.
I was informed yesterday afternoon that the LiveChat system was offline. I started troubleshooting and fixed the chat system. I then noticed that the server was configured for the correct time zone (CST) but its clock was showing EST, one hour ahead. For those who are familiar, it is set up to synchronize time via NTP. The problem was that the power outage caused such a dramatic change in time (1 hour) that the NTP system didn’t correct the time for fear that it was receiving incorrect information. I corrected the time manually, knowing there would be problems.
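For anyone wondering why a one-hour error is "too big" for NTP, here is a rough Python sketch of the decision. It is a simplified model, not the actual daemon logic: the two thresholds are ntpd's documented defaults (I'm assuming the server was running with something close to them), and the function name `classify_offset` is made up for illustration.

```python
# Simplified model of how ntpd reacts to a clock offset.
# Thresholds are ntpd's documented defaults; whether this server
# used them is an assumption.
STEP_THRESHOLD_S = 0.128    # above this, the clock is stepped rather than slewed
PANIC_THRESHOLD_S = 1000.0  # above this, ntpd refuses to touch the clock


def classify_offset(offset_seconds: float) -> str:
    """Return 'slew', 'step', or 'refuse' for a given clock offset."""
    magnitude = abs(offset_seconds)
    if magnitude > PANIC_THRESHOLD_S:
        # Offset is implausibly large: assume something is wrong and do nothing.
        return "refuse"
    if magnitude > STEP_THRESHOLD_S:
        # Too large to correct gradually: jump the clock.
        return "step"
    # Small enough to correct gradually by adjusting the clock rate.
    return "slew"


# A one-hour error (3600 s) is well past the panic threshold.
print(classify_offset(3600))
print(classify_offset(5))
print(classify_offset(0.05))
```

With these defaults, a 3600-second offset falls into the "refuse" case, which matches what I saw: the clock stayed wrong until I fixed it by hand.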
The vBulletin application is fairly complicated and is of course database driven. Almost every action you take on this website is inserted into the database with a time stamp. Once I rolled time back by one hour, there were users who had actions in the future. This really screws with the application’s logic.
I started digging through the Pilots of America database with the intention of fixing the ‘New Posts’ problem. The database is fairly difficult to interpret sometimes, and eventually I decided that it was safer to just let people wait the 20 minutes before they could click ‘New Posts’.
The other problem was that vBulletin sorts posts by time. This caused a small number of posts to get out of order in threads. It would make more sense if they sorted posts by the post ID, but that would make too much sense.
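To show what I mean, here is a small Python sketch of the ordering problem. The `Post` class, the IDs, and the time stamps are all invented for illustration; this is not vBulletin's actual schema or code. It assumes post IDs are handed out in increasing order as posts are created.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: int    # hypothetical auto-incrementing ID
    timestamp: int  # server clock at insert time, in seconds


# Four posts in the order they were really written.
posts = [
    Post(101, 900),   # before the outage, clock correct
    Post(102, 4600),  # clock running one hour (3600 s) fast
    Post(103, 4700),  # clock still fast
    Post(104, 1200),  # after the clock was rolled back
]

# Sorting by time stamp puts the post-rollback post ahead of earlier ones.
by_time = [p.post_id for p in sorted(posts, key=lambda p: p.timestamp)]

# Sorting by ID keeps the real order regardless of what the clock did.
by_id = [p.post_id for p in sorted(posts, key=lambda p: p.post_id)]

print(by_time)  # [101, 104, 102, 103]
print(by_id)    # [101, 102, 103, 104]
```

The ID ordering only depends on the order in which rows were inserted, so a clock jump can't scramble it.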
All services are restored. No data was lost. A few posts got out of order and Scott couldn’t click ‘New Posts’ for 20 minutes. Overall, not that big of an event.