Greetings. I’d like to take a moment to talk about the complete service outage we experienced on April 4, 2017, what happened as best I understand it, and what we’re going to learn from the experience and do better in the future.
First off, the most important point: we’re back up and running, and all checks I’ve been able to do since the server came back up have indicated we lost no data. I’m in the process of catching up on emails and such we didn’t get during the outage now.
On April 3, I was doing some development on the server and noticed it was running awfully slow. I should have said something to support, but I figured my code wasn’t very well optimized yet, it was late and I was tired, and I left it alone. When I went to bed on April 4 at 5:15AM, the server was slow, but functional. When I woke up at 9:00AM, it was completely down. I reached out to the support team at our hosting provider immediately.
IGA runs – currently – on one virtual private server, which is basically a private pie-slice of a massive server’s resources. Everything’s supposed to be redundant and backed up and magical. For those who aren’t up on datacenters and stuff, commercial-grade servers don’t just have one hard drive like your desktop computer does; they generally have 4 or 6 drives at minimum, and they can be configured such that multiple drives have a copy of the data, because you don’t wanna lose stuff. The idea is that, because drives fail, you can lose a drive or two and still be OK, you just have to replace that drive and carry on; the array just heals, as long as you don’t lose a bunch of drives all at once. We’re actually on a SAN, which is a giant disk array with many disks, shared by multiple servers.
SANs are managed by a special controller card that is in charge of reading and writing the data. Turns out, our SAN’s controller card was dying, and as it did, it was effectively ruining swaths of the drives. The slowdowns I was experiencing on April 3 were probably the SAN trying to find a usable copy of the data on other drives after the first went down. At 7:32AM on April 4, the datacenter team realized what was happening and pulled the plug. This took our server offline, but also protected our data.
The datacenter team then replaced the bad controller card and built a whole new SAN for us and the other sixty or so virtual servers that were using it. Now comes the real challenge: putting the data back. Since there was no complete copy on any one set of disks on the old SAN, because of the corruption, the datacenter team had to write a custom script to basically scour all the disks and reassemble the data from the still-usable bits. Between writing the script and then the very slooooow process of pulling the data back given that they were using the SAN in a way it wasn’t intended to be used, and because we’re talking about terabytes of data across all the affected customers, this took a while. We first got indications that indiegamealliance.com was back online at approximately 1:18AM on April 5.
Most companies would have just turned the server back on with missing data and tell people, “if you didn’t have a backup, tough luck,” but the crew at HostDime went above and beyond as they always do. I’m pleased to report that all checks I’ve been able to do since the server came back up have indicated we lost exactly no data. As an aside: HostDime is a fantastic company to host with specifically for reasons like this – and IGA Pro members save 10% when they use our referral code.
While we dodged a bullet here, there was a period of time where the HostDime staff didn’t know if they could restore our data, and that left me spending the morning assembling our backups and seeing what all we had and didn’t have on recent copies. All in all, I could have restored about 96% of our data, which isn’t catastrophic, but that’s not good enough for me. I did a full audit of our backup situation, and here’s what I found.
IGA’s collected information footprint is big: about 12 gigabytes worth. Once more people start using the new print and play hosting feature, that will reach hundreds of gigabytes in very short order. It’s impractical to just download it every day to have a backup. I was signed up for a “remote backup” service with the hosting provider, but it only has a certain amount of backup space. The problem is that apparently, there’s a setting you have to turn on that tells the backup process, “delete old backups when you run out of room, to make room for new ones.” To my shock and dismay, that was not turned on, so the newest backup we had there was from November 30, 2016. I had received email warnings about them, but they were cryptic Linux system logs and I didn’t understand them well enough to decipher what was wrong and fix them. I am a coder, not a systems administrator, and the emails generally say stuff like “A system event has occurred” as opposed to “Your backups are broken! Fix them now!” So, that backup would have been good good for some of the older, legacy stuff, and a few other non-IGA static domains I host, but not terribly helpful for IGA as a whole. This was the primary line of defense, and it was basically useless in this case.
This vulnerability has now been fixed, with the aid of HostDime staff. The pruning issue has been fixed, making sure it’ll never fail for disk space. We also found an error in one of our testing databases that was giving the backup process some fits, so we took care of that as a precaution as well. I’ll be downloading the generated backups on a regular basis so that we have extra copies, even after the old ones are deleted from the online storage.
Without a comprehensive backup of the site, it was time to hit application-level backups, which really should be the last line of defense.
I work on IGA’s website code out of a Google Drive share, which means I have the current code stored in Google’s cloud. Because I use the Google Drive desktop app, I also have a fully-synchronized copy of my Google Drive on every machine I use, which is two laptops, my desktop workstation and my phone. Additionally, I check in code to GitLab.com after every major revision, so that’s yet another, albeit slightly outdated at the moment, copy. So, the IGA code was fine, and we would have suffered no data loss on that front.
Most IGA staff members use GMail as our mail client, for its handy Android integration, but it has another super bonus feature – it stores a copy of all our email when we check it. So other than emails that have been received during the outage itself, we appear to have lost nothing on that front. Hooray, clouds-talking-to-clouds!
A few weeks ago, I read about a massive database failure at a large company, made worse by lack of backups, and I had a bit of a panic attack and started implementing a plan to improve the IGA backup structure, which was not at all as robust as I’d have liked it to be (as you’ll see.) It’s a very, very good thing I did. Focusing on the most important stuff first, I wrote a script that executes every 4 hours that effectively takes the main IGA database, compresses it to a file, and sends it to me securely offsite. So, I had a snapshot from 4:20AM. The 8:20AM snapshot never came, so we knew the shutdown occurred between 5:15AM and then. (It was actually 7:32.) Fortunately, not a lot of data moves around during those hours, so very little if anything would have been lost. I have accelerated the frequency of this backup to hourly as an extra precaution.
We didn’t have one of our supporting databases – the WordPress installation that runs our little news blog – included in our backup. We would have been able to recover most of the posts through other means, and it wouldn’t have been a huge deal to lose, but adding that database to the backup regimen we’re already backing up was trivial and was omitted purely as an easily-corrected oversight on my part. This has now been done as well, with a frequency of twelve hours (as it’s very rare we make more than one post in a day, this should be plenty, and we don’t want to fill up the backup drives with endless identical copies.) WordPress has a built-in backup option, but it costs money to use and WordPress is such a small portion of what we do that I’ve never activated it.
So that’s code, email and databases secured, which are the biggest things. Now, on to data files. This one’s a toughie. We store image files that are linked to games, demo report pictures, and user avatars (publisher logos, mostly) and very recently started accepting uploads for print and play files. Because I don’t use them in the development side, they aren’t included in the backups of the code, and because they aren’t stored directly in the database, they aren’t in the database backups either. We were counting on the backups from the remote FTP (the ones that hadn’t successfully run since November 30) to get those, and we didn’t have a backup for the backup yet (more on this below.)
The system administration team at HostDime had a full-system backup of the SAN from February 4 of this year, so no files our users uploaded to the site before then would have been missing. Of the remaining ones, I would have been able to write a script to re-download the missing game and company logo files from BoardGameGeek (which is where we got most of them in the first place), but demo report pictures, PNPs and such uploaded post-February 4 would have been lost. This appears to have been the most damaging potential impact of this event.
Immediately after the server came up, I took a snapshot of the entire images directory, PNP cache and other user-uploaded content to make sure we had a current copy. That, like my code, is now on Google Drive for safe-keeping. We don’t yet have an automated solution for keeping this data backed up, but I’ll do it manually every couple of days until we do.
So, where do we go from here?
At present, we feel pretty good about the backup strategy for the code itself; as I said, it exists in real-time on four devices I control, plus in two different clouds (GitHub and Google Drive), plus the production server itself. That said, the “full server” backup strategy doesn’t differentiate between a code file and an image file, so all the improvements we make to the full server strategy will provide even more security for the code.
The database backup seems to have been pretty effective, but I don’t like how much I’ve held my breath today wondering if it was there, current, and would restore properly, so we’re going to do some more about that as well. First off, as I mentioned, we’ve increased the frequency to hourly and added in the support databases like the WordPress stuff. As with the code, it’ll also get swept up in any full-server backup strategy we employ.
Before this incident, as part of my panic mode a few weeks ago, I provisioned a second server intended to serve as a live backup for the database. Every time someone alters the production database, it would also send that change to the backup server such that it remains current to within a few seconds. In database lingo, we call this replication. Problem is, I didn’t have it fully set up yet, so it wouldn’t have helped us in this event. I had it scheduled for next week.
Once the replicated server is in place, the hourly snapshots will become a tertiary backup to the full-system snapshots HostDime takes for us and the replicated server as the second line of defense. The snapshots will also provide extra options for restore if something goes wrong with the replication, and help protect us from “bad query” results – stuff like accidentally deleting all the games instead of the one I meant to because I typed the wrong query. (Replication doesn’t generally help here, because the replication server will happily execute the bad query you just wrote as well if you don’t catch it fast enough.)
I’m going to look into the feasibility of having the code and image files automatically synchronized with the replication server as well. In theory, we could then promote the replication server to be the new “production” environment in the event of an outage if something like this were to happen again – the service would be much slower (because I can’t afford redundant high-powered equipment) but we’d at least be limping along until the primary came back up.
We’re going to take a two-tiered approach to the full-server backup. First off, we’ve turned on the auto-pruning feature so that the backup process we’re already paying for will start working again and continue to work even after the disk fills up. I’ll be writing a script to automatically download those snapshots, so I can archive them myself locally. Copies I can physically lay my hands on at need give me the most warm and fuzzies. This will provide yet another copy, and permit me to have older snapshots on hand even after the online backup has been purged. This will also give me more visibility in the form of big blaring alarm emails if something goes south with the backup strategy. This will get enormous, but if I have to throw the old ones away after a year or keep a bucket of USB drives around once they get super old, it’s a small price to pay.
I’ll also be employing an “always ask support if I get a nasty-gram from cPanel” approach, so that obtuse errors in log files I struggle to understand will no longer lead to real-world consequences.
Our hosting provider is spinning up a cloud hosting platform with much greater data redundancy (because it’s on a gigantic SAN). Having a server there, if it doesn’t get much traffic, is just a few bucks a month, and at that price, why not? I’ll be looking into using this to set up a second fully redundant server there as soon as that platform is available (HostDime says it’ll be a few weeks.)
In summary, once this all is done, our backup strategy will look like this:
- Fully-replicated servers that can be promoted in the event of an outage. Data-complete to within a few seconds of real-time.
- Already-provisioned smaller VPS (coming soon)
- Cloud virtual server (coming soon-ish?)
- Weekly/monthly HostDime-provided backups
- Will be FTPed to a location HostDime can access for rapid restores
- Will be downloaded and archived locally at IGA HQ
- It’s entirely possible that HostDime will also have backups of the backup server(s) to pull from in an emergency as well.
- Application-level backups
- Code: 4 local machines, Google Drive, GitHub [Frequency: real-time]
- Email: GMail [Frequency: within 1-2 minutes of real-time]
- Databases: Snapshots on Google Drive [Frequency: hourly for primary DB, twice a day for secondary DBs]
- Data files: None. We’re relying on one of the methods in Tiers 1-2. Have any other ideas for something cost-effective? I’m all ears. We’re considering third-party solutions like BackBlaze here.
I’m confident that these strategy improvements will be affordable and relatively simple to set up with help, and will make sure we never experience significant data loss again. And the takeaway for us all: If you don’t have at least three tested backups of something, assume you don’t have any.
We apologize profusely for the inconvenience of all of this, and give you our word that we will use this experience it as an opportunity to learn and improve. We’d also like to thank the staff at HostDime, specifically Kevin, Pat, Aric and Joe, for working more than 18 hours straight to save our bacon.
Update 1: April 5, 2017 6:00PM Eastern
We’re moving fast on making these improvements. Today’s accomplishments so far include:
- Adding the WordPress database to the list of application-level backups
- Increasing frequency of the primary database backup from every 4 hours to every hour
- Correcting the prune operation and adjusting all settings on the primary server backups to be more efficient and effective
- Cleaned out the old backups to make space
- Got a bit of an education on how the backup automation process works
- Increased full-system backup frequency from weekly to daily
- A few days at a time worth of full-system backups will now be archived on the server itself as well as on a remote archive within HostDime
- An initial copy of all 4GB worth of user-uploaded PDF and image files was conducted early this morning so we have at least one temporary backup
- A full system backup test was conducted and successfully performed
I’ll be driving up to the local Best Buy in about 20 minutes and picking up a 3TB external USB drive, which I’ll then be automating downloads of the full-system snapshots to nightly for extra-extra-extra protection and for the ability to retain longer-term archives of data without needing to pay for expensive server space. Once that’s done, our passive backup strategy will be in place, and we can start working on the active (redundant server) setup. That will take a little longer than just copying files, but we’ll let you know as we’ve gotten something up and running.
Thank you to all for your patience and encouragement as we’ve gone through this important, painful step toward being a more “big-league” operation. 😉
Update 2: April 6, 2017 10:07AM
At this time, a new external hard drive has been installed at IGA HQ, and a script running on it is now automatically pulling the nightly full system backups down off the FTP server for long-term archival and retention, ensuring we’ll be able to keep full daily snapshots for nearly a year, and then archive those down to maybe one snapshot a month once they’re than 6 months old or some such.
This is the last piece of our disaster recovery backup strategy, barring any improvements suggested by the community. We’re now 100% confident that we could come back pretty much unscathed from losing the server. The next step is active recovery, which is our real-time redundant servers. Those are going to take a little longer to set up and configure, so we’ll be wrapping up a few projects we had started before this occurred, and then focusing our development efforts on that.