
FHIR Dispatches:
Smile CDR and HAPI Product Blog

GitLab shows us exactly how to handle an outage

Published: 02 Feb 2017 at 01:37
By: James
 

I love GitLab. Let’s get that out of the way.

Back when I first joined the HAPI project, we were using CVS for version control, hosted on SourceForge. SourceForge was at that point a pretty cool system. You got free project hosting for your open source project, a free website, and shell access to a server so you could run scripts, edit your raw website, and do whatever else you needed to do. That last part has always amazed me; I’ve always wondered what lengths SourceForge must have had to go to in order to keep that system from being abused.

Naturally, we eventually discovered GitHub and happily moved over there – and HAPI FHIR remains a happy resident of GitHub. We’re now in the process of migrating the HAPI HL7 v2.x codebase over to a new home on GitHub, too.

Along comes GitLab

The Smile CDR team discovered GitLab about a year ago. We quickly fell in love: easy self-hosting, a UI that feels familiar to a GitHub user yet somehow slightly more powerful in each part you touch, and a compelling set of features in the enterprise edition as well once you are ready for them.

On Tuesday afternoon, Diederik noticed that GitLab was behaving slowly. I was curious about it since GitLab’s @gitlabstatus Twitter account mentioned unknown issues affecting the site. As it turned out, their issues went from bad to better, and then to much worse. Ultimately, they wound up being unavailable for all of last night and part of this morning.

A terrible day for them!

GitLab’s issues were slightly hilarious but also totally relatable to anyone who has built and deployed big systems for any length of time. TechCrunch has a nice writeup of the incident if you want the gory details. Let’s just say they had slowness problems caused by a user abusing the system, and in trying to recover from that, a sysadmin accidentally deleted a large amount of production data. He thought he was in a shell on one (bad) node and just removing a useless empty directory, but he was actually in a shell on the (good) master node.

I read a few meltdowns about this on reddit today, calling the sysadmin inexperienced, inept, or worse, but I also saw a few people saying something that resonated with me much more: if you’ve never made a mistake on a big complicated production system, you’ve probably never worked on a big complicated production system.

These things happen. The trick is being able to recover from whatever has gone wrong, no matter how bad things have gotten.

An exercise in good incident management

This is where GitLab really won me over. Check their Twitter for yourself. There was no attempt to mince words. GitLab engineers were candid about what had happened from the second things went south.

GitLab opened a publicly readable Google Doc where all of the notes of their investigation could be read by anyone wanting to follow along. When it became clear that the recovery effort was going to be long and complicated, they opened a YouTube live stream of a conference bridge with their engineers chipping away at the recovery.

They even opened a live chat with the stream so you could comment on their efforts. Watching it was great. I’ve been in their position many times in my life: tired from being up all night trying to fix something, and sitting on an endless bridge where I’m fixing one piece, waiting for others to fix theirs, and trying to keep morale up as best I can. GitLab’s engineers did this, and they did it with cameras running.

So this is the thing: I bet GitLab will be doing a lot of soul-searching in the next few days, and hopefully their tired engineers will get some rest soon. In the end, the inconvenience of this outage will be forgotten, but I’m sure this won’t be the last time I point to the way they handled a critical incident with complete transparency and, in doing so, set my mind at ease that things were under control.
