GitLab’s Database Outage Postmortem

GitLab’s postmortem of the database outage of January 31, which resulted in significant loss of production data, pulls no punches and ought to be essential reading for anyone involved in software development. It has a lot in common with Vivarail’s report into the Kenilworth fire.

One element in the chain of events that led to the database crash raises eyebrows: an attempted hard-delete of the user account of a GitLab employee who had been maliciously flagged for abuse by a troll. It boggles the mind that a system would do such a thing without any human intervention. That’s either a serious coding error or some dangerously naive requirements analysis.

And this is especially damning:

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing this procedure.

When some important part of a complex system hasn’t been tested thoroughly enough, it’s easy to blame the testers. But the blame usually lies higher up the project management chain.


4 Responses to GitLab’s Database Outage Postmortem

  1. Michael says:

    The computer room where this happened has been decommissioned for some years now, but once upon a time there was a remote site with a lone support engineer who had to busk his way through all the procedures and problems using written instructions supplied from the main data center.

    One of these scripts was the weekly backup to tape. A stack of 10 tapes was provided.

    The time came for a restore and the tape was found to be corrupt. The other nine tapes were blank.

    The written instructions did not say put the backup tape at the bottom of the stack when you take it out of the drive. The same tape had been used since the computer room was commissioned and it had worn out.

  2. Tim Hall says:

    Ouch. Sounds like a serious training issue there.

  3. Michael says:

    Remember the engineer in question did not act against his training.

    The training given and the standing instructions made an incorrect assumption: that a different tape would be used for each backup without a specific instruction to do so being given.

  4. Tim Hall says:

    You do have to wonder why the engineer never asked what the other nine tapes were for. If it never occurred to him to ask, it implies a worrying level of incuriosity. If he was afraid to ask, it doesn’t cast the management culture in a very good light.