GitLab’s postmortem of the database outage of January 31 which resulted in significant loss of production data pulls no punches, and ought to be essential reading for anyone involved in software development. It has a lot in common with Vivarail’s report into the Kenilworth fire.
One element in the chain of events that led to the database crash raises eyebrows; an attempted hard-delete of the user account of a GitLab employee who had been maliciously flagged for abuse by a troll. It boggles the mind that a system would do such a thing without any human intervention. That’s either a serious coding error or some dangerously naive requirements analysis.
And this is especially damning.
Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing this procedure.
When some important part of a complex system hasn’t been tested thorougly enough, it’s easy to blame the testers. But the blame usually lies higher up the project management chain.