Here's a suggested way to structure a 1 hr meeting: 

1 min: go over the goals of the RCA. Stress that the point is to find systematic flaws that led to the incident under discussion. Blame should be assigned to processes, not people. 

4 mins: go over the timeline of events and the summary of the issue itself 

25-30 mins: How/why did this happen 

15-20 mins: How could we have prevented this

 10 mins: Distill and assign action items 

The person running the RCA should take notes in a visible way, either in this document itself on a big screen or handwritten on a whiteboard. Try to understand the issue and problem areas as well as possible before discussing how it could have been prevented, though it's natural for the "how/why did this happen" and the "how could we have prevented this" sections to blend.

Summary of Issue

We tagged koa.2, but it wouldn't install. The problem had been reported for some time and had already been fixed on master.

Relevant Tickets

Timeline of Events

How did this happen?


Once you've created a timeline of events, analyze how the incident came to be. Different people like using different methodologies here: some do 5 whys; some do infinite hows (see below for links). The important thing is that you try to understand what systematically went wrong to cause this kind of event to occur. The emphasis here is on "systematic": RCAs should not blame individuals, explicitly or implicitly. We want processes in place that are fault tolerant, so momentary lapses don't have disastrous consequences. The point of the RCA is to identify weak areas in our process and then to adjust our process accordingly. 

 

How could we have prevented it?

This goes hand in hand with the previous section. If the previous section was about identifying weak areas in our process, this section is about how a different process could have circumvented the incident the RCA is about. Try to avoid things that are only visible in 20/20 hindsight, like "we didn't have a test for this particular scenario." Production incidents are often the result of very roundabout edge cases, so it would have been unreasonable to expect someone to imagine the exact test that would have failed. Instead, try to be more specific. Suppose you had an incident where a service went down, which had a cascading effect on the services that depended on it. In that case, your suggestion might be something like "when developing features that depend on external services, don't assume they'll always be up, and have tests in place for these kinds of scenarios" (see the sketch below).
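
As a rough, hypothetical illustration of that last suggestion (the client, function, and SKU names below are made up, not from our codebase): write callers that treat the external dependency as fallible, and keep a test that exercises the outage path.

```typescript
// Hypothetical sketch: a caller that assumes the external inventory
// service can be down, plus a check that the outage path still behaves.
// None of these names come from our codebase.
import assert from "node:assert";

interface InventoryClient {
  getStock(sku: string): Promise<number>;
}

// Degrade gracefully instead of assuming the dependency is always up.
async function stockOrUnknown(
  client: InventoryClient,
  sku: string
): Promise<number | "unknown"> {
  try {
    return await client.getStock(sku);
  } catch {
    // The outage stops here; callers render an "unknown" state rather
    // than cascading the failure to everything downstream.
    return "unknown";
  }
}

// Simulate the dependency being down and assert we still return something sane.
const downClient: InventoryClient = {
  getStock: async () => {
    throw new Error("connection refused");
  },
};

stockOrUnknown(downClient, "sku-123").then((result) => {
  assert.strictEqual(result, "unknown");
  console.log("outage path handled");
});
```

In a real test suite this would live alongside the feature's other tests and stub the client with whatever mocking setup the project already uses; the point is only that the "dependency is down" case is covered on purpose, not discovered in production.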

Action Items