Here's a suggested way to structure a 1 hr meeting:
1 min: Go over the goals of the RCA. Stress that the point is to find systematic flaws that led to the incident being discussed. Blame should be assigned to processes, not people.
4 mins: Go over the timeline of events and the summary of the issue itself
25-30 mins: How/why did this happen?
15-20 mins: How could we have prevented this?
10 mins: Distill and assign action items
The person running the RCA should take notes in a visible way, either in this document itself on a big screen or handwritten on a whiteboard. Try to understand the issue and problem areas as well as possible before discussing how this could have been prevented, though it's natural for the "how/why did this happen" and the "how could we have prevented this" sections to blend.
Summary of Issue
We tagged koa.2, but it wouldn't install. The problem had been reported some time earlier and was already fixed on master.
Relevant Tickets
Timeline of Events
- Jan 17: Fix on master: https://github.com/edx/configuration/pull/6248
- Jan 26: Discourse thread reporting the problem: https://discuss.openedx.org/t/failed-to-install-koa-in-aws-ubuntu-20-04-mysql-client-error/4132
- credible report
- link to master fix
- CI Koa reports:
- Failed: Jan 27: Ansible task name: oraclejdk : Download Oracle Java
- Failed: Jan 28: edxapp : install system packages on which LMS and CMS rely
- Passed: Jan 28
- Passed: Jan 29
- Failed: Feb 1: Please ignore this failure as the app server termination was triggered by me.
- Passed: Feb 2
- Feb 2: Adolfo created BTR-61 in order to fix the Jan 26 reported issue before tagging Koa.
- Feb 9: live-tagging koa.2
- Feb 10: report that koa.2 won't install for anyone following the native installation instructions
- Reported by Pierre
- Ned cherry-picked fix to koa.
- Adolfo tagged koa.2a with the fix.
How did this happen?
- Why didn't we catch this?
- Why was Koa.2 tagged without the fix?
- Why was Koa.2 tagged without testing the tag?
- Why did CI not catch this?
- Why is CI currently not running the exact native installation procedure?
- Because it's running OpenCraft's customization, which didn't catch the database issue.
- Why are we running OpenCraft's customizations?
- Because OpenCraft offered, using their pre-existing sandbox builds. The sandbox builds are what they had already.
- How might we enhance OpenCraft's CI implementation to run the exact native installation procedure?
- CI is difficult to maintain - fails for obscure reasons.
- Why did we believe that CI would catch it?
- Why didn't we have the deltas between the native installation and what the CI was doing?
- Because the CI was never intended to match the native installation closely; large deltas were expected.
- For cost reasons, we are using the same database.
- Because the platform is complex and can result in unanticipated issues.
- Why did we miss addressing Koa open issues before tagging?
- Why didn't we act on the discourse thread?
- We did create a ticket for ourselves.
- Why did we miss fixing BTR-61, which was tagged with Koa?
- How might we use GitHub milestones to avoid missing such tickets in the future?
- Could have used Jira tickets with the koa.2 label.
- Because the current playbook has the mechanics of tagging the repos, but doesn't cover the process for ensuring all needed fixes are in.
- How might we update our tagging playbook to catch this in the future?
- Include concrete checklists for verifying that tickets were completed.
- How might we include severity and priority on open issues?
- Why did it take 2+ weeks to react?
- Discourse → BTR ticket took a week.
- Why are we relying on a single individual to do this?
- How might we have more people helping others in the community in Discourse?
- Why are we using multiple reporting tools? (Discourse, Jira and now GitHub issues)
- Why did we miss that this was a critical issue for the native install, for more than 1 week?
- Because several issues are sometimes open at the same time.
- Why did we not include a fix from master?
- Master moves at a fast pace. We are not looking at each commit. We are looking at security fixes right now.
- Are fixes that are pushed to master not carried over to other branches (named releases)?
- Why did the PR author miss it?
- How might we make use of Conventional Commits and/or Pull Request templates to improve catching this?
- Why did the BTR group miss this?
- Why did the Koa.2 installation break in the first place?
- Why does the BTR group not "Dogfood" the native installation?
- There needs to be a process or reason to do this.
- Why is there no process for this?
- Why is there no reason to do so?
Once you've created a timeline of events, analyze how the incident came to be. Different people like using different methodologies here: some do 5 whys; some do infinite hows (see below for links). The important thing is that you try to understand what systematically went wrong to cause this kind of event to occur. The stress here is on "systematic": RCAs should not blame individuals, explicitly or implicitly. We want to have processes in place that are fault-tolerant, so momentary lapses don't have disastrous consequences. The point of the RCA is to identify weak areas in our process and then to adjust our process accordingly.
How could we have prevented it?
This goes hand in hand with the previous section. If the previous section was about identifying weak areas in our process, this section is about how a different process could have circumvented the incident the RCA is about. Try to avoid things that you can only see with 20/20 hindsight, such as "we didn't have a test for this particular scenario." Often production incidents are the result of very roundabout edge cases, so it would have been impossible to imagine having to write the test that would have failed. Aim instead for suggestions that generalize. Suppose you had an incident where a service went down, which had a cascading effect on services that depended on it. In this case, your suggestion might be something like "when developing features that depend on external services, don't assume they'll always be up, and have tests in place for these kinds of scenarios."
Action Items
- Former user (Deleted): Update the runbook as follows:
- Don't tag the release without manually installing it.
- Ensure all must-have tickets are completed.
- Former user (Deleted): Write and distribute roles
- Write down required roles for BTR
- e.g. Manual QA, Triage issues, Discourse assistance, Tagging repos, Announcing, Reviewing release notes.
- Assign individuals to each role (could be rotatable, per release).
- Sofiane Bebert: Decide and document how BTR will track issues
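
The two runbook checks above (manual install before tagging, and all must-have tickets completed) could be backed by a small pre-tag gate. What follows is only a minimal sketch under stated assumptions: the `Ticket` shape, the `ready_to_tag` helper, the status names, and the sample tickets other than BTR-61 are hypothetical, and it does not talk to Jira or GitHub. The ticket data would have to come from whichever tracker BTR settles on.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical, tracker-agnostic view of an issue."""
    key: str
    labels: frozenset[str]
    status: str


# Assumption: these status names mean "finished" in the chosen tracker.
DONE_STATUSES = frozenset({"Done", "Closed"})


def open_blockers(tickets: list[Ticket], release_label: str) -> list[Ticket]:
    """Return tickets labeled for the release that are not yet finished."""
    return [
        t for t in tickets
        if release_label in t.labels and t.status not in DONE_STATUSES
    ]


def ready_to_tag(
    tickets: list[Ticket], release_label: str, manual_install_ok: bool
) -> tuple[bool, list[Ticket]]:
    """Allow tagging only if the manual install passed and no blockers remain."""
    blockers = open_blockers(tickets, release_label)
    return (manual_install_ok and not blockers), blockers


if __name__ == "__main__":
    # Illustrative data only: BTR-61 is shown as still open at tagging time.
    tickets = [
        Ticket("BTR-61", frozenset({"koa.2"}), "Open"),
        Ticket("EXAMPLE-1", frozenset({"koa.2"}), "Done"),
    ]
    ok, blockers = ready_to_tag(tickets, "koa.2", manual_install_ok=True)
    print("ready to tag:", ok)
    for t in blockers:
        print("blocked by:", t.key)
```

The gate deliberately takes the manual-install result as an explicit input, so a green ticket list alone cannot authorize a tag.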