RCA: koa.2
Summary of Issue
We tagged koa.2, but it wouldn't install. The problem had been reported on Discourse two weeks before tagging, and a fix was already on master.
Relevant Tickets
Timeline of Events
- Jan 17: Fix on master: https://github.com/edx/configuration/pull/6248
- Jan 26: Discourse thread reporting the problem: https://discuss.openedx.org/t/failed-to-install-koa-in-aws-ubuntu-20-04-mysql-client-error/4132
- credible report
- link to master fix
- CI Koa reports:
- Failed: Jan 27: Ansible task name: oraclejdk : Download Oracle Java
- Failed: Jan 28: edxapp : install system packages on which LMS and CMS rely
- Passed: Jan 28
- Passed: Jan 29
- Failed: Feb 1: Please ignore this failure as the app server termination was triggered by me.
- Passed: Feb 2
- Feb 2: Adolfo created BTR-61 in order to fix the Jan 26 reported issue before tagging Koa.
- Feb 9: live-tagging koa.2
- Feb 10: report that koa.2 won't install for anyone following the native installation instructions
- Reported by Pierre
- Ned cherry-picked fix to koa.
- Adolfo tagged koa.2a with the fix.
How did this happen?
- Why didn't we catch this?
- Why was Koa.2 tagged without the fix?
- Why was Koa.2 tagged without testing the tag?
- Why did CI not catch this?
- Why is CI currently not running the exact native installation procedure?
- Because it's running OpenCraft's customization, which didn't catch the database issue.
- Why are we running OpenCraft's customizations?
- Because OpenCraft offered, using their pre-existing sandbox builds. The sandbox builds are what they had already.
- How might we enhance OpenCraft's CI implementation to run the exact native installation procedure?
- CI is difficult to maintain - fails for obscure reasons.
- Why did we believe that CI would catch it?
- Why didn't we have the deltas between the native installation and what the CI was doing?
- Because the CI was not designed to avoid large deltas from the native installation.
- For cost reasons, we are using the same database.
- Because the platform is complex and can result in unanticipated issues.
- Why did we miss addressing Koa open issues before tagging?
- Why didn't we act on the discourse thread?
- We did create a ticket for ourselves.
- Why did we miss fixing BTR-61, which was tagged with Koa?
- How might we use GitHub milestones to not miss such in the future?
- Could have used Jira tickets with the koa.2 label.
- Because the current playbook has the mechanics of tagging the repos, but doesn't cover process for ensuring all needed fixes are in.
- How might we update our tagging playbook to catch this in the future?
- Include concrete checklists for verifying that tickets were completed.
- How might we include severity and priority on open issues?
- Why did it take 2+ weeks to react?
- Discourse → BTR ticket took a week.
- Why are we relying on a single individual to do this?
- How might we have more people helping others in the community in Discourse?
- Why are we using multiple reporting tools? (Discourse, Jira and now GitHub issues)
- Why did we miss that this was a critical issue for the native install, for more than 1 week?
- Because there are several issues at times.
- Why did we not include a fix from master?
- Master moves at a fast pace. We are not looking at each commit. We are looking at security fixes right now.
- Fixes that are pushed to master are not included in fixes to other branches (named releases)?
- Why did the PR author miss it?
- How might we make use of Conventional Commits and/or Pull Request templates to improve catching this?
- Why did the BTR group miss this?
- Why did the Koa.2 installation break in the first place?
- Why does the BTR group not "Dogfood" the native installation?
- There needs to be a process or reason to do this.
- Why is there no process for this?
- Why is there no reason to do so?
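The Conventional Commits question raised above can be made concrete with a small sketch. If commit subjects on master carried a type prefix such as `fix:`, a script could list the fix commits that are candidates for back-porting to a named release branch. The function name `backport_candidates` and the sample subjects are hypothetical; only the `type(scope): description` prefix shape is taken from the Conventional Commits convention, and nothing here reflects tooling the BTR group actually uses.

```python
import re

# Matches a Conventional Commits style subject, e.g. "fix(installer): correct package name".
# "type" is the commit type, "scope" is optional, "breaking" is the optional "!" marker.
SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)


def backport_candidates(subjects, types=("fix",)):
    """Return the commit subjects whose Conventional Commits type is in `types`."""
    candidates = []
    for subject in subjects:
        cleaned = subject.strip()
        match = SUBJECT_RE.match(cleaned)
        if match and match.group("type") in types:
            candidates.append(cleaned)
    return candidates


if __name__ == "__main__":
    # Hypothetical subjects, for example collected with `git log --format=%s`.
    sample = [
        "fix(installer): correct database client package name",
        "feat: add new theme option",
        "Merge pull request from some-branch",
        "fix: typo in playbook",
    ]
    for subject in backport_candidates(sample):
        print(subject)
```

A list like this only surfaces candidates. Someone still has to decide which fixes matter for the named release and cherry-pick them.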
How could we have prevented it?
Action Items
- Update the runbook as follows
- Don't tag the release without manually installing it.
- Ensure all must-have tickets are completed.
- Write and distribute roles
- Write down required roles for BTR
- e.g. Manual QA, Triage issues, Discourse assistance, Tagging repos, Announcing, Reviewing release notes.
- Assign individuals to each role (could be rotatable, per release).
- Sofiane Bebert: Decide and document how BTR will track issues
- Prioritization marking
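
The two runbook items above (manual installation before tagging, and all must-have tickets completed) could be expressed as a pre-tag gate. The following is a minimal sketch under stated assumptions, not existing BTR tooling: the `Ticket` fields, the `must-have` label, the `Done` status, and the `manual_install_ok` flag are all hypothetical, and the ticket keys in the demo are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    key: str                                   # hypothetical tracker key, e.g. "EX-1"
    status: str                                # e.g. "Open" or "Done"
    labels: set = field(default_factory=set)   # e.g. {"koa.2", "must-have"}


def tagging_blockers(tickets, release_label, manual_install_ok,
                     must_have_label="must-have", done_statuses=("Done",)):
    """Return human-readable reasons not to tag the release yet (empty list = go)."""
    blockers = []
    # Runbook item 1: do not tag without a manual installation of the release.
    if not manual_install_ok:
        blockers.append("release candidate has not been manually installed")
    # Runbook item 2: every must-have ticket for this release must be completed.
    for ticket in tickets:
        is_must_have = (release_label in ticket.labels
                        and must_have_label in ticket.labels)
        if is_must_have and ticket.status not in done_statuses:
            blockers.append(f"{ticket.key} is still {ticket.status}")
    return blockers


if __name__ == "__main__":
    # Made-up tickets for illustration only.
    tickets = [
        Ticket("EX-1", "Open", {"koa.2", "must-have"}),
        Ticket("EX-2", "Done", {"koa.2", "must-have"}),
        Ticket("EX-3", "Open", {"koa.2"}),
    ]
    for reason in tagging_blockers(tickets, "koa.2", manual_install_ok=False):
        print("BLOCKED:", reason)
```

Returning a list of reasons rather than a boolean keeps the gate usable as a checklist: whoever tags the release sees exactly what is still outstanding.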