RCA: koa.2

Summary of Issue

We tagged koa.2, but it wouldn't install. The problem had been reported on Discourse two weeks before tagging and was already fixed on master.

Relevant Tickets

  • BTR-61

Timeline of Events

  • Jan 17: Fix on master: https://github.com/edx/configuration/pull/6248
  • Jan 26: Discourse thread reporting the problem: https://discuss.openedx.org/t/failed-to-install-koa-in-aws-ubuntu-20-04-mysql-client-error/4132
    • credible report
    • link to master fix
  • CI Koa reports:
    • Failed: Jan 27: Ansible task name: oraclejdk : Download Oracle Java
    • Failed: Jan 28: edxapp : install system packages on which LMS and CMS rely
    • Passed: Jan 28
    • Passed: Jan 29
    • Failed: Feb 1: Please ignore this failure as the app server termination was triggered by me.
    • Passed: Feb 2
  • Feb 2: Adolfo created BTR-61 to fix the issue reported on Jan 26 before tagging Koa.
  • Feb 9: live-tagging koa.2
  • Feb 10: report that koa.2 won't install for anyone following the native installation instructions
    • Reported by Pierre 
    • Ned cherry-picked the fix to Koa.
    • Adolfo tagged koa.2a with the fix.

How did this happen?

  • Why didn't we catch this?
    • Why was Koa.2 tagged without the fix?
      • Why was Koa.2 tagged without testing the tag?
        • Why did CI not catch this?
          • Why is CI currently not running the exact native installation procedure?
            • Because it's running OpenCraft's customization, which didn't catch the database issue.
              • Why are we running OpenCraft's customizations?
                • Because OpenCraft offered to run CI using their pre-existing sandbox builds, which were what they already had.
                • How might we enhance OpenCraft's CI implementation to run the exact native installation procedure?
          • CI is difficult to maintain: it fails for obscure reasons.
        • Why did we believe that CI would catch it?
          • Why didn't we have the deltas between the native installation and what the CI was doing?
            • Because CI was never intended to match the native installation exactly, so the deltas were not tracked.
              • For cost reasons, we are using the same database.
    • Because the platform is complex and can result in unanticipated issues.
    • Why did we miss addressing Koa open issues before tagging?
      • Why didn't we act on the discourse thread?
        • We did create a ticket for ourselves.
          • Why did we miss fixing BTR-61, which was tagged with Koa?
              • How might we use GitHub milestones to avoid missing such tickets in the future?
              • Could have used Jira tickets with the koa.2 label.
              • Because the current playbook covers the mechanics of tagging the repos, but doesn't cover the process for ensuring all needed fixes are in.
              • How might we update our tagging playbook to catch this in the future?
                • Include concrete checklists for verifying that tickets were completed.
            • How might we include severity and priority on open issues?
        • Why did it take 2+ weeks to react?
          • Discourse → BTR ticket took a week.
            • Why are we relying on a single individual to do this?
              • How might we have more people helping others in the community in Discourse?
            • Why are we using multiple reporting tools? (Discourse, Jira and now GitHub issues)
        • Why did we miss, for more than a week, that this was a critical issue for the native install?
          • Because several issues are often open at the same time.
    • Why did we not include a fix from master?
      • Master moves at a fast pace. We are not looking at each commit. We are looking at security fixes right now.
      • Fixes that are pushed to master are not automatically included in the other branches (named releases).
        • Why did the PR author miss it?
          • How might we make use of Conventional Commits and/or Pull Request templates to improve catching this?
        • Why did the BTR group miss this?
    • Why did the Koa.2 installation break in the first place?
    • Why does the BTR group not "Dogfood" the native installation?
      • There needs to be a process or reason to do this.
      • Why is there no process for this?
      • Why is there no reason to do so? 


How could we have prevented it?

Action Items

  • Former user (Deleted) Update the runbook as follows
    • Don't tag the release without manually installing it.
    • Ensure all must-have tickets are completed.
  • Former user (Deleted) Write and distribute roles
    • Write down required roles for BTR
      • e.g. Manual QA, Triage issues, Discourse assistance, Tagging repos, Announcing, Reviewing release notes.
    • Assign individuals to each role (could be rotatable, per release).
  • Sofiane Bebert Decide and document how BTR will track issues
    • Prioritization marking
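
The runbook items above ("Don't tag the release without manually installing it" and "Ensure all must-have tickets are completed") could be expressed as a simple pre-tag gate. The following is only a minimal sketch under stated assumptions: the Ticket fields, the set of "done" statuses, and the function names are hypothetical and do not reflect the actual Jira schema or any existing BTR tooling. It assumes must-have tickets carry a release label such as koa.2.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """A tracked issue. Field names are illustrative, not a real Jira schema."""
    key: str
    status: str
    labels: set = field(default_factory=set)


# Assumed workflow states that count as "completed".
DONE_STATUSES = {"Done", "Closed", "Resolved"}


def release_blockers(tickets, release_label):
    """Return the keys of tickets labeled for the release that are not done."""
    return [
        t.key
        for t in tickets
        if release_label in t.labels and t.status not in DONE_STATUSES
    ]


def ready_to_tag(tickets, release_label, manual_install_ok):
    """Allow tagging only if a manual install passed and no blockers remain."""
    return manual_install_ok and not release_blockers(tickets, release_label)
```

With made-up tickets, release_blockers([Ticket("EXAMPLE-1", "Open", {"koa.2"})], "koa.2") would list EXAMPLE-1, and ready_to_tag would refuse the tag until that ticket is closed and the manual install has been confirmed.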