RCA: koa.2

RCA: koa.2

Summary of Issue

We tagged koa.2, but it wouldn't install.  The problem had been reported for some time, and already fixed on master.

Relevant Tickets

Timeline of Events

  • Jan 17: Fix on master: https://github.com/edx/configuration/pull/6248

  • Jan 26: Discourse thread reporting the problem: https://discuss.openedx.org/t/failed-to-install-koa-in-aws-ubuntu-20-04-mysql-client-error/4132

    • credible report

    • link to master fix

  • CI Koa reports:

    • Failed: Jan 27: Ansible task name: oraclejdk : Download Oracle Java

    • Failed: Jan 28: edxapp : install system packages on which LMS and CMS rely

    • Passed: Jan 28

    • Passed: Jan 29

    • Failed: Feb 1: Please ignore this failure as the app server termination was trigged by me.

    • Passed: Feb 2

  • Feb 2: Adolfo created BTR-61 in order to fix the Jan 26 reported issue before tagging Koa.

  • Feb 9: live-tagging koa.2

  • Feb 10: report that koa.2 won't install for anyone following the native installation instructions

    • Reported by Pierre 

    • Ned cherry-picked fix to koa.

    • Adolfo tagged koa.2a with the fix.

How did this happen?

  • Why didn't we catch this?

    • Why was Koa.2 tagged without the fix?

      • Why was Koa.2 tagged without testing the tag?

        • Why did CI not catch this?

          • Why is CI currently not running the exact native installation procedure?

            • Because it's running OpenCraft's customization, which didn't catch the database issue.

              • Why are we running OpenCraft's customizations?

                • Because OpenCraft offered, using their pre-existing sandbox builds. The sandbox builds are what they had already.

                • How might we enhance OpenCraft's CI implementation to run the exact native installation procedure.

          • CI is difficult to maintain - fails for obscure reasons.

        • Why did we believe that CI would catch it?

          • Why didn't we have the deltas between the native installation and what the CI was doing?

            • Because it isn't intended to not have large deltas.

              • For cost reasons, we are using the same database.

    • Because the platform is complex and can result in unanticipated issues.

    • Why did we miss addressing Koa open issues before tagging?

      • Why didn't we act on the discourse thread?

        • We did create a ticket for ourselves.

          • Why did we miss fixing BTR-61, which was tagged with Koa?

            • How might we use GitHub milestones to not miss such in the future?

              • Could have used Jira tickets with the koa.2 label.

            • Because the current playbook has the mechanics of tagging the repos, but doesn't cover process for ensuring all needed fixes are in.

              • How might we update our tagging playbook to catch this in the future?

                • Include concrete checklists for verifying that tickets were completed.

            • How might we include severity and priority on open issues?

        • Why did it take 2+ weeks to react?

          • Discourse → BTR ticket took a week.

            • Why are we relying on a single individual to do this?

              • How might we have more people helping others in the community in Discourse?

            • Why are we using multiple reporting tools? (Discourse, Jira and now GitHub issues)

        •  Why did we miss that this was a critical issue for the native install, for more than 1 week?

          • Because there are several issues at times.

    • Why did we not include a fix from master?

      • Master moves at a fast pace. We are not looking at each commit. We are looking at security fixes right now.

      • Fixes that are pushed to master are not included in fixes to other branches (named releases)?

        • Why did the PR author miss it?

          • How might we make use of Conventional Commits and/or Pull Request templates to improve catching this?

        • Why did the BTR group miss this?

    • Why did the Koa.2 installation break in the first place?

    • Why does the BTR group not "Dogfood" the native installation?

      • There needs to be a process or reason to do this.

      • Why is there no process for this?

      • Why is there no reason to do so? 

 

 

How could we have prevented it?

Action Items

@Former user (Deleted) Update the runbook as follows
Don't tag the release without manually installing it.
Ensure all must-have tickets are completed.
@Former user (Deleted) Write and distribute roles
Write down required roles for BTR
e.g. Manual QA, Triage issues, Discourse assistance, Tagging repos, Announcing, Reviewing release notes.
Assign individuals to each role (could be rotatable, per release).
@Sofiane Bebert (Unlicensed) Decide and document how BTR will track issues
Prioritization marking