Upgrade Project Runbook
Overview
Open edX is a large software project with many dependencies that periodically need to be upgraded. This runbook outlines the steps that should be taken each time we embark on a large software upgrade or maintenance project, to minimize the total amount of effort required to get it done and maximize the likelihood of completing the project on schedule.
Identify the Need Early
The first step is to realize an upgrade is even needed. We track the major Open edX dependencies which trigger recurring upgrade projects in this spreadsheet (which includes a link at the bottom to the code that generates it). The data is reviewed quarterly by 2U’s Arbi-BOM squad to identify upcoming upgrade needs, but anyone is welcome to create an issue or pull request to flag a missing dependency or stale data. The more complete this is, the better we can plan ahead. The official website of each software project is the authoritative source of information about support window end dates, although endoflife.date is also a very useful resource.
It’s a good idea to create a GitHub Issue for the upgrade as soon as the need for a major upgrade is identified, before even finalizing the target completion date; then it can serve as a home for discussion of the upgrade and links to relevant documentation as it gets written. To do this:
Create the issue in the platform-roadmap repository, and give it the “maintenance” label.
After the issue creation form is submitted, edit the issue’s description to add the checklist at the bottom of this page.
Add the new issue to the Backlog column of the https://github.com/orgs/openedx/projects/4/views/1 project. Set any of the project’s custom fields whose values are known; common choices may include:
“Proposed by” - your organization’s name
“Platform map - Super Level” - Architecture/Platform
“Strategy” - Platform
“Type” - Maintenance
Schedule the Completion Date
Next is to decide when to start and complete the upgrade. To decide the end date of the upgrade project:
Find the date when support of the version currently in use ends.
Find the last date prior to that when the branches will be cut for a new Open edX release, according to the Open edX Release Schedule.
If there are compelling reasons to upgrade earlier, pick the branch cut date for an earlier release as appropriate. (For example, major React versions get security patches indefinitely, but more and more related packages start requiring newer versions to work correctly.)
Set the roadmap issue’s milestone to be the corresponding release name.
If other upgrade projects are slated for the same Open edX release, stagger the completion dates if at all possible. We don’t want to be struggling to complete 3 major upgrades concurrently with wrapping up work on the release. Add an explicit completion date to the roadmap issue, and explain why it was chosen.
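The date selection above can be sketched in a few lines of Python. This is only an illustration: the function name `pick_target_release`, the release names, and all dates are hypothetical, and real values should come from the dependency's official support schedule and the Open edX Release Schedule.

```python
from datetime import date
from typing import Optional


def pick_target_release(support_ends: date, branch_cuts: dict[str, date]) -> Optional[str]:
    """Return the release with the latest branch cut before support ends."""
    candidates = {name: cut for name, cut in branch_cuts.items() if cut < support_ends}
    if not candidates:
        # No upcoming release is cut in time; the upgrade is already overdue.
        return None
    return max(candidates, key=candidates.get)


# Hypothetical branch cut dates, for illustration only.
branch_cuts = {
    "Release A": date(2030, 4, 9),
    "Release B": date(2030, 10, 9),
    "Release C": date(2031, 4, 9),
}

# Hypothetical end-of-support date for the version currently in use.
print(pick_target_release(date(2031, 1, 15), branch_cuts))
```

The result is only a starting point: if there are reasons to upgrade earlier, or other upgrades are slated for the same release, the target should be adjusted by hand as described above.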
Select an Orchestration Team
Once there’s a good idea of “what” and “when”, the next step is to figure out “who”. A team (not an individual) should be selected to plan for the upgrade and make sure it gets completed on time. Some selection criteria:
The orchestration team should be doing full-time work related to the software being upgraded. Part-time assistance with the upgrade from other developers is actively encouraged, but the orchestration team should be able to allocate large amounts of their time to keeping the project on track for timely completion.
The team must have some spare capacity in the near future for the next few steps, and be able to dedicate a lot more time to the upgrade as the deadline approaches.
The team must have some expertise in the software being upgraded, or be able to develop that expertise at the start of the project. This is needed in order to create an efficient implementation plan and write guidance for other teams that will need to do work for the upgrade.
The team should not also be working on another major project with a similar due date.
Document in the roadmap issue which team will be orchestrating the upgrade to avoid future confusion about who is responsible for this.
Determine Scope of Impact
Now that we have a target date and people dedicated to working on the upgrade, we can determine the scope of the upgrade in order to notify those who will be impacted by it. The orchestration team should perform this step.
Find or create a repo health check to identify all the repositories using the software to be upgraded.
Create an issue for the upgrade in each impacted repository, and add it to the Todo column of the https://github.com/orgs/openedx/projects/51/views/1 board.
Keep the description minimal; it should mainly link to the upgrade issue in the roadmap. Dates, instructions, etc. are likely to be fleshed out and/or changed later and we don’t want to have to update them in dozens of issues.
Create a “Project” field value for the upgrade project, and specify it in the Table view for each created issue.
Specify the orchestration team as the “Source” in the Table view.
Specify the owner in the Table view if you can easily identify it. (2U employees and Arbisoft contractors can already do this via a private ownership spreadsheet, but I think Backstage will be the future home of this information for others.)
Leave the “Owner Does” field set to “TBD” (To Be Determined) for now.
Create a task list in the roadmap issue to track all the child issues for individual repositories.
Create a page under Upgrades for documentation related to the upgrade. Add links from this page to the roadmap issue and vice versa.
Read the release notes for the version being upgraded to (and any other versions skipped over during the upgrade), and document (under the Confluence page created in the prior step) the changes believed to be most problematic and/or interesting.
Note that we want to create the per-repository issues pretty early in the process, as they serve a few important roles:
They give teams advance notice that they need to allocate some time for the work. More detailed communications about the expected level of effort and available automation will come later, but even just this heads up that “you need to leave some room in your schedule for this” is useful.
They allow teams to specify their preferred level of involvement in upgrading each repository. Some will prefer to be hands-off and let the orchestration team handle as much as practical, others will have reasons why they want to do most of the upgrade work themselves in a particular repository.
They provide an official forum for discussing repository-specific aspects of the upgrade project.
They allow the https://github.com/orgs/openedx/projects/51/views/1 board to serve as a status dashboard for the upgrade project, by filtering to just the issues in that project.
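For Python dependencies, the first step of this section (identifying the repositories that use the software being upgraded) might look roughly like the sketch below. It is a standalone illustration, not the actual repo health framework. It assumes the repositories are already cloned side by side under one local directory and that dependencies are declared in `requirements*.txt` files or under a `requirements/` directory. The function names are hypothetical, and package name normalization (hyphens versus underscores) is not handled.

```python
import re
from pathlib import Path


def declares_dependency(repo_dir: Path, package: str) -> bool:
    """Return True if a requirements file in repo_dir names the package."""
    # Match the name at the start of a line, followed by a version specifier,
    # extras, a direct reference, an environment marker, a comment, or the end
    # of the line, so that "django" does not also match "django-filter".
    pattern = re.compile(
        rf"^{re.escape(package)}[ \t]*($|[=<>!~\[;#@])",
        re.IGNORECASE | re.MULTILINE,
    )
    candidates = [
        *repo_dir.glob("requirements*.txt"),
        *repo_dir.glob("requirements/**/*.in"),
        *repo_dir.glob("requirements/**/*.txt"),
    ]
    return any(
        pattern.search(path.read_text(encoding="utf-8", errors="ignore"))
        for path in candidates
        if path.is_file()
    )


def impacted_repos(checkouts_root: Path, package: str) -> list[str]:
    """List the checkouts under checkouts_root that declare the package."""
    return sorted(
        repo.name
        for repo in checkouts_root.iterdir()
        if repo.is_dir() and declares_dependency(repo, package)
    )
```

Calling `impacted_repos(Path("~/openedx").expanduser(), "django")` (the path is a placeholder) would return the names of the checkouts that need a per-repository issue. Dependencies on databases, operating systems, or JavaScript packages need a different detection strategy.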
Automate As Much As Practical
Now that it’s clear what needs to be done, it’s time for the orchestration team to write automation for the project wherever the time savings of avoiding manual work outweigh the time to implement the automation. Good candidates include:
Find or write codemods to automatically make some of the necessary changes to source code. See Codemods and Other Upgrade Automation for guidance on tools we’ve found or written already to do this for some of our dependencies.
Write configuration file modification scripts to automatically make appropriate updates to testing matrices and other metadata in tox.ini, GitHub Actions workflows, setup.py, etc. We keep these in the repo-tools repository; you can create a new directory there and copy scripts from recent upgrades to use as starting points.
Create repo health checks and dashboard(s) to automatically detect if key milestones (like Trove classifiers for appropriate Python/Django version support) in the upgrade have been achieved in a repository, so an auto-updated dashboard can be created instead of manual wiki table updates. Consider also creating a dashboard that runs these checks on the repositories of other software we depend on, to track how many of them still aren’t ready for the upgrade; this is otherwise a very manual, time-consuming task for some types of upgrades (like Python or Django).
Create a new view in the https://github.com/orgs/openedx/projects/51/views/1 board that filters down to just issues in this upgrade project.
Document instructions and available automation for those who will be doing work to prepare repositories for the upgrade. Link to this document in the roadmap issue.
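As an illustration of the milestone checks mentioned above, the following sketch tests whether a repository checkout declares a given Trove classifier. It is a simplified, text-based check rather than an existing repo health check. The names `has_classifier` and `readiness_report` are hypothetical, and it assumes classifiers appear literally in `setup.py`, `setup.cfg`, or `pyproject.toml`.

```python
import re
from pathlib import Path

# Files where Python packages commonly declare Trove classifiers.
METADATA_FILES = ("setup.py", "setup.cfg", "pyproject.toml")


def has_classifier(repo_dir: Path, classifier: str) -> bool:
    """Return True if a packaging metadata file in repo_dir lists the classifier."""
    # The negative lookahead stops "... :: 3.1" from matching "... :: 3.11".
    pattern = re.compile(re.escape(classifier) + r"(?![\w.])")
    for name in METADATA_FILES:
        path = repo_dir / name
        if path.is_file() and pattern.search(
            path.read_text(encoding="utf-8", errors="ignore")
        ):
            return True
    return False


def readiness_report(repo_dirs: list[Path], classifier: str) -> dict[str, bool]:
    """Map each repository name to whether it declares the classifier."""
    return {repo.name: has_classifier(repo, classifier) for repo in sorted(repo_dirs)}
```

A report such as `readiness_report(checkouts, "Programming Language :: Python :: 3.12")`, where `checkouts` is a list of local clone paths, could feed a dashboard row per repository. A classifier only shows that support is claimed, so it complements passing CI runs on the new version rather than replacing them.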
Clarify Distribution of Work
Now that we’ve done as much as we reasonably can to automate the upgrade, we need to ask the teams owning repositories impacted by the upgrade what they want their role in the upgrade to be. The orchestration team should send an announcement (as described in Upgrade-Related Announcements) asking teams to read the upgrade instructions and select one of the Upgrade Service Levels for each impacted repository. These choices should be specified in the “Owner Does” field of the corresponding issue on the https://github.com/orgs/openedx/projects/51/views/1 board (this is easiest to do in the Table view).
Any external dependencies which still need updates to support the upgrade should have issues created in https://github.com/openedx/public-engineering with the “help wanted” and “maintenance” labels added. These issues should link to the roadmap issue for the upgrade, the Handling Outdated Dependencies page, and the upgrade instructions document, and should be added to the https://github.com/orgs/openedx/projects/51/views/1 board. Once these are created, the orchestration team should ask the broader Open edX developer community for assistance with these issues. These can be completed independently of the other upgrade work if done early enough, but may have a long turnaround time depending on the availability of upstream maintainers, so we really want to encourage early action on these. And the calendar time needed to complete these dependency upgrades is shortest if we have a lot of people reaching out in parallel to get this work started, rather than queuing all the upgrade communications on a small team of people.
If the deployment will require nontrivial operations work, there should also be tickets for the Build-Test-Release Working Group and the SRE team at 2U (and any other organization publicly committed to releasing from the main development branches) to prepare for any necessary database upgrades, etc. These typically aren’t needed for Django upgrades, but are for MongoDB, MySQL, Ubuntu, etc. These should also go on the https://github.com/orgs/openedx/projects/51/views/1 board.
Update the Code
Now that we’ve discovered or written any appropriate upgrade automation, documented how to perform the upgrade, and distributed the work appropriately, it’s time to actually update the code. The initial focus should be on external dependencies and Open edX repositories at the end of the dependency chain (with few or no dependencies of their own which also need to be upgraded). Services can only be fully updated once all of their dependencies have been updated, although some automated and manual work to replace deprecated code usage can be started earlier. Owning teams and the orchestration team work in parallel to update Open edX code, while Core Committers and other Open edX community developers use the public-engineering issues to work with upstream maintainers of our external dependencies to get those updated.
If new information is discovered that could help with other parts of the upgrade (solutions for tricky parts of the upgrade, new codemods for handling some of the code updates, unanticipated problem points to watch out for, etc.), the project documentation should be updated and the changes should be announced to other participants in the upgrade project.
The status of issues on the https://github.com/orgs/openedx/projects/51/views/1 board should be kept up to date as work progresses; one person on the orchestration team should be assigned ultimate responsibility for making sure this happens, although they are free to procure assistance from other team members and/or a project manager as appropriate. The project view on that board and the repository health dashboards provide a relatively real-time view of the status of the overall project, minimizing the need for manual dashboard creation in Confluence or elsewhere.
Deploy the Upgrade
As soon as a package is updated to support both the old and new versions of the major dependency being upgraded, a new release should be made to allow other Open edX code to start using it. As soon as the code for an entire service and its dependencies is ready for the upgrade, and any required SRE work has been completed, organizations like 2U that deploy directly from the main development branches should attempt the upgrade. There’s typically no need to wait until all services are ready before upgrading the first of them; we prefer to roll the changes out incrementally.
If the target date is getting close and some services seem at risk of not making the deadline, the orchestration team should work with the owning team(s) to make sure sufficient assistance is made available to get them done.
Make sure the deployment process gets documented for the benefit of site operators running named releases of Open edX, who will typically be deploying the updated code 1-3 months later. (Or in some cases, a few years later, after the people involved in the upgrade have moved on or forgotten much of the context and details.)
Clean Up
Once the upgrade is successfully coded and deployed, there’s usually still some cleanup work left to do:
Update the support windows spreadsheet to correctly indicate the version now used.
Announce the successful completion of the upgrade!
Make sure any deployment instructions make it into the release notes of the next Open edX named release. Communicate with the Build-Test-Release Working Group to make sure this gets done satisfactorily.
Remove any CI matrix entries for no-longer-supported versions of the dependency. For packages, this typically can’t be done until all of the services using them have completed the upgrade. Services are more free to do this relatively soon after the upgrade is successfully deployed.
Remove the https://github.com/orgs/openedx/projects/51/views/1 board view created for the upgrade.
Stop running repo health checks which are no longer relevant. We may just want to skip the checks rather than deleting them, to make it easier to reuse them for the next similar upgrade.
Mark the roadmap issue for the upgrade as complete.
Do a retrospective meeting + asynchronous conversation to collect feedback on what went well, what could have been better, and what we can do even better next time. Update this document and the pages it links to if appropriate, and write and assign issues for any necessary followup work.
Move the root Confluence page for this upgrade’s docs under Past Upgrades.
If any external dependencies were discovered to be unsatisfactorily maintained, schedule work to replace them or (if necessary) take over maintenance of them ourselves. See Handling Outdated Dependencies for more details on this.
Appendix: Roadmap Issue Checklist
Add the following block of Markdown to the description of the upgrade’s main roadmap issue, and check items off as they are completed.
Please use the following checklist to perform the upgrade according to the [Upgrade Project Runbook](https://openedx.atlassian.net/wiki/spaces/AC/pages/3660316693/Upgrade+Project+Runbook), distilled from lessons learned during previous major upgrade projects. See the full runbook for more details on each step.
```[tasklist]
### Tasks
- [ ] [Create a roadmap issue](https://github.com/openedx/platform-roadmap/issues/new/choose) in the platform-roadmap repository
- [ ] Add this checklist to the roadmap issue's description
- [ ] Add the "maintenance" label to the roadmap issue
- [ ] Add the roadmap issue to the Backlog column of the [Open edX Roadmap](https://github.com/orgs/openedx/projects/4/views/1) project
- [ ] Set appropriate values for the roadmap project's custom fields for the issue (especially "Proposed by", "Platform map - Super Level", "Strategy", and "Type")
- [ ] Set an appropriate release milestone for the roadmap issue
- [ ] Add an explicit target completion date to the roadmap issue description, and explain there why it was chosen
- [ ] Select an orchestration team
- [ ] Name the orchestration team in the roadmap issue description
- [ ] Create a repo health check to identify most/all of the repositories impacted by the upgrade (and ideally, whether or not the upgrade is believed to be complete)
- [ ] Create a new value for the "Project" field in the [Maintenance](https://github.com/orgs/openedx/projects/51/views/1) project board for this upgrade project
- [ ] Create a new view in the Maintenance project board that filters down to only the issues in this upgrade project
- [ ] Create an issue in each impacted repository for the upgrade, and add it to the Todo column of the [Maintenance](https://github.com/orgs/openedx/projects/51/views/1) project board; specify at least the "Project" and "Source" field for each issue (and "Owner" also if you're a 2U or Arbisoft employee)
- [ ] Create a [task list](https://docs.github.com/en/issues/tracking-your-work-with-issues/about-tasklists) in the roadmap issue listing all of the impacted repository issues
- [ ] Create a page under [Upgrades](https://openedx.atlassian.net/wiki/spaces/AC/pages/1165395730) in Confluence for documentation related to the upgrade
- [ ] Add a link to the Confluence page from the roadmap issue
- [ ] Add a link to the roadmap issue from the Confluence page
- [ ] Document in a Confluence child page the changes believed to be most problematic and/or interesting about the upgrade
- [ ] Create a ticket to determine the appropriate amount of automation (codemods, repo health checks, etc.) to create for the upgrade
- [ ] Perform the automation discovery work and write upgrade instructions for all project participants in a Confluence child page
- [ ] Link to the upgrade instructions from the roadmap issue
- [ ] Send an [announcement](https://openedx.atlassian.net/wiki/spaces/AC/pages/3702325257/Upgrade-Related+Announcements) of the upgrade, asking code maintainers to read the upgrade instructions and select an upgrade service level for each impacted repository in its corresponding issue
- [ ] Create issues in [public-engineering](https://github.com/openedx/public-engineering) for each external dependency which still needs code changes and/or a release to support the upgrade
- [ ] Add the "help wanted" and "maintenance" labels to each public-engineering issue created above
- [ ] Add each public-engineering issue created above to the Maintenance project board
- [ ] Set appropriate values for the Maintenance project board custom fields for each added public-engineering issue
- [ ] Ask the Open edX developer community (especially Core Committers) for assistance with the added public-engineering issues
- [ ] Create [Build-Test-Release Working Group](https://github.com/orgs/openedx/projects/28) and/or 2U SRE tickets if they will need to do work for the upgrade
- [ ] Complete and update all of the created implementation tickets
- [ ] Deploy all of the updated services
- [ ] Update the [support windows spreadsheet](https://docs.google.com/spreadsheets/d/11DheEtMDGrbA9hsUvZ2SEd4Cc8CaC4mAfoV8SVaLBGI/edit#gid=195838733) to correctly indicate the version now used
- [ ] [Announce](https://openedx.atlassian.net/wiki/spaces/AC/pages/3702325257) the successful completion of the upgrade
- [ ] Make sure any deployment instructions make it into the release notes of the next Open edX named release (collaborate with the BTR WG)
- [ ] Remove any CI matrix entries for no-longer-supported versions of the dependency
- [ ] Remove the Maintenance project board view/tab created for the upgrade
- [ ] Stop running repo health checks related to the upgrade which are no longer relevant
- [ ] Schedule and run a retrospective meeting about the upgrade
- [ ] Update the [Upgrade Project Runbook](https://openedx.atlassian.net/wiki/spaces/AC/pages/3660316693/Upgrade+Project+Runbook) based on retrospective findings, if appropriate
- [ ] Mark the roadmap issue for the upgrade as complete
- [ ] Move the root Confluence page for this upgrade’s docs under [Past Upgrades](https://openedx.atlassian.net/wiki/spaces/AC/pages/1883865104)
- [ ] Ticket work to replace or take over dependencies which were found during the upgrade to be inadequately maintained
```