Overview

Open edX is a large software project with many dependencies that periodically need to be upgraded. This runbook outlines the steps that should be taken each time we embark on a large software upgrade or maintenance project, to minimize the total amount of effort required to get it done and maximize the likelihood of completing the project on schedule.

Identify the Need Early

The first step is to recognize that an upgrade is needed. We track the major Open edX dependencies which trigger recurring upgrade projects in this spreadsheet (which includes a link at the bottom to the code that generates it). The data is reviewed quarterly to identify upcoming upgrade needs, but anyone is welcome to create an issue or pull request to flag a missing dependency or stale data. The more complete this is, the better we can plan ahead. The official website of each software project is the authoritative source of information about support window end dates, although endoflife.date is also a very useful resource.

It’s a good idea to create a GitHub Issue for the upgrade as soon as the need for a major upgrade is identified, before even finalizing the target completion date; then it can serve as a home for discussion of the upgrade and links to relevant documentation as it gets written. To do this:

  • Create the issue in the platform-roadmap repository, and give it the “maintenance” label.

  • Add the new issue to the Backlog column of the https://github.com/orgs/openedx/projects/4/views/1 project. Set any of the project’s custom fields whose values are known; common choices may include:

    • “Proposed by” - your organization’s name

    • “Platform map - Super Level” - Architecture/Platform

    • “Strategy” - Platform

    • “Type” - Maintenance

Schedule the Completion Date

Next is to decide when to start and complete the upgrade. To decide the end date of the upgrade project:

  1. Find the date when support of the version currently in use ends.

  2. Find the last date prior to that when the branches will be cut for a new Open edX release, according to the Open edX release schedule.

  3. If there are compelling reasons to upgrade earlier, pick the branch cut date for an earlier release as appropriate. (For example, major React versions get security patches indefinitely, but more and more related packages start requiring newer versions to work correctly.)

  4. Set the roadmap issue’s milestone to be the corresponding release name.

  5. If other upgrade projects are slated for the same Open edX release, stagger the completion dates if at all possible. We don’t want to be struggling to complete 3 major upgrades concurrently with wrapping up work on the release. Add an explicit completion date to the roadmap issue, and explain why it was chosen.
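Steps 1 and 2 above amount to picking the latest branch cut that falls before the end-of-support date. A minimal sketch in Python, assuming the dates have already been looked up by hand; the function name, release names, and dates are hypothetical placeholders, not real Open edX schedule data:

```python
from datetime import date
from typing import Optional


def pick_target_release(
    support_end: date, branch_cuts: dict[str, date]
) -> Optional[tuple[str, date]]:
    """Return the (release, branch cut date) with the latest cut before support_end.

    Returns None if every known branch cut falls on or after support_end.
    """
    candidates = [
        (cut, name) for name, cut in branch_cuts.items() if cut < support_end
    ]
    if not candidates:
        return None
    cut, name = max(candidates)
    return name, cut


# Hypothetical data for illustration only.
branch_cuts = {
    "release-a": date(2030, 4, 10),
    "release-b": date(2030, 10, 10),
    "release-c": date(2031, 4, 10),
}
support_end = date(2030, 12, 1)
target = pick_target_release(support_end, branch_cuts)
print(target)
```

If step 3 applies, the same function can be called with an earlier cutoff date in place of the end-of-support date.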

Select an Orchestration Team

Once there’s a good idea of “what” and “when”, the next step is to figure out “who”. A team (not an individual) should be selected to plan for the upgrade and make sure it gets completed on time. Some selection criteria:

  • The orchestration team should be doing full time work related to the software being upgraded. Part-time assistance with the upgrade from other developers is actively encouraged, but the orchestration team should be able to allocate large amounts of their time to keeping the project on track to timely completion.

  • The team must have some spare capacity in the near future for the next few steps, and be able to dedicate a lot more time to the upgrade as the deadline approaches.

  • The team must have some expertise in the software being upgraded, or be able to develop that expertise at the start of the project. This is needed in order to create an efficient implementation plan and write guidance for other teams that will need to do work for the upgrade.

  • The team should not also be working on another major project with a similar due date.

Document in the roadmap issue which team will be orchestrating the upgrade to avoid future confusion about who is responsible for this.

Determine Scope of Impact

Now that we have a target date and people dedicated to working on the upgrade, we can determine the scope of the upgrade in order to notify those who will be impacted by it. The orchestration team should perform this step.

  1. Find or create a repo health check to identify all the repositories using the software to be upgraded.

  2. Create an issue for the upgrade in each impacted repository, and add it to the Todo column of the https://github.com/orgs/edx/projects/17/views/1 board.

    1. Keep the description minimal; it should mainly link to the upgrade issue in the roadmap. Dates, instructions, etc. are likely to be fleshed out and/or changed later, and we don’t want to have to update them in dozens of issues.

    2. Create a “Project” field value for the upgrade project, and specify it in the Table view for each created issue.

    3. Specify the orchestration team as the “Source” in the Table view.

    4. Specify the owner in the Table view if you can easily identify it. (2U employees and Arbisoft contractors can already do this via a private ownership spreadsheet, but I think Backstage will be the future home of this information for others.)

    5. Leave the “Owner Does” field set to “TBD” (To Be Determined) for now.

    6. Create a task list in the roadmap issue to track all the child issues for individual repositories.

  3. Create a page under Upgrades for documentation related to the upgrade. Add links from this page to the roadmap issue and vice versa.

  4. Read the release notes for the version being upgraded to (and any other versions skipped over during the upgrade), and document (under the Confluence page created in the prior step) the changes believed to be most problematic and/or interesting.

Note that we want to create the per-repository issues pretty early in the process, to give teams advance notice that they need to allocate some time for the work. More detailed communications about the expected level of effort and available automation will come later, but even just this heads up that “you need to leave some room in your schedule for this” is useful.

Automate As Much As Practical

Now that it’s clear what needs to be done, it’s time for the orchestration team to write automation for the project wherever the time savings of avoiding manual work outweigh the time to implement the automation. Good candidates include:

  • Find or write codemods to automatically make some of the necessary changes to source code. See Codemods and Other Upgrade Automation for guidance on tools we’ve found or written already to do this for some of our dependencies.

  • Write configuration file modification scripts to automatically make appropriate updates to testing matrices and other metadata in tox.ini, GitHub Actions workflows, setup.py, etc. We keep these in the repo-tools repository; you can create a new directory there and copy scripts from recent upgrades to use as starting points.

  • Create repo health checks and dashboard(s) to automatically detect if key milestones (like Trove classifiers for appropriate Python/Django version support) in the upgrade have been achieved in a repository, so an auto-updated dashboard can be created instead of manual wiki table updates. Consider also creating a dashboard that runs these checks on the repositories of other software we depend on, to track how many of them still aren’t ready for the upgrade; this is otherwise a very manual, time-consuming task for some types of upgrades (like Python or Django).

  • Create a new view in the https://github.com/orgs/edx/projects/17/views/1 board that filters down to just issues in this upgrade project.

  • Document instructions and available automation for those who will be doing work to prepare repositories for the upgrade. Link to this document in the roadmap issue.
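As an illustration of the milestone checks mentioned above, the following Python sketch tests whether a setup.py declares a given Trove classifier. It is not the existing repo health tooling: the function names and sample file are hypothetical, and it is a plain text scan that would miss classifier lists built dynamically.

```python
import re

# Matches a quoted string containing at least one " :: " separator,
# which is the shape of a Trove classifier.
CLASSIFIER_RE = re.compile(r"""["']([^"'\n]+ :: [^"'\n]+)["']""")


def declared_classifiers(setup_py_text: str) -> set[str]:
    """Collect every classifier-shaped string literal found in the text."""
    return set(CLASSIFIER_RE.findall(setup_py_text))


def declares_support(setup_py_text: str, required: str) -> bool:
    """Report whether the required classifier appears in the text."""
    return required in declared_classifiers(setup_py_text)


# Hypothetical setup.py content for illustration only.
SAMPLE_SETUP_PY = '''
setup(
    name="example-package",
    classifiers=[
        "Programming Language :: Python :: 3.8",
        "Framework :: Django :: 3.2",
    ],
)
'''

print(declares_support(SAMPLE_SETUP_PY, "Framework :: Django :: 4.2"))
```

Running a check like this across every impacted repository gives the raw data for an auto-updated dashboard.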

Clarify Distribution of Work

Now that we’ve done as much as we reasonably can to automate the upgrade, we need to ask the teams owning repositories impacted by the upgrade what they want their role in the upgrade to be. The orchestration team should send an announcement asking teams to read the upgrade instructions and select one of the Upgrade Service Levels for each impacted repository. These choices should be specified in the “Owner Does” field of the corresponding issue on the https://github.com/orgs/edx/projects/17/views/1 board (this is easiest to do in the Table view).

Any external dependencies which still need updates to support the upgrade should have issues created in https://github.com/openedx/public-engineering with the “help wanted” label added. These issues should link to the roadmap issue for the upgrade, the Handling Outdated Dependencies page, and the upgrade instructions document, and should be added to the https://github.com/orgs/edx/projects/17/views/1 board. Once these are created, the orchestration team should ask the broader Open edX developer community for assistance with these issues. These can be completed independently of the other upgrade work if done early enough, but may have a long turnaround time depending on the availability of upstream maintainers, so we want to encourage early action on them. The calendar time needed to complete these dependency upgrades is shortest when many people reach out in parallel to get the work started, rather than queuing all the upgrade communications on a small team.

If the deployment will require nontrivial operations work, there should also be a ticket for the SRE team at 2U (and any other organization publicly committed to releasing from the main development branches) to prepare for any necessary database upgrades, etc. This typically isn’t needed for Django upgrades, but is for MongoDB, MySQL, Ubuntu, etc. This should also go on the https://github.com/orgs/edx/projects/17/views/1 board.

Update the Code

Now that we’ve discovered or written any appropriate upgrade automation, documented how to perform the upgrade, and distributed the work appropriately, it’s time to actually update the code. The initial focus should be on external dependencies and Open edX repositories at the end of the dependency chain (with few or no dependencies of their own which also need to be upgraded). Services can only be fully updated once all of their dependencies have been updated, although some automated and manual work to replace deprecated code usage can be started earlier. Owning teams and the orchestration team work in parallel to update Open edX code, while Core Committers and other Open edX community developers work with the upstream maintainers of our external dependencies to get those updated.
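The ordering described above is a topological sort of the dependency graph. A minimal sketch using Python’s standard-library graphlib module (Python 3.9+), assuming the dependency map has already been gathered; the repository names and the upgrade_waves helper are hypothetical:

```python
from graphlib import TopologicalSorter


def upgrade_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group repositories into waves that can be upgraded in parallel.

    `dependencies` maps each repository to the repositories it depends on.
    Every repository in a wave depends only on repositories in earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map for illustration only.
dependencies = {
    "service-a": {"lib-x", "lib-y"},
    "lib-x": {"lib-z"},
    "lib-y": set(),
    "lib-z": set(),
}
waves = upgrade_waves(dependencies)
for number, repos in enumerate(waves, start=1):
    print(number, repos)
```

The first wave holds the repositories with no outstanding dependencies, which is where the initial effort should go.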

If new information is discovered that could help with other parts of the upgrade (solutions for tricky parts of the upgrade, new codemods for handling some of the code updates, unanticipated problem points to watch out for, etc.), the project documentation should be updated and the changes should be announced to other participants in the upgrade project.

The status of issues on the https://github.com/orgs/edx/projects/17/views/1 board should be kept up to date as work progresses. The project view on that board and the repository health dashboards provide a relatively real-time view of the status of the overall project, minimizing the need for manual dashboard creation in Confluence or elsewhere.

Deploy the Upgrade

As soon as a package is updated to support both the old and new versions of the major dependency being upgraded, a new release should be made to allow other Open edX code to start using it. As soon as the code for an entire service and its dependencies is ready for the upgrade, and any required SRE work has been completed, organizations like 2U that deploy directly from the main development branches should attempt the upgrade. There’s typically no need to wait until all services are ready before upgrading the first of them; we prefer to roll the changes out incrementally.

If the target date is getting close and some services seem at risk of not making the deadline, the orchestration team should work with the owning team(s) to make sure sufficient assistance is made available to get them done.

Make sure the deployment process gets documented for the benefit of site operators running named releases of Open edX, who will typically be deploying the updated code 1-3 months later. (Or in some cases, a few years later, after the people involved in the upgrade have moved on or forgotten much of the context and details.)

Clean Up

Once the upgrade is successfully coded and deployed, there’s usually still some cleanup work left to do:

  1. Update the support windows spreadsheet to correctly indicate the version now used.

  2. Announce the successful completion of the upgrade!

  3. Make sure any deployment instructions make it into the release notes of the next Open edX named release. Communicate with the Build-Test-Release Working Group to make sure this gets done satisfactorily.

  4. Remove any CI matrix entries for no-longer-supported versions of the dependency. For packages, this typically can’t be done until all of the services using them have completed the upgrade. Services are more free to do this relatively soon after the upgrade is successfully deployed.

  5. Remove the https://github.com/orgs/edx/projects/17/views/1 board view created for the upgrade.

  6. Stop running repo health checks which are no longer relevant. We may just want to skip the checks rather than deleting them, to make it easier to reuse them for the next similar upgrade.

  7. Mark the roadmap issue for the upgrade as complete.

  8. Hold a retrospective meeting and an asynchronous conversation to collect feedback on what went well, what could have been better, and what we can do even better next time. Update this document and the pages it links to if appropriate, and write and assign issues for any necessary follow-up work.

  9. Move the root Confluence page for this upgrade’s docs under Past Upgrades.

  10. If any external dependencies were discovered to be unsatisfactorily maintained, schedule work to replace them or (if necessary) take over maintenance of them ourselves. See Handling Outdated Dependencies for more details on this.
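For step 4 above, removing a no-longer-supported version from a tox envlist can often be scripted. A minimal sketch in Python, assuming the envlist uses tox’s brace-expansion syntax; the drop_tox_factor helper and the sample envlist are hypothetical, and real configuration files may need more careful handling:

```python
import re


def drop_tox_factor(envlist: str, prefix: str, version: str) -> str:
    """Remove `version` from a brace group such as django{32,42}.

    A group left with a single version is collapsed (django{42} -> django42).
    A group that would end up empty is left unchanged for manual review.
    """
    pattern = re.compile(re.escape(prefix) + r"\{([^}]*)\}")

    def rewrite(match: re.Match) -> str:
        versions = [v.strip() for v in match.group(1).split(",")]
        kept = [v for v in versions if v != version]
        if not kept:
            return match.group(0)
        if len(kept) == 1:
            return prefix + kept[0]
        return prefix + "{" + ",".join(kept) + "}"

    return pattern.sub(rewrite, envlist)


# Hypothetical envlist for illustration only.
old_envlist = "py{38,311}-django{32,42}"
new_envlist = drop_tox_factor(old_envlist, "django", "32")
print(new_envlist)
```

Review the resulting diff by hand before merging, since the script only understands the simple brace-group form.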
