Mongo 4.2 Upgrade

Overview

edX uses Mongo to store courses. This service is used by two IDAs - edx-platform and cs-comments-service.

There is one MongoDB cluster in Stage, one in Prod and one in Edge. All three are currently on version 4.0.

MongoDB version 4.0 is end of life in April 2022. With that in mind, edX wants to make sure that the next supported version is included in an Open edX release that will be issued prior to this EOL. The next supported release is scheduled to be cut in late October 2021, and will include both IDAs.

History

The previous upgrade was done by Imran (Lahore SRE); see https://openedx.atlassian.net/wiki/spaces/SRE/pages/2293597054. For prerequisites, see https://openedx.atlassian.net/browse/ISRE-693, and https://openedx.atlassian.net/browse/ISRE-779 for Devstack. We also upgraded the MongoDB clusters for the Yonkers instance, which had to be done in several steps: https://openedx.atlassian.net/browse/YONK-1891.

Ownership

The owner of the MongoDB service is SRE.

While the owner of edx-platform is the Arch-BOM squad, modulestore (the part of edx-platform that uses Mongo) is owned by T&L.

cs-comments-service is effectively owned by the Infinity squad, formerly T&L Pakistan (even though the ownership sheet doesn’t reflect Infinity squad yet).

Goal

Upgrade Mongo clusters to 4.2 by end of August 2021.

Scope

Upgrade MongoDB from 4.0 to 4.2 in Prod and Edge, and verify edx-platform and cs-comments-service against the new version.

Upgrade libraries in these IDAs as needed.

Path to Prod and Edge will be devstack -> sandbox -> stage -> prod/edge.

All work will be tracked under https://openedx.atlassian.net/browse/PSRE-877.

Plan

As agreed with the TNL and SRE teams, discovery and development work will be done by the SRE team, while TNL will be asked to review PRs and test on Devstack and other environments.

According to the discovery done by the SRE team, we cannot jump directly from 4.0 to 4.4 and skip 4.2. The upgrade has to be done in two steps: first to 4.2, then to 4.4 (see 4.4 Blocker).

The plan is to upgrade devstack to 4.2 and take that version all the way through Prod/Edge, then repeat the process from 4.2 to 4.4, again starting from devstack and going through Prod/Edge.
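The stepwise path can be sketched as a small helper. This is only an illustration and not part of any SRE tooling: `RELEASE_SERIES` and `upgrade_path` are hypothetical names, and the list covers only the release series relevant to this project.

```python
from typing import List

# Hypothetical helper for illustration; not taken from any edX tooling.
# MongoDB release series in order. Each upgrade must land on the next
# series before moving further, so 4.0 -> 4.4 has to pass through 4.2.
RELEASE_SERIES = ["3.6", "4.0", "4.2", "4.4"]


def upgrade_path(current: str, target: str) -> List[str]:
    """Return the series to step through, in order, excluding `current`."""
    start = RELEASE_SERIES.index(current)
    end = RELEASE_SERIES.index(target)
    if end < start:
        raise ValueError("downgrades are not covered by this sketch")
    return RELEASE_SERIES[start + 1 : end + 1]


print(upgrade_path("4.0", "4.4"))  # ['4.2', '4.4']
```

Each entry in the returned list corresponds to one full devstack -> sandbox -> stage -> prod/edge pass.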

4.4 Blocker

We’ve discovered that the Mongo cloud backup we have been using isn’t compatible with Mongo 4.4 (FCV 4.2), so we will be unable to upgrade past 4.2 until we find a solution to the backup issue.

https://docs.cloudmanager.mongodb.com/core/backup-preparations/#requirements-and-limitations

We run the free MongoDB Community Edition in our environment. Starting with Mongo 4.2 (FCV 4.2), the backup feature in MongoDB Cloud Manager is supported only for databases running MongoDB Enterprise Edition, which is available as part of the paid MongoDB Enterprise Advanced subscription.

Upgrading to 4.2 should be sufficient for the next Open edX release, since end of life has not been announced for it.

If we decide to pursue 4.4 in the future, tickets we created and cancelled for it can be found on this page: https://openedx.atlassian.net/wiki/spaces/AC/pages/2928443423

Dependencies

The Mongo upgrade can be done independently of other upgrade projects, e.g. https://openedx.atlassian.net/wiki/spaces/AC/pages/2923561087, https://openedx.atlassian.net/wiki/spaces/AC/pages/2844426436, etc. We have not identified any other dependencies.

Each upgraded IDA can be released independently, as soon as it is done. However, we want to make sure we don’t roll out to Prod and Edge in the same deployment as any other service upgrade, such as those mentioned above.

Risks and Mitigation

One known risk is performance degradation after we upgrade Prod, due to the size of the DB there. The Stage DB is much smaller than the Prod DB, so scaling problems might not be detectable during review and testing in Sandbox or Stage.

Rollout and Contingency

  • Rollout Plan

    • Stage will be upgraded, then prod and edge

    • New instances on the new version are added to the cluster one at a time and the data is synced

    • There is 30 seconds or less of downtime when the cluster cuts over to a primary node on the new version

      • No downtime notifications should be needed

    • Old version nodes are then removed from the cluster

  • Rollback plan

    • “Rollback” would be swapping back to older version as primary node if switching primary to new version causes problems.

    • Cluster will have nodes of two versions at the same time for a while and app servers will start using nodes running newer Mongo as soon as they complete syncing data

    • Rollback to the old version is possible until we update the featureCompatibilityVersion (FCV) parameter in Mongo. Once FCV has been raised, we can’t go back without a dump and restore.
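The rollback window can be checked programmatically before any cutover decision. The sketch below is an assumption-laden illustration, not the team's actual runbook: `current_fcv` and `rollback_still_possible` are hypothetical names, and `FakeClient` is a stand-in so the example runs without a cluster. With a real cluster, a `pymongo.MongoClient` would be passed instead; the only server call used is the `getParameter` admin command.

```python
from typing import Any


def current_fcv(client: Any) -> str:
    """Read featureCompatibilityVersion via the getParameter admin command."""
    reply = client.admin.command("getParameter", 1, featureCompatibilityVersion=1)
    return reply["featureCompatibilityVersion"]["version"]


def rollback_still_possible(client: Any, old_version: str = "4.0") -> bool:
    """True while FCV still matches the old version (rule taken from the plan above)."""
    return current_fcv(client) == old_version


# Stand-ins for demonstration only; they mimic the shape of the server reply.
class FakeAdmin:
    def __init__(self, fcv: str) -> None:
        self.fcv = fcv

    def command(self, name: str, value: Any = 1, **kwargs: Any) -> dict:
        if name == "getParameter" and kwargs.get("featureCompatibilityVersion"):
            return {"featureCompatibilityVersion": {"version": self.fcv}, "ok": 1.0}
        raise NotImplementedError(name)


class FakeClient:
    def __init__(self, fcv: str) -> None:
        self.admin = FakeAdmin(fcv)


print(rollback_still_possible(FakeClient("4.0")))  # True
print(rollback_still_possible(FakeClient("4.2")))  # False
```

Raising FCV itself is done with the `setFeatureCompatibilityVersion` admin command, which is the step after which the plan above no longer allows a simple swap back to the old nodes.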

Timeline

Although this only needs to be completed before October 15, the SRE team wants to complete it in August 2021 if possible.

Project Team

Name | Squad | Role
@lcicchese | SRE | Service Owner
@Natalia Berdnikov | Open edX | Project Manager
@Joe Mulloy | SRE (Cambridge) | Lead SRE Engineer
@Syed Imran Hassan Abdi | SRE (Arbisoft) | Consulting SRE Engineer
@Jeremy Ristau | TNL (owner of Mongo usage in edx-platform) | TNL Lead
@Kyle McCormick (Do Not Use) (Deactivated) | TNL (owner of Mongo usage in edx-platform) | Review (edx-platform)
@Awais Jibran | Infinity squad (IDA owner of cs-comments-service) | Review (cs-comments-service)
@Nimisha Asthagiri (Deactivated) | Open edX | Stakeholder (Infrastructure)
@Ned Batchelder | Open edX | Stakeholder (Open edX named release)
Open edX Community | Open edX Community | Stakeholder (Consumers of Maple named release)

Communication Plan

Due to the small number of stakeholders, we will most likely communicate ad hoc via Slack. The dedicated channel is #mongodb-4-4.

The rollout will need to be announced after the fact to all-engineering.