Attendees
Kevin Falcone, Jeremy Bowman, Brian Mesick
Purpose
Initially this meeting was to discuss the potential timeline and plan for upgrading MongoDB across the edX environments, with an eye (from the Platform side) toward reducing test times by using the in-memory storage engine introduced in Mongo 3.4. We discovered that the in-memory engine is only available in Enterprise, however, and so shifted the focus to a brain dump of current Mongo infrastructure and other ways Platform and DevOps might need to address the semi-urgent need for a Mongo upgrade.
Notes
Kevin drew the current state of the cluster and apps that use it as follows:
Availability Zone B | Availability Zone C | Availability Zone D | Availability Zone E |
---|---|---|---|
edxapp | edxapp | edxapp | edxapp |
edx celery worker | edx celery worker | | |
Primary Mongo (Forums use this one only) | Secondary Mongo | Secondary Mongo | |
Hidden Secondary Mongo | | | |
Forums | Forums | | |
- All prod mongos are currently 3.0
- Devstack is still running 2.6
- Odd that there is still a 2.6 package but no 3.0
- Kevin outlined an upgrade process as such:
- Testing in devstack
- Testing in Jenkins
- Testing the following steps on a load test cluster during a test, then doing them on the prod cluster:
- Replace the hidden secondary mongo with 3.2
- Replace each secondary mongo with 3.2
- Fail over the primary to a secondary and replace with a 3.2 instance
- Repeat with 3.4 if desired
- Upgrade process is somewhat complicated by the size of the database (~1TB, ~600 megs on disk)
- May need to do a pymongo upgrade to support upgrading to 3.2 or 3.4
- BMez's tests on devstack showed no failures on a naive upgrade to 3.4 without a pymongo upgrade
- Doesn't mean it wouldn't break in other, more complicated environments
- Discussion of the potential value of trimming the data before moving
- Deleting orphaned nodes
- Fun problem, but complicated and potentially dangerous from a data loss and database performance perspective
- Some work has been done on this by Ed?
- Unknown priority
- Pruning course history to only keep X versions
- Some work may have been done?
- Unknown priority
- No SLA for number of versions to keep
- No user-facing tools even exist to roll back versions (management command occasionally used)
- Tools may get built soon, though, with new teams?
- Potentially irritating to users, potential to cause prod database problems
- Splitting forums to a separate cluster
- Only about 10% of the database size
- Expensive to run a 2nd cluster
- Maybe worth it if the alternative is changes to forums to support 3.2 or 3.4?
- Depending on how much upgrade time this would save, it might be worth doing some of this work sooner
- 8 different ~1TB upgrades is a loooooong maintenance window
- Being in an asymmetric state during this window increases risk if something goes wrong over that time (week or more?)
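
As a rough illustration of the upgrade order Kevin outlined (hidden secondary first, then each secondary, then fail over and replace the primary, repeated per version), here is a minimal planning sketch in plain Python. It does not talk to Mongo at all. The member names, the `plan_rolling_upgrade` helper, and the assumption that the first listed secondary wins the election after a step-down are all hypothetical and only there to make the ordering and the step count concrete.

```python
def plan_rolling_upgrade(roles, versions):
    """Return an ordered list of (version, action, member) steps.

    `roles` maps a member name to "primary", "secondary", or "hidden".
    This is a planning sketch only; it performs no database operations.
    """
    roles = dict(roles)  # work on a copy so the caller's mapping is untouched
    steps = []
    for version in versions:
        hidden = [name for name, role in roles.items() if role == "hidden"]
        secondaries = [name for name, role in roles.items() if role == "secondary"]
        primary = next(name for name, role in roles.items() if role == "primary")

        # Hidden secondary first, then the visible secondaries.
        for name in hidden + secondaries:
            steps.append((version, "replace", name))

        # Fail over the primary, then replace the old primary last.
        steps.append((version, "step_down", primary))
        steps.append((version, "replace", primary))

        # Assumption: the first listed secondary wins the election, and the
        # replaced former primary rejoins as a secondary.
        roles[secondaries[0]] = "primary"
        roles[primary] = "secondary"
    return steps


# Hypothetical names for the four Mongo members in the diagram.
CLUSTER = {
    "mongo-a": "primary",
    "mongo-b": "secondary",
    "mongo-c": "secondary",
    "mongo-hidden": "hidden",
}

PLAN = plan_rolling_upgrade(CLUSTER, ["3.2", "3.4"])
REPLACEMENTS = [step for step in PLAN if step[1] == "replace"]

for version, action, name in PLAN:
    print(f"{version}: {action} {name}")
print(f"{len(REPLACEMENTS)} member replacements in total")
```

With four members and two target versions the plan contains eight replacements, which is where the "8 different ~1TB upgrades" figure comes from. Skipping the 3.4 pass, or shrinking the data before starting, cuts the window accordingly.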