Mongo 3.4 Discussion & Meeting Notes

Attendees

Kevin Falcone, Jeremy Bowman, Brian Mesick

Purpose

Initially this meeting was to discuss the potential timeline and plan for upgrading MongoDB across the edX environments, with an eye (from the Platform side) toward reducing test times by using the in-memory storage engine introduced in Mongo 3.4. However, we discovered that the in-memory engine is only available in Enterprise, so we shifted the focus to a brain dump of the current Mongo infrastructure and other ways Platform and DevOps might address the semi-urgent need for a Mongo upgrade.

Notes

Kevin drew the current state of the cluster and apps that use it as follows:

Availability zones: B, C, D, E
edxapp: 4 (one per zone)
edx celery worker: 2
Primary Mongo: 1 (Forums use this one only)
Secondary Mongo: 2
Hidden Secondary Mongo: 1
Forums: 2


  • All prod mongos are currently 3.0
  • Devstack is still running 2.6
    • Odd that there is still a 2.6 package but no 3.0
  • Kevin outlined an upgrade process as such:
    • Testing in devstack
    • Testing in Jenkins
    • Testing the following steps on a load test cluster during a test, then doing them on the prod cluster:
      • Replace the hidden secondary mongo with 3.2
      • Replace each secondary mongo with 3.2
      • Fail over the primary to a secondary and replace with a 3.2 instance
      • Repeat with 3.4 if desired
  • Upgrade process is somewhat complicated by the size of the database (~1.3 TB, ~650 GB on disk)
  • May need to do a pymongo upgrade to support upgrading to 3.2 or 3.4
  • BMez's tests on devstack showed no failures on a naive upgrade to 3.4 without a pymongo upgrade
    • Doesn't mean it wouldn't break in other, more complicated environments
  • Discussion of the potential value of trimming the data before moving
    • Deleting orphaned nodes
      • Fun problem, but complicated and potentially dangerous from a data loss and database performance perspective
      • Some work has been done on this by Ed?
      • Unknown priority
    • Pruning course history to only keep X versions
      • Some work may have been done?
      • Unknown priority
      • No SLA for number of versions to keep
      • No user-facing tools even exist to roll back versions (management command occasionally used)
      • Tools may get built soon, though, with new teams?
      • Potentially irritating to users, potential to cause prod database problems
    • Splitting forums to a separate cluster
      • Only about 10% of the database size
      • Expensive to run a 2nd cluster
      • Maybe worth it if the alternative is changes to forums to support 3.2 or 3.4?
  • Depending on how much upgrade time we can save, it might be worth doing some of this work sooner
    • 8 different ~1 TB upgrades is a long maintenance window (should be zero downtime, but with degraded performance; 11 hours per sync right now)
      • Being in an asymmetric state during this window increases risk if something goes wrong over that time (week or more?)
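
The upgrade order Kevin outlined (hidden secondary first, then each secondary, then fail over and replace the primary, then repeat for 3.4) could be written down as a small planning sketch. This is only an illustration: the member names are hypothetical, and the sketch plans by role as of the start of each pass, so it does not track which node actually holds the primary role after a failover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str  # hypothetical host label, not a real hostname
    role: str  # "primary", "secondary", or "hidden"


# Replacement order from the notes: hidden secondary, then secondaries, then primary.
ROLE_ORDER = {"hidden": 0, "secondary": 1, "primary": 2}


def plan_rolling_upgrade(members, versions):
    """Return an ordered list of human-readable steps, one pass per target version."""
    steps = []
    for version in versions:
        # sorted() is stable, so members with the same role keep their input order.
        for member in sorted(members, key=lambda m: ROLE_ORDER[m.role]):
            if member.role == "primary":
                # The primary is stepped down before it is replaced.
                steps.append(f"fail over {member.name} to a secondary")
            steps.append(f"replace {member.name} with a {version} instance")
    return steps


# Four members, matching the counts in Kevin's diagram.
members = [
    Member("mongo-primary", "primary"),
    Member("mongo-secondary-1", "secondary"),
    Member("mongo-secondary-2", "secondary"),
    Member("mongo-hidden", "hidden"),
]

steps = plan_rolling_upgrade(members, ["3.2", "3.4"])
for number, step in enumerate(steps, 1):
    print(f"{number}. {step}")
```

Running the same plan against the load test cluster first, as the notes suggest, would be the place to find out whether this ordering holds up under load.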
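
The window estimate in the last bullets can be sanity-checked with some quick arithmetic. The sketch below assumes the "8 upgrades" figure comes from four Mongo members times two target versions (3.2, then 3.4), and that every replacement needs one full 11-hour sync done one at a time; neither assumption is stated outright in the notes.

```python
def estimate_window_hours(member_count, version_count, hours_per_sync):
    """Total sync hours if each member is resynced once per version, serially."""
    return member_count * version_count * hours_per_sync


MEMBERS = 4          # primary, two secondaries, hidden secondary (from the diagram)
VERSIONS = 2         # assumed: 3.2 first, then 3.4
HOURS_PER_SYNC = 11  # "11 hours per sync right now"

sync_count = MEMBERS * VERSIONS
total_hours = estimate_window_hours(MEMBERS, VERSIONS, HOURS_PER_SYNC)
total_days = total_hours / 24

print(f"{sync_count} syncs, {total_hours} hours, about {total_days:.1f} days back to back")
# prints "8 syncs, 88 hours, about 3.7 days back to back"
```

Back to back that is just under four days; at one sync per working day it would be eight working days, which is in line with the "week or more" guess above.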