
  • All prod mongos are currently 3.0
  • Devstack is still running 2.6
    • Odd that there is still a 2.6 package but no 3.0
  • Kevin outlined the following upgrade process:
    • Testing in devstack
    • Testing in Jenkins
    • Testing the following steps on a load test cluster during a test, then doing them on the prod cluster:
      • Replace the hidden secondary with a 3.2 instance
      • Replace each secondary with a 3.2 instance
      • Fail over the primary to a secondary and replace the old primary with a 3.2 instance
      • Repeat with 3.4 if desired
  • Upgrade process is somewhat complicated by the size of the database (~1.3 TB, ~650 GB on disk)
  • May need to do a pymongo upgrade to support upgrading to 3.2 or 3.4
  • BMez's tests on devstack showed no failures on a naive upgrade to 3.4 without a pymongo upgrade
    • Doesn't mean it wouldn't break in other, more complicated environments
  • Discussion of the potential value of trimming the data before moving
    • Deleting orphaned nodes
      • Fun problem, but complicated and potentially dangerous from a data loss and database performance perspective
      • Some work has been done on this by Ed?
      • Unknown priority
    • Pruning course history to only keep X versions
      • Some work may have been done?
      • Unknown priority
      • No SLA for number of versions to keep
      • No user-facing tools exist to roll back versions (a management command is occasionally used)
      • Tools may get built soon, though, with new teams?
      • Potentially irritating to users, potential to cause prod database problems
    • Splitting forums to a separate cluster
      • Only about 10% of the database size
      • Expensive to run a 2nd cluster
      • Maybe worth it if the alternative is changes to forums to support 3.2 or 3.4?
  • Depending on how much upgrade time we can save, it might be worth doing some of this work sooner
    • 8 different ~1 TB upgrades make for a very long maintenance window (should be zero downtime, but with degraded performance; each sync currently takes 11 hours)
      • Being in an asymmetric state during this window (a week or more?) increases risk if something goes wrong
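
The replacement order Kevin outlined (hidden secondary first, then each secondary, then the primary after a failover) can be sketched as a small planning helper. This is only an illustration of the ordering for a single version pass: the `Member` type, the role names, and the host names are hypothetical, and nothing here talks to a real cluster. Roles would need to be re-read before a later 3.4 pass, since the failover changes which node is primary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical description of one replica set member."""
    host: str
    role: str  # "hidden", "secondary", or "primary"


# Lower rank is replaced earlier; the primary always goes last.
ROLE_RANK = {"hidden": 0, "secondary": 1, "primary": 2}


def plan_rolling_upgrade(members, version):
    """Return human-readable steps for one rolling upgrade pass."""
    steps = []
    # sorted() is stable, so secondaries keep their input order.
    for member in sorted(members, key=lambda m: ROLE_RANK[m.role]):
        if member.role == "primary":
            # The primary is never replaced in place: fail over first.
            steps.append(f"fail over {member.host} to a secondary")
        steps.append(f"replace {member.host} with a {version} instance")
    return steps


if __name__ == "__main__":
    cluster = [
        Member("mongo-a", "primary"),
        Member("mongo-b", "secondary"),
        Member("mongo-c", "hidden"),
        Member("mongo-d", "secondary"),
    ]
    for step in plan_rolling_upgrade(cluster, "3.2"):
        print(step)
```

Keeping the plan as plain data makes it easy to review against the load test cluster before anything is run on prod.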
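
As a rough back-of-the-envelope check on the window size, the figures in the notes (8 upgrades, 11 hours per sync) can be multiplied out. The sketch below assumes the syncs run strictly one after another and that sync time scales linearly with data size; both are simplifying assumptions, not measurements. The 10% reduction used in the example corresponds to the forums share mentioned above.

```python
def sync_window_hours(sync_count, hours_per_sync, size_fraction=1.0):
    """Estimate total sync time, assuming sequential syncs and
    sync time proportional to data size (both are assumptions)."""
    return sync_count * hours_per_sync * size_fraction


if __name__ == "__main__":
    full = sync_window_hours(8, 11)
    # Hypothetical case: forums (~10% of the data) moved to another cluster.
    trimmed = sync_window_hours(8, 11, size_fraction=0.9)
    print(f"full:    {full:.1f} h ({full / 24:.1f} days)")
    print(f"trimmed: {trimmed:.1f} h ({trimmed / 24:.1f} days)")
```

Sync time alone comes to 88 hours under these assumptions, so the "week or more" estimate presumably also covers time between steps for monitoring and failovers.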