RCA: Enterprise partially-applied licensed course enrollment migration

Date

Feb 16, 2023

Team(s)

Enterprise, SRE

Summary

An enterprise migration that added a unique field failed partway through and was left in a partially-applied state on prod.

Impact

The edxapp production deployment pipeline was blocked; one enterprise learner was temporarily blocked from completing enrollment.

Incident Duration

2 hours

Publicly-shareable lesson learned

Technical Cause and resolution

  • Deployment of an enterprise migration to prod failed. The migration, which added new tables and columns, was left in a partially-applied state. Notably, it added a new unique UUID field, and that field held non-unique values at the point where the migration tried to add the unique constraint, which caused the error. This happened in spite of following the steps in https://docs.djangoproject.com/en/3.2/howto/writing-migrations/#migrations-that-add-unique-fields

    • We ran into a blue/green deploy “race condition”: the manual addition of a UUID had to be applied to ~400k rows, which gave ample time for user activity to write new LicensedEnterpriseCourseEnrollment records.

  • Splitting the migration up into more atomic components and discrete deployments would have fixed the issue (and ultimately did).

  • Later, we tried to deploy a new version of the migrations, which failed on the stage/edge environments, where we had forgotten to roll back the original, fully-applied migrations.
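
The race described above can be reproduced in miniature. The sketch below is a simplified model, not a reconstruction of the actual migration: it uses Python's built-in sqlite3 instead of Django and MySQL, the `enrollment` table and `simulate` function are hypothetical, and it assumes that writers unaware of the new column receive the single shared default. It shows the unique-constraint step failing when such writes land during the backfill, and succeeding when the writers already set their own UUID (which is what discrete deployments make possible).

```python
import sqlite3
import uuid


def simulate(writes_during_backfill: int, writers_set_uuid: bool) -> str:
    """Return "applied" or "failed" for the final unique-constraint step."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, learner TEXT)")
    conn.executemany(
        "INSERT INTO enrollment (learner) VALUES (?)",
        [(f"learner-{n}",) for n in range(5)],
    )

    # Step 1: add the column. The default is computed once, so every row that
    # receives it shares the same UUID-shaped string (assumption of this model).
    shared_default = str(uuid.uuid4())
    conn.execute(
        f"ALTER TABLE enrollment ADD COLUMN uuid TEXT DEFAULT '{shared_default}'"
    )

    # Step 2: backfill a fresh UUID into each row that existed at the start.
    existing_ids = [row[0] for row in conn.execute("SELECT id FROM enrollment")]
    for position, row_id in enumerate(existing_ids):
        conn.execute(
            "UPDATE enrollment SET uuid = ? WHERE id = ?",
            (str(uuid.uuid4()), row_id),
        )
        if position == 0:
            # Simulated user activity that lands while the backfill is running.
            for n in range(writes_during_backfill):
                if writers_set_uuid:
                    conn.execute(
                        "INSERT INTO enrollment (learner, uuid) VALUES (?, ?)",
                        (f"new-{n}", str(uuid.uuid4())),
                    )
                else:
                    conn.execute(
                        "INSERT INTO enrollment (learner) VALUES (?)",
                        (f"new-{n}",),
                    )

    # Step 3: add the unique constraint; this fails if any value is duplicated.
    try:
        conn.execute("CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)")
    except sqlite3.IntegrityError:
        return "failed"
    finally:
        conn.close()
    return "applied"
```

In this model, `simulate(2, writers_set_uuid=False)` fails at step 3 because the two late rows share the default, while `simulate(2, writers_set_uuid=True)` applies cleanly.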

Original, faulty migration attempt:

Revert of the above:

Final, good attempt:

What factors contributed to this incident?

  • Adding a unique UUID field to MySQL 5.7 via Django is complex; the default value of the column is actually a single, non-unique string in the format of a UUID, so every row that receives the default gets the same value.

  • Django migrations on MySQL 5.7 are non-atomic, because MySQL does not run schema changes (DDL) inside a transaction; if your migration operations fail partway through, the migration is left in a partially-applied state. Django considers the prior migration to be what’s “current” after the partial application.

    • “Django is pretty difficult to use correctly.” - Anon.

      • Having good recipes and decision trees could help.

      • Making smaller, atomic migration files would help.

      • Having well-broadcast knowledge of “the hard parts” might help.

  • We couldn’t successfully roll back stage migrations from GoCD.

    • We’re not sure if any edx-enterprise migration can be successfully rolled back on stage from GoCD.

  • We have a good resource about “how to do migrations right”, but it does not yet cover the use case of adding a unique UUID field to an existing table.

    • Is there a series of questions that authors of migrations can ask that would act as a decision tree for “how many/what phases do I need for this migration to work?”

  • Nothing broke on local or stage.

    • Why did it not break on stage?

      • No traffic, and no load-testing occurred against it (load testing would have had to run while the uuid-generation part of the migration was running).

    • We often don’t understand the scale and usage profile of production vs. local/stage.

      • We’ve had similar incidents in the past (e.g. table locks) around scale and data contention.

  • Why can we not take downtime while migrations like this roll out?

    • Do we have a good definition of “downtime”?

    • Could we have made the downtime enterprise-specific?

      • Could we feature-flag edx-enterprise to make this possible?

      • Could we design our systems to be more eventually-consistent?

    • We’re very sensitive to downtime in the service this migration was applied to (it’s the LMS/edxapp).

      • If we could take down only enterprise, we’d have more flexibility, a smaller blast radius for unplanned outages, etc.

  • Can we turn enterprise off on edge?
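
The partially-applied state described in the factors above can be modeled with a small sketch. It is an illustration under stated assumptions, not Django's implementation: it uses sqlite3 in autocommit mode to mimic schema changes that persist immediately, and the `apply_migration` helper, the `applied_migrations` table, and the migration name are all hypothetical. The first operation sticks, the second fails, nothing is recorded as applied, and a plain retry then trips over the leftover column.

```python
import sqlite3


def apply_migration(conn, name, operations):
    """Run each statement in order; record the migration only if all succeed."""
    for sql in operations:
        conn.execute(sql)  # autocommit: each statement persists immediately
    conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE applied_migrations (name TEXT)")
conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, code TEXT)")
conn.executemany("INSERT INTO enrollment (code) VALUES (?)", [("dup",), ("dup",)])

operations = [
    "ALTER TABLE enrollment ADD COLUMN note TEXT",
    # Fails: the existing rows hold duplicate values in `code`.
    "CREATE UNIQUE INDEX enrollment_code_uniq ON enrollment (code)",
]

first_error = None
try:
    apply_migration(conn, "0002_hypothetical", operations)
except sqlite3.IntegrityError as exc:
    first_error = exc

# The first operation survived the failure, but nothing was recorded.
columns = [row[1] for row in conn.execute("PRAGMA table_info(enrollment)")]
recorded = conn.execute("SELECT COUNT(*) FROM applied_migrations").fetchone()[0]

# Retrying the same migration now fails earlier, on the column that already exists.
retry_error = None
try:
    apply_migration(conn, "0002_hypothetical", operations)
except sqlite3.OperationalError as exc:
    retry_error = exc
```

Keeping each migration file to a single schema operation means a failure leaves nothing half-done within that file, so a retry starts from a known state.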

What did we learn?

  • Non-atomic Django-MySQL operations should be isolated in small migration files.

    • e.g. adding a column should be one (or maybe two, because of django-simple-history) migration file.

  • There are consequences to leaving our edge deployments in a bad state. Edge is a real system and should be treated as such.

    • Not all product-delivery members know what the edge environment is for, and it’s not always top-of-mind when considering consequences of our actions and decisions.

  • We have enterprise admin and learner portal “status banners” that are useful to communicate planned or unplanned dysfunction to our users.

  • Adding our technical learnings to the migrations reference doc probably can’t hurt.

  • Upgrading to MySQL 8.0 can’t hurt.
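
One way to act on the first learning above is to give the constraint step its own small migration and check its preconditions before running it. The sketch below is a hedged illustration using sqlite3; the `uniqueness_blockers` helper and the table and column names are hypothetical, and a real check would run against MySQL. It reports the NULLs and duplicated values that would make a unique, non-null constraint fail.

```python
import sqlite3


def uniqueness_blockers(conn, table, column):
    """Return (null_count, duplicated_values) for the given column.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    null_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    duplicated_values = [
        row[0]
        for row in conn.execute(
            f"SELECT {column} FROM {table} "
            f"WHERE {column} IS NOT NULL "
            f"GROUP BY {column} HAVING COUNT(*) > 1"
        )
    ]
    return null_count, duplicated_values


# Example data: one NULL and one duplicated value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, uuid TEXT)")
conn.executemany(
    "INSERT INTO enrollment (uuid) VALUES (?)",
    [("aaa",), ("aaa",), ("bbb",), (None,)],
)
null_count, duplicated_values = uniqueness_blockers(conn, "enrollment", "uuid")
safe_to_add_constraint = null_count == 0 and not duplicated_values
```

If the check reports blockers, the backfill can be repeated before the constraint migration is deployed, rather than discovering the problem mid-deployment.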