RCA: Enterprise partially-applied licensed course enrollment migration

Date

Feb 16, 2023

Team(s)

Enterprise, SRE

Summary

An Enterprise migration that included adding a unique field failed partway through and was left in a partially applied state on prod.

Impact

The edxapp production deployment pipeline was blocked; one enterprise learner was temporarily blocked from completing enrollment.

Incident Duration

2 hours

Publicly-shareable lesson learned

  • Non-atomic Django-MySQL operations should be isolated in small migration files.

  • There are consequences to leaving our edge deployments in a bad state. Edge is a real system and should be treated as such. Edge is not always top-of-mind when considering consequences of our actions and decisions.

  • We have enterprise admin and learner portal “status banners” that are useful for communicating planned or unplanned loss of functionality to our users.

  • Adding our technical learnings to the migrations reference doc (Everything About Database Migrations) probably can’t hurt.

  • Upgrading to MySQL 8.0 can’t hurt.

Technical Cause and resolution

  • Deployment of an enterprise migration to prod failed. The migration, which added new tables and columns, was left in a partially applied state. Notably, it added a new UUID field that still held non-unique values when the unique constraint was added, hence the error. This happened despite following the steps in the Django documentation (Writing database migrations).

    • We hit a blue/green deploy “race condition”: the manual addition of a UUID had to be applied to ~400k rows, which gave ample time for user activity to write new LicensedEnterpriseCourseEnrollment records.

  • Splitting the migration into more atomic components and discrete deployments would have fixed the issue (and ultimately did).

  • Later, we tried to deploy a new version of the migrations. This failed on the stage and edge environments, where the original migrations had been fully applied and we had forgotten to roll them back.
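The failure mode described above can be reproduced in miniature. The following is a simplified simulation, not the actual migration. It uses Python's built-in sqlite3 in place of MySQL and a hypothetical `enrollment` table in place of LicensedEnterpriseCourseEnrollment. It also assumes that rows written by the old code during the backfill pick up the column's single shared default.

```python
import sqlite3
import uuid

# Assumption: the column default is computed once, so every row that relies
# on it receives the same UUID-formatted string.
SHARED_DEFAULT = str(uuid.uuid4())


def simulate_backfill_race(existing_rows=5, writes_during_backfill=2):
    """Return True if the unique index can be created, False if it fails."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, learner TEXT)")
    conn.executemany(
        "INSERT INTO enrollment (learner) VALUES (?)",
        [(f"learner-{i}",) for i in range(existing_rows)],
    )

    # Step 1: add the column; every existing row gets the same default.
    conn.execute(
        f"ALTER TABLE enrollment ADD COLUMN uuid TEXT DEFAULT '{SHARED_DEFAULT}'"
    )

    # Step 2: the backfill works from a snapshot of the rows present at start.
    snapshot = [row[0] for row in conn.execute("SELECT id FROM enrollment")]

    # Old application code keeps writing rows while the backfill is running.
    # It does not know about the new column, so these rows keep the default.
    for i in range(writes_during_backfill):
        conn.execute("INSERT INTO enrollment (learner) VALUES (?)", (f"new-{i}",))

    for row_id in snapshot:
        conn.execute(
            "UPDATE enrollment SET uuid = ? WHERE id = ?",
            (str(uuid.uuid4()), row_id),
        )

    # Step 3: adding the unique constraint fails if duplicates remain.
    try:
        conn.execute("CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)")
        return True
    except sqlite3.IntegrityError:
        return False
```

With no writes during the backfill (`simulate_backfill_race(5, 0)`) the index is created cleanly, which is consistent with nothing breaking on local or stage. With two or more concurrent writes the late rows share the default and the constraint fails.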

Original, faulty migration attempt:

Revert of the above:

Final, good attempt:

What factors contributed to this incident?

  • Adding a unique UUID field on MySQL 5.7 via Django is complex: the default value used to populate the column for existing rows is generated only once, so it is a single, non-unique string in UUID format.

  • Django migrations on MySQL 5.7 are non-atomic; if your migration operations fail partway through, the migration is left in a partially-applied state. Django considers the prior migration to be what’s “current” after the partial application.

    • “Django is pretty difficult to use correctly.” - Anon.

      • Having good recipes and decision trees could help.

      • Making smaller, atomic migration files would help.

      • Having well-broadcast knowledge of “the hard parts” might help.

  • We couldn’t successfully roll back stage migrations from GoCD.

    • We’re not sure if any edx-enterprise migration can be successfully rolled back on stage from GoCD.

  • We have a good resource about “how to do migrations right”, but it does not yet cover the use case of adding a unique UUID field to an existing table.

    • Is there a series of questions that authors of migrations can ask that would act as a decision tree for “how many/what phases do I need for this migration to work?”

  • Nothing broke on local or stage.

    • Why did it not break on stage?

      • No traffic, and no load-testing occurred against it (load testing would have had to run while the uuid-generation part of the migration was running).

    • We often don’t understand the scale and usage profile of production vs. local/stage.

      • We’ve had similar incidents in the past (e.g. table locks) around scale and data contention.

  • Why can we not take downtime while migrations like this roll out?

    • Do we have a good definition of “downtime”?

    • Could we have made the downtime enterprise-specific?

      • Could we feature-flag edx-enterprise to make this possible?

      • Could we design our systems to be more eventually-consistent?

    • We’re very sensitive to downtime in the service this migration was applied to (it’s the LMS/edxapp).

      • If we could take down only enterprise, we’d have more flexibility, a smaller blast radius for unplanned outages, etc.

  • Can we turn enterprise off on edge?
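To illustrate the partially applied state described in the factors above, here is a minimal sketch. It uses SQLite in autocommit mode as a stand-in for MySQL 5.7, where each DDL statement commits implicitly. The `applied_migrations` table and the `apply_migration` helper are hypothetical stand-ins for Django's migration bookkeeping, not Django's real implementation.

```python
import sqlite3


def apply_migration(conn, name, statements):
    """Run each statement in order; record the migration only if all succeed."""
    for sql in statements:
        # Autocommit: each statement is permanent as soon as it runs,
        # so a later failure does not undo the earlier ones.
        conn.execute(sql)
    conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))


def demo_partial_application():
    # isolation_level=None puts the connection in autocommit mode.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE applied_migrations (name TEXT)")
    conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, code TEXT)")
    conn.executemany("INSERT INTO enrollment (code) VALUES (?)", [("dup",), ("dup",)])

    # One large migration: the first operation succeeds, the second fails
    # because the existing data contains duplicate values.
    big_migration = [
        "ALTER TABLE enrollment ADD COLUMN note TEXT",
        "CREATE UNIQUE INDEX enrollment_code_uniq ON enrollment (code)",
    ]
    try:
        apply_migration(conn, "0001_big", big_migration)
    except sqlite3.IntegrityError:
        pass  # the migration aborts here, partway through

    # The new column exists, but the migration was never recorded as applied.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(enrollment)")]
    recorded = conn.execute("SELECT COUNT(*) FROM applied_migrations").fetchone()[0]

    # Re-running the same migration now fails on its very first operation,
    # because the column it tries to add is already there.
    try:
        apply_migration(conn, "0001_big", big_migration)
        retry_error = None
    except sqlite3.OperationalError as exc:
        retry_error = str(exc)

    return columns, recorded, retry_error
```

Had each operation lived in its own migration, the first would have been recorded as applied and only the second would need attention, which is the motivation for keeping migration files small.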

What did we learn?

  • Non-atomic Django-MySQL operations should be isolated in small migration files.

    • e.g. adding a column should be one (or maybe two, because of django-simple-history) migration file.

  • There are consequences to leaving our edge deployments in a bad state. Edge is a real system and should be treated as such.

    • Not all product-delivery members know what the edge environment is for, and it’s not always top-of-mind when considering consequences of our actions and decisions.

  • We have enterprise admin and learner portal “status banners” that are useful for communicating planned or unplanned loss of functionality to our users.

  • Adding our technical learnings to the migrations reference doc probably can’t hurt.

  • Upgrading to MySQL 8.0 can’t hurt.
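As an illustration of the first lesson, here is one way a unique-UUID rollout might be split into discrete phases. This is a sketch under assumptions, not the migration the team shipped. It uses sqlite3 instead of MySQL and a hypothetical `enrollment` table, and it assumes that application code which sets a UUID on every new row is deployed together with phase 1. In a real Django project each phase would be its own small migration file and deployment.

```python
import sqlite3
import uuid


def phase_1_add_nullable_column(conn):
    # No shared default: existing rows simply hold NULL until backfilled.
    conn.execute("ALTER TABLE enrollment ADD COLUMN uuid TEXT")


def insert_enrollment(conn, learner):
    """New application code (assumed deployed with phase 1) always sets uuid."""
    conn.execute(
        "INSERT INTO enrollment (learner, uuid) VALUES (?, ?)",
        (learner, str(uuid.uuid4())),
    )


def phase_2_backfill(conn, batch_size=2, on_batch=None):
    """Fill NULL uuids in small batches until none are left."""
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM enrollment WHERE uuid IS NULL LIMIT ?", (batch_size,)
            )
        ]
        if not ids:
            return
        for row_id in ids:
            conn.execute(
                "UPDATE enrollment SET uuid = ? WHERE id = ? AND uuid IS NULL",
                (str(uuid.uuid4()), row_id),
            )
        if on_batch:
            on_batch(conn)  # hook used below to simulate concurrent writes


def phase_3_add_unique_index(conn):
    conn.execute("CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)")


def run_phased_migration(existing_rows=5):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, learner TEXT)")
    conn.executemany(
        "INSERT INTO enrollment (learner) VALUES (?)",
        [(f"learner-{i}",) for i in range(existing_rows)],
    )

    phase_1_add_nullable_column(conn)

    # Simulate one user write arriving after every backfill batch.
    counter = iter(range(1000))
    phase_2_backfill(
        conn, on_batch=lambda c: insert_enrollment(c, f"new-{next(counter)}")
    )

    phase_3_add_unique_index(conn)

    total, distinct = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT uuid) FROM enrollment"
    ).fetchone()
    return total, distinct
```

Because concurrent writes already carry their own UUID, the backfill only has to drain the rows that are still NULL, and the unique index in phase 3 is created against data that is already unique.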
