RCA: Enterprise partially-applied licensed course enrollment migration
Date | Feb 16, 2023 |
Team(s) | Enterprise, SRE |
Summary | An enterprise migration that included adding a unique field failed partway through and was left in a partially-applied state on prod. |
Impact | edxapp production deployment pipeline blocked; one enterprise learner temporarily blocked from completing enrollment |
Incident Duration | 2 hours |
Publicly-shareable lesson learned |
|
Technical Cause and resolution
Deployment of an enterprise migration to prod failed. The migration, which added new tables and columns, was left in a partially-applied state. Notably, we added a new unique UUID field that ended up with non-unique values before the unique constraint was added, which caused the error. This happened in spite of following the steps in “Writing database migrations | Django documentation”.
We ran into a blue/green deploy, or “race condition”, issue because the manual addition of a UUID had to be applied to ~400k rows, which gave ample time for user activity to cause new LicensedEnterpriseCourseEnrollment records to be written.
Splitting the migration up into more atomic components and discrete deployments would have fixed (and ultimately did fix) the issue.
Later, we tried to deploy a new version of the migrations, which failed on the stage/edge environments, where we forgot to roll back the original, fully-applied migrations.
Original, faulty migration attempt:
Revert of the above:
Final, good attempt:
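As a rough illustration of why the single-step approach failed and how a phased rollout avoids the problem, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Django and MySQL. The table name enrollment, the helper names, and the sample rows are hypothetical, and the exact mechanism that produced duplicate values in production may have differed from this simulation.

```python
import sqlite3
import uuid


def make_table(conn):
    """Create a small stand-in table with a few pre-existing rows."""
    conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, learner TEXT)")
    conn.executemany(
        "INSERT INTO enrollment (learner) VALUES (?)", [("a",), ("b",), ("c",)]
    )


def faulty_single_step(conn):
    """Add the column with one shared default, then immediately make it unique."""
    shared_default = str(uuid.uuid4())  # a single UUID string shared by every row
    conn.execute(
        f"ALTER TABLE enrollment ADD COLUMN uuid TEXT DEFAULT '{shared_default}'"
    )
    try:
        conn.execute("CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)")
    except sqlite3.DatabaseError:
        return False  # duplicate values block the unique index
    return True


def backfill(conn):
    """Give each row that still lacks a UUID its own value; return rows touched."""
    rows = conn.execute("SELECT id FROM enrollment WHERE uuid IS NULL").fetchall()
    for (row_id,) in rows:
        conn.execute(
            "UPDATE enrollment SET uuid = ? WHERE id = ?", (str(uuid.uuid4()), row_id)
        )
    return len(rows)


def phased_rollout(conn):
    """Three separate steps, each of which would be its own deployment."""
    # Step 1: add a nullable column with no unique constraint.
    conn.execute("ALTER TABLE enrollment ADD COLUMN uuid TEXT")
    # Step 2: backfill. A row written mid-backfill by old code arrives without a UUID.
    backfill(conn)
    conn.execute("INSERT INTO enrollment (learner) VALUES ('d')")
    while backfill(conn):  # repeat until no rows are left to fill
        pass
    # Step 3: add the constraint only once the data is verifiably clean.
    # Assumption: by now all running code populates the field on write.
    missing = conn.execute(
        "SELECT COUNT(*) FROM enrollment WHERE uuid IS NULL"
    ).fetchone()[0]
    duplicates = conn.execute(
        "SELECT COUNT(*) FROM (SELECT uuid FROM enrollment "
        "GROUP BY uuid HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if missing or duplicates:
        return False
    conn.execute("CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)")
    return True
```

In this sketch the constraint is the last thing to land, so a late write can delay the final step but cannot leave the schema half-changed.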
What factors contributed to this incident?
Adding a unique UUID field to MySQL 5.7 via Django is complex; the default value of the column is actually a single, non-unique string in the format of a UUID.
Django migrations on MySQL 5.7 are non-atomic; if your migration operations fail partway through, the migration is left in a partially-applied state. Django considers the prior migration to be what’s “current” after the partial application.
“Django is pretty difficult to use correctly.” - Anon.
Having good recipes and decision trees could help.
Making smaller, atomic migration files would help.
Having well-broadcast knowledge of “the hard parts” might help.
We couldn’t successfully roll back stage migrations from GoCD.
We’re not sure if any edx-enterprise migration can be successfully rolled back on stage from GoCD.
We have a good resource about “how to do migrations right”, but it does not yet cover the use case of adding a unique UUID field to an existing table.
Is there a series of questions that authors of migrations can ask that would act as decision tree for “how many/what phases do I need for this migration to work?”
Nothing broke on local or stage.
Why did it not break on stage?
No traffic, and no load-testing occurred against it (load testing would have had to run while the uuid-generation part of the migration was running).
We often don’t understand the scale and usage profile of production vs. local/stage.
We’ve had similar incidents in the past (e.g. locks of tables) around scale and data contention.
Why can we not take downtime while migrations like this roll out?
Do we have a good definition of “downtime”?
Could we have made the downtime enterprise-specific?
Could we feature-flag edx-enterprise to make this possible?
Could we design our systems to be more eventually-consistent?
We’re very sensitive to downtime in the service this migration was applied to (it’s the LMS/edxapp).
If we could take down only enterprise, it’d be more flexible, smaller blast radius to unplanned outages, etc.
Can we turn enterprise off on edge?
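The partially-applied state described in the factors above can be reproduced in miniature. This is a simplified simulation, not Django's migration runner: it uses sqlite3 in autocommit mode as a stand-in for MySQL 5.7 committing each schema change on its own, and the applied_migrations table, the migration name, and the helper names are hypothetical.

```python
import sqlite3


def apply_migration(conn, name, statements):
    """Run each statement in order; record the migration only if all succeed."""
    for sql in statements:
        conn.execute(sql)  # autocommit: every statement is committed individually
    conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))


def try_apply(conn, name, statements):
    """Return 'applied' or a 'failed: ...' message instead of raising."""
    try:
        apply_migration(conn, name, statements)
        return "applied"
    except sqlite3.DatabaseError as exc:
        return f"failed: {exc}"


def column_names(conn, table):
    """List the column names currently present on a table."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# isolation_level=None puts the connection in autocommit mode (no wrapping transaction).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE applied_migrations (name TEXT)")
conn.execute("CREATE TABLE enrollment (id INTEGER PRIMARY KEY, learner TEXT)")
conn.execute("INSERT INTO enrollment (learner) VALUES ('a'), ('b')")

# One migration that bundles two operations: the second fails on duplicate values.
big_migration = [
    "ALTER TABLE enrollment ADD COLUMN uuid TEXT DEFAULT 'not-unique'",
    "CREATE UNIQUE INDEX enrollment_uuid_uniq ON enrollment (uuid)",
]

first_attempt = try_apply(conn, "add_uuid", big_migration)
recorded = conn.execute("SELECT COUNT(*) FROM applied_migrations").fetchone()[0]
leftover_columns = column_names(conn, "enrollment")

# Re-running the same migration now fails at the first step, because the
# column already exists even though the migration was never recorded.
second_attempt = try_apply(conn, "add_uuid", big_migration)
```

After the first attempt the uuid column exists but nothing is recorded as applied, so a retry trips over the leftover column. With one operation per migration file, a failure leaves at most that single operation to inspect and undo.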
What did we learn?
Non-atomic Django-MySQL operations should be isolated in small migration files.
e.g. adding a column should be one (or maybe two, because of django-simple-history) migration file.
There are consequences to leaving our edge deployments in a bad state. Edge is a real system and should be treated as such.
Not all product-delivery members know what the edge environment is for, and it’s not always top-of-mind when considering consequences of our actions and decisions.
We have enterprise admin and learner portal “status banners” that are useful for communicating planned or unplanned loss of functionality to our users.
Adding our technical learnings to the migrations reference doc probably can’t hurt.
Upgrading to MySQL 8.0 can’t hurt.