Django 1.11 rollout/rollback plan
Rollout
This plan involves preemptively expanding the username column to varchar(150).  We would not be vulnerable to ungraceful registration failures since usernames larger than 30 characters (let alone 150) will not pass ORM validation anyway. Downtime taken is uncoupled from the 1.11 deployment. We have confirmed that Django 1.11 will not validate and fail if the column type is already changed to varchar(150) before migrating. Also, running the same exact ALTER TABLE query twice in a row does not cause mysql to lock/rebuild the table the second time.
Contacts
- Operator:Â Joe Mulloy (Deactivated)
- Backup:Â Fred Smith (Deactivated)
- Sribe:Â Troy Sankey
- Comms:Â Brian Beggs
- StatusPages.IO/Backup Comms:Â Cory Lee (Deactivated)
AMIs
Environment | Django | AMI ID |
---|---|---|
stage-edx-edxapp | 1.10 | ami-b978a0c4 |
prod-edx-edxapp | 1.10 | ami-d968b0a4 |
prod-edge-edxapp | 1.10 | ami-d1548cac |
stage-edx-edxapp | 1.11 | ami-1e5a8263 |
prod-edx-edxapp | 1.11 | ami-4c5f8731 |
prod-edge-edxapp | 1.11 | ami-d25981af |
Steps:
- Manually pause all of the following pipelines to prevent automatically running migrations under Django 1.11
- STAGE_edxapp_M-D
- PROD_edx_edxapp_M-D
- PROD_edge_edxapp_M-D
Branch off the Django 1.11 branch, and name it "pwnage101/django-1.10-from-rc". Add a new commit containing a django 1.10 version override:
git checkout pwnage101/bump-django-to-1.11 git checkout -b pwnage101/django-1.10-from-rc echo 'Django==1.10.8' >requirements/edx/django.txt git add requirements/edx/django.txt git commit -m 'change django to 1.10 to mitigate 1.11 migration bug' -m 'see https://code.djangoproject.com/ticket/29193' git push --set-upstream origin pwnage101/django-1.10-from-rc
- Use alton to construct 1.10 AMIs based on each new 1.11 AMI:
@alton cut ami for stage-edx-edxapp from stage-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49 using ami-43a15f3e
@alton cut ami for prod-edx-edxapp from prod-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49 using ami-43a15f3e
@alton cut ami for prod-edge-edxapp from prod-edge-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=fad8d710d6b9bd5c3f8e1d99c59999110d665689 configuration_internal_version=e22d3c5d08a571030923f49cfd0531a67cfda92b using ami-43a15f3e
Merge Django 1.11 version bump and migrations to edx-platform release-candidate
- use existing branch pwnage101/bump-django-to-1.11
- Wait for the Django 1.11 AMIs to be built by the pipeline:
- For stage-edx-edxapp:
- From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
- make sure to use the same instance type (c4.2xlarge)
- change the AMI to the AMI for stage-edx-edxapp Django 1.10. Refer to the table above
- delete the tag key "services" so that the worker app won't actually start
- append "-django-1.10" to the the instance identifier
- change the security groups to:
- stage-edx-WorkerServerSecurityGroup-93NZAY5JJ3DM
- launch the instance
- Login to the new instance lauched in the previous step
- create a tmux session
obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/stage-edx-edxapp.yml#L19-L21
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.stage.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
- Monitor New Relic for slow response times
When you notice slower response times stop supervisor on all stage-edx-edxapp machines.
ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
- Wait for the migration to finish in the tmux session.
- Re-enable supervisor
ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.stage.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/stage-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+
- Unpause the stage_edx_deploy_M-D pipeline. This will trigger the remaining Django 1.11 migrations and run the deployment on stage.
- From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
- For prod-edx-edxapp:
- From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
- make sure to use the same instance type (c4.2xlarge)
- change the AMI to the AMI for prod-edx-edxapp Django 1.10. Refer to the table above
- delete the tag key "services" so that the worker app won't actually start
- append "-django-1.10" to the the instance identifier
- change the security groups to:
- prod-edx-WorkerServerSecurityGroup-1BXCIEREYRREX
- launch the instance
- Login to the new instance lauched in the previous step
- create a tmux session
obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/prod-edx-edxapp.yml#L20-L22
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
- Monitor New Relic for slow response times
When you notice slower response times stop supervisor on all prod-edx-edxapp machines to bring up the maintenance page.
ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
- Wait for the migration to finish in the tmux session.
Re-enable supervisor
ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+
Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+
- Unpause the prod_edx_edxapp_deploy_M_D pipeline. This will trigger the remaining Django 1.11 migrations and run the deployment to prod.
- at 12:00 EDT take down the system maintenance banner using the Global Status Message configuration: http://courses.edx.org/admin
- From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
- For prod-edge-edxapp:
- Switch to the edge AWS account
- From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
- make sure to use the same instance type (c4.2xlarge)
- change the AMI to the AMI for prod-edx-edxapp Django 1.10. Refer to the table above
- delete the tag key "services" so that the worker app won't actually start
- append "-django-1.10" to the the instance identifier
- change the security groups to:
- prod-edge-WorkerServerSecurityGroup-CIXQOM99GRRI
- launch the instance
- Login to the new instance lauched in the previous step
- create a tmux session
obtain the DB migrate username and password from https://github.com/edx-ops/edge-secure/blob/master/ansible/vars/db/prod-edge-edxapp.yml#L19-L21
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://edge.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
- Monitor New Relic for slow response times
When you notice slower response times stop supervisor on all prod-edge-edxapp machines to bring up the maintenance page.
ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
- Wait for the migration to finish in the tmux session.
Re-enable supervisor
ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
Deactivate this waffle switch (re-enabling account creations and password changes) from https://edge.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edge-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+
- Unpause the prod_edge_edxapp_deploy_M_D pipeline. This will trigger the remaining Django 1.11 migrations and run the deployment on edge.
Rollback
Migration considerations: DO NOT ROLLBACK THE MIGRATIONS If the migration auth/0008_alter_user_username_max_length was deployed (username column width changed to 150), do not reverse/rollback this even in the event of a code rollback because it requires downtime to change. The third_party_auth/0015_remove_icon_class_image_secondary_fields migration drops columns, but in a table which is currently empty. The remaining migrations are largely no-ops, so they can be left in the database as ghost migrations for a short period of time until a fix is developed.
Causes for concern/Possible rollback reasons:
- There was a non-backwards-compatible change to CSRF token generation introduced in Django >1.8 that we attempted to shim using a 3rd party library (https://github.com/edx/edx-platform/pull/16812).
- Load tests revealed an upward linear trend in memory consumption over the course of 10 hours, rather than a plateau:
- Additional test was able to push memory consumption up to 11.5GB without plateauing:
Steps:
Adapted from LMS/Studio Rollback#Code.
Rollback edxapp code from Django 1.11 to Django 1.8
Migrations overview
The following migrations would be created by upgrading from Django 1.8 to 1.11:
app | name | django version | edx-platform table? * | notes |
---|---|---|---|---|
admin | 0002_logentry_remove_auto_add | 1.9 | no | This is a DB no-op. |
auth | 0007_alter_validators_add_error_messages | 1.9 | no | Unclear to me if this is a no-op, but it certainly was fast in loadtest. |
auth | 0008_alter_user_username_max_length | 1.10 | no | Super time consuming in loadtest! |
sites | 0002_alter_domain_unique | 1.9 | no | No-op in prod DB because domains are already unique. |
certificates | 0014_change_eligible_certs_manager | <= 1.10 | yes | |
course_modes | 0011_change_regex_for_comma_separated_ints | <= 1.10 | yes | New CSV validation will pass, and this migration will not modify the DB. |
third_party_auth | 0015_remove_icon_class_image_secondary_fields | <= 1.10 | yes | Drops three columns from table "third_party_auth_ltiproviderconfig"; should be super quick since this table is empty in prod. |
* = i.e. is this migration committed to edx-platform codebase.
The auth/0008_alter_user_username_max_length migration took over 1 hour in loadtest before we terminated it. This is the SQL corresponding to the migration:
$ ./manage.py lms --settings=aws sqlmigrate auth 0008_alter_user_username_max_length BEGIN; -- -- Alter field username on user -- ALTER TABLE `auth_user` MODIFY `username` varchar(150) NOT NULL; COMMIT;
This migration causes a temp auth_user table to be rebuilt, all while locking the table. This operation took over an hour in loadtest, presumably because there are so many automatically generated test users. Running this query against an RDS instance restored from a recent prod snapshot took 22 minutes:
The third_party_auth/0015_remove_icon_class_image_secondary_fields migration has rollback implications because it drops columns from a table. According to Everything About Database Migrations, we normally avoid dropping columns, but this migration stems from a model derived from Django itself which we have little control over. Fortunately, this table is currently EMPTY in prod, so In the case of a code rollback back to Django 1.8 this table should be trivial to roll back as well.