This plan involves preemptively expanding the username column to varchar(150). We would not be vulnerable to ungraceful registration failures since usernames larger than 30 characters (let alone 150) will not pass ORM validation anyway. Downtime taken is uncoupled from the 1.11 deployment. We have confirmed that Django 1.11 will not validate and fail if the column type is already changed to varchar(150) before migrating. Also, running the same exact ALTER TABLE query twice in a row does not cause mysql to lock/rebuild the table the second time.
Environment | Django | AMI ID |
---|---|---|
stage-edx-edxapp | 1.10 | ami-b978a0c4 |
prod-edx-edxapp | 1.10 | ami-d968b0a4 |
prod-edge-edxapp | 1.10 | ami-d1548cac |
stage-edx-edxapp | 1.11 | ami-1e5a8263 |
prod-edx-edxapp | 1.11 | ami-4c5f8731 |
prod-edge-edxapp | 1.11 | ami-d25981af |
Steps:
Branch off the Django 1.11 branch, and name it "pwnage101/django-1.10-from-rc". Add a new commit containing a django 1.10 version override:
git checkout pwnage101/bump-django-to-1.11 git checkout -b pwnage101/django-1.10-from-rc echo 'Django==1.10.8' >requirements/edx/django.txt git add requirements/edx/django.txt git commit -m 'change django to 1.10 to mitigate 1.11 migration bug' -m 'see https://code.djangoproject.com/ticket/29193' git push --set-upstream origin pwnage101/django-1.10-from-rc |
@alton cut ami for stage-edx-edxapp from stage-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49 using ami-43a15f3e
@alton cut ami for prod-edx-edxapp from prod-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49 using ami-43a15f3e
@alton cut ami for prod-edge-edxapp from prod-edge-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=fad8d710d6b9bd5c3f8e1d99c59999110d665689 configuration_internal_version=e22d3c5d08a571030923f49cfd0531a67cfda92b using ami-43a15f3e
Merge Django 1.11 version bump and migrations to edx-platform release-candidate
obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/stage-edx-edxapp.yml#L19-L21
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth |
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length |
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.stage.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth |
When you notice slower response times stop supervisor on all stage-edx-edxapp machines.
ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*" |
ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*" |
Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.stage.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/stage-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+ |
obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/prod-edx-edxapp.yml#L20-L22
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth |
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length |
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth |
When you notice slower response times stop supervisor on all prod-edx-edxapp machines to bring up the maintenance page.
ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*" |
Re-enable supervisor
ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*" |
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+ |
Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edx-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+ |
obtain the DB migrate username and password from https://github.com/edx-ops/edge-secure/blob/master/ansible/vars/db/prod-edge-edxapp.yml#L19-L21
confirm django_migrations state:
read DB_MIGRATION_PASS export DB_MIGRATION_PASS DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth |
Confirm that only 2 migrations have not been run(0007 and 0008)
[X] 0001_initial [X] 0002_alter_permission_name_max_length [X] 0003_alter_user_email_max_length [X] 0004_alter_user_username_opts [X] 0005_alter_user_last_login_null [X] 0006_require_contenttypes_0002 [ ] 0007_alter_validators_add_error_messages [ ] 0008_alter_user_username_max_length |
Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://edge.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
Migrate the auth app:
DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth |
When you notice slower response times stop supervisor on all prod-edge-edxapp machines to bring up the maintenance page.
ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*" |
Re-enable supervisor
ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*" |
Deactivate this waffle switch (re-enabling account creations and password changes) from https://edge.edx.org/admin/waffle/switch/
user_api.prevent_auth_user_writes |
On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:
$ /edx/bin/prod-edge-edxapp-mysql.sh mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username'; +--------------+ | COLUMN_TYPE | +--------------+ | varchar(150) | +--------------+ |
Migration considerations: DO NOT ROLLBACK THE MIGRATIONS If the migration auth/0008_alter_user_username_max_length was deployed (username column width changed to 150), do not reverse/rollback this even in the event of a code rollback because it requires downtime to change. The third_party_auth/0015_remove_icon_class_image_secondary_fields migration drops columns, but in a table which is currently empty. The remaining migrations are largely no-ops, so they can be left in the database as ghost migrations for a short period of time until a fix is developed.
Causes for concern/Possible rollback reasons:
Steps:
Adapted from LMS/Studio Rollback#Code.
Rollback edxapp code from Django 1.11 to Django 1.8
The following migrations would be created by upgrading from Django 1.8 to 1.11:
app | name | django version | edx-platform table? * | notes |
---|---|---|---|---|
admin | 0002_logentry_remove_auto_add | 1.9 | no | This is a DB no-op. |
auth | 0007_alter_validators_add_error_messages | 1.9 | no | Unclear to me if this is a no-op, but it certainly was fast in loadtest. |
auth | 0008_alter_user_username_max_length | 1.10 | no | Super time consuming in loadtest! |
sites | 0002_alter_domain_unique | 1.9 | no | No-op in prod DB because domains are already unique. |
certificates | 0014_change_eligible_certs_manager | <= 1.10 | yes | |
course_modes | 0011_change_regex_for_comma_separated_ints | <= 1.10 | yes | New CSV validation will pass, and this migration will not modify the DB. |
third_party_auth | 0015_remove_icon_class_image_secondary_fields | <= 1.10 | yes | Drops three columns from table "third_party_auth_ltiproviderconfig"; should be super quick since this table is empty in prod. |
* = i.e. is this migration committed to edx-platform codebase.
The auth/0008_alter_user_username_max_length migration took over 1 hour in loadtest before we terminated it. This is the SQL corresponding to the migration:
$ ./manage.py lms --settings=aws sqlmigrate auth 0008_alter_user_username_max_length BEGIN; -- -- Alter field username on user -- ALTER TABLE `auth_user` MODIFY `username` varchar(150) NOT NULL; COMMIT; |
This migration causes a temp auth_user table to be rebuilt, all while locking the table. This operation took over an hour in loadtest, presumably because there are so many automatically generated test users. Running this query against an RDS instance restored from a recent prod snapshot took 22 minutes:
The third_party_auth/0015_remove_icon_class_image_secondary_fields migration has rollback implications because it drops columns from a table. According to Everything About Database Migrations, we normally avoid dropping columns, but this migration stems from a model derived from Django itself which we have little control over. Fortunately, this table is currently EMPTY in prod, so In the case of a code rollback back to Django 1.8 this table should be trivial to roll back as well.