Django 1.11 rollout/rollback plan

Rollout

This plan preemptively expands the username column to varchar(150). We would not be vulnerable to ungraceful registration failures, since usernames longer than 30 characters (let alone 150) will not pass ORM validation anyway. The downtime taken is decoupled from the 1.11 deployment. We have confirmed that the Django 1.11 migration does not fail if the column type has already been changed to varchar(150) before migrating. Also, running the exact same ALTER TABLE query twice in a row does not cause MySQL to lock or rebuild the table the second time.
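
The "has the column already been widened?" decision described above can be expressed as a small guard. This is only a minimal sketch: the function name username_column_needs_alter is hypothetical, and it assumes the COLUMN_TYPE string has already been fetched with the INFORMATION_SCHEMA query shown in the steps below.

```shell
# Hypothetical helper: decide whether auth_user.username still needs widening.
# $1 is the COLUMN_TYPE string returned by INFORMATION_SCHEMA.COLUMNS.
# Returns 0 if an ALTER is still needed, 1 if the column is already
# varchar(150), and 2 if the type is something unexpected.
username_column_needs_alter() {
    case "$1" in
        "varchar(150)") return 1 ;;   # already widened, nothing to do
        "varchar("*")") return 0 ;;   # some other width, ALTER required
        *) echo "unexpected column type: $1" >&2; return 2 ;;
    esac
}
```

How the column type is obtained non-interactively depends on the environment, so that part is left out of the sketch.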

Contacts

AMIs

Environment       | Django | AMI ID
stage-edx-edxapp  | 1.10   | ami-b978a0c4
prod-edx-edxapp   | 1.10   | ami-d968b0a4
prod-edge-edxapp  | 1.10   | ami-d1548cac
stage-edx-edxapp  | 1.11   | ami-1e5a8263
prod-edx-edxapp   | 1.11   | ami-4c5f8731
prod-edge-edxapp  | 1.11   | ami-d25981af

Steps:

  • Manually pause all of the following pipelines to prevent automatically running migrations under Django 1.11
    • STAGE_edxapp_M-D
    • PROD_edx_edxapp_M-D
    • PROD_edge_edxapp_M-D
  • Branch off the Django 1.11 branch, and name the new branch "pwnage101/django-1.10-from-rc".  Add a new commit containing a Django 1.10 version override:

    git checkout pwnage101/bump-django-to-1.11
    git checkout -b pwnage101/django-1.10-from-rc
    echo 'Django==1.10.8' >requirements/edx/django.txt
    git add requirements/edx/django.txt
    git commit -m 'change django to 1.10 to mitigate 1.11 migration bug' -m 'see https://code.djangoproject.com/ticket/29193'
    git push --set-upstream origin pwnage101/django-1.10-from-rc 
  • Use alton to construct 1.10 AMIs based on each new 1.11 AMI:
    • @alton cut ami for stage-edx-edxapp from stage-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49 using ami-43a15f3e
    • @alton cut ami for prod-edx-edxapp from prod-edx-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=92a03ff886760c9f11ac3211397eee1463353eff configuration_internal_version=3b91306e02a45df855b2783313d7a21a971d3a49  using ami-43a15f3e
    • @alton cut ami for prod-edge-edxapp from prod-edge-edxapp with edx_platform_version=pwnage101/django-1.10-from-rc configuration=72b759b73a6e76b723f7dc403c1d0223bd91c1f0 configuration_secure=fad8d710d6b9bd5c3f8e1d99c59999110d665689 configuration_internal_version=e22d3c5d08a571030923f49cfd0531a67cfda92b using ami-43a15f3e
  • Merge Django 1.11 version bump and migrations to edx-platform release-candidate
    • use existing branch pwnage101/bump-django-to-1.11
  • Wait for the Django 1.11 AMIs to be built by the pipeline.
  • For stage-edx-edxapp:
    • From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
      • make sure to use the same instance type (c4.2xlarge)
      • change the AMI to the AMI for stage-edx-edxapp Django 1.10.  Refer to the table above
      • delete the tag key "services" so that the worker app won't actually start
      • append "-django-1.10" to the instance identifier
      • change the security groups to:
        • stage-edx-WorkerServerSecurityGroup-93NZAY5JJ3DM
      • launch the instance
    • Log in to the new instance launched in the previous step
      • create a tmux session
      • obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/stage-edx-edxapp.yml#L19-L21

      • confirm django_migrations state:

        read DB_MIGRATION_PASS
        export DB_MIGRATION_PASS
        DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
        
        
        
      • Confirm that only 2 migrations have not been run (0007 and 0008)

         [X] 0001_initial
         [X] 0002_alter_permission_name_max_length
         [X] 0003_alter_user_email_max_length
         [X] 0004_alter_user_username_opts
         [X] 0005_alter_user_last_login_null
         [X] 0006_require_contenttypes_0002
         [ ] 0007_alter_validators_add_error_messages
         [ ] 0008_alter_user_username_max_length
    • Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.stage.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes


    • Migrate the auth app:

      DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
    • Monitor New Relic for slow response times
      • When you notice slower response times, stop supervisor on all stage-edx-edxapp machines.

        ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
      • Wait for the migration to finish in the tmux session.
      • Re-enable supervisor
        ansible -i stage-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
    • Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.stage.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes
    • On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:

      $ /edx/bin/stage-edx-edxapp-mysql.sh
      mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS
                 WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username';
      +--------------+
      | COLUMN_TYPE  |
      +--------------+
      | varchar(150) |
      +--------------+
    • Unpause the stage_edx_deploy_M-D pipeline.  This will trigger the remaining Django 1.11 migrations and run the deployment on stage.
  • For prod-edx-edxapp:
    • From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
      • make sure to use the same instance type (c4.2xlarge)
      • change the AMI to the AMI for prod-edx-edxapp Django 1.10.  Refer to the table above
      • delete the tag key "services" so that the worker app won't actually start
      • append "-django-1.10" to the instance identifier
      • change the security groups to:
        • prod-edx-WorkerServerSecurityGroup-1BXCIEREYRREX
      • launch the instance
    • Log in to the new instance launched in the previous step
      • create a tmux session
      • obtain the DB migrate username and password from https://github.com/edx-ops/edx-secure/blob/master/ansible/vars/db/prod-edx-edxapp.yml#L20-L22

      • confirm django_migrations state:

        read DB_MIGRATION_PASS
        export DB_MIGRATION_PASS
        DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
      • Confirm that only 2 migrations have not been run (0007 and 0008)

         [X] 0001_initial
         [X] 0002_alter_permission_name_max_length
         [X] 0003_alter_user_email_max_length
         [X] 0004_alter_user_username_opts
         [X] 0005_alter_user_last_login_null
         [X] 0006_require_contenttypes_0002
         [ ] 0007_alter_validators_add_error_messages
         [ ] 0008_alter_user_username_max_length
    • Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://courses.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes


    • Migrate the auth app:

      DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
    • Monitor New Relic for slow response times
      • When you notice slower response times, stop supervisor on all prod-edx-edxapp machines to bring up the maintenance page.

        ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
      • Wait for the migration to finish in the tmux session.
      • Re-enable supervisor

        ansible -i prod-edx-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
    • On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:

      $ /edx/bin/prod-edx-edxapp-mysql.sh
      mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS
                 WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username';
      +--------------+
      | COLUMN_TYPE  |
      +--------------+
      | varchar(150) |
      +--------------+
    • Deactivate this waffle switch (re-enabling account creations and password changes) from https://courses.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes
    • Unpause the prod_edx_edxapp_deploy_M_D pipeline.  This will trigger the remaining Django 1.11 migrations and run the deployment to prod.
    • At 12:00 EDT, take down the system maintenance banner using the Global Status Message configuration: http://courses.edx.org/admin
  • For prod-edge-edxapp:
    • Switch to the edge AWS account
    • From the EC2 console, launch a new instance "like" an existing Django 1.8 edxapp worker, but with a few tweaks:
      • make sure to use the same instance type (c4.2xlarge)
      • change the AMI to the AMI for prod-edge-edxapp Django 1.10.  Refer to the table above
      • delete the tag key "services" so that the worker app won't actually start
      • append "-django-1.10" to the instance identifier
      • change the security groups to:
        • prod-edge-WorkerServerSecurityGroup-CIXQOM99GRRI
      • launch the instance
    • Log in to the new instance launched in the previous step
      • create a tmux session
      • obtain the DB migrate username and password from https://github.com/edx-ops/edge-secure/blob/master/ansible/vars/db/prod-edge-edxapp.yml#L19-L21

      • confirm django_migrations state:

        read DB_MIGRATION_PASS
        export DB_MIGRATION_PASS
        DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms --list auth
      • Confirm that only 2 migrations have not been run (0007 and 0008)

         [X] 0001_initial
         [X] 0002_alter_permission_name_max_length
         [X] 0003_alter_user_email_max_length
         [X] 0004_alter_user_username_opts
         [X] 0005_alter_user_last_login_null
         [X] 0006_require_contenttypes_0002
         [ ] 0007_alter_validators_add_error_messages
         [ ] 0008_alter_user_username_max_length
    • Activate this waffle switch (account creations and password changes will be disabled on edx.org) from https://edge.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes


    • Migrate the auth app:

      DB_MIGRATION_USER=migrate001 time /edx/bin/edxapp-migrate-cms auth
    • Monitor New Relic for slow response times
      • When you notice slower response times, stop supervisor on all prod-edge-edxapp machines to bring up the maintenance page.

        ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl stop edxapp:*"
      • Wait for the migration to finish in the tmux session.
      • Re-enable supervisor

        ansible -i prod-edge-inventory.ini all -m shell -a "/edx/bin/supervisorctl start edxapp:*"
    • Deactivate this waffle switch (re-enabling account creations and password changes) from https://edge.edx.org/admin/waffle/switch/

      user_api.prevent_auth_user_writes
    • On the read replica (tools-gp.edx.org), confirm that the auth_user table was successfully altered:

      $ /edx/bin/prod-edge-edxapp-mysql.sh
      mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS
                 WHERE TABLE_SCHEMA = 'wwc' AND TABLE_NAME = 'auth_user' AND COLUMN_NAME = 'username';
      +--------------+
      | COLUMN_TYPE  |
      +--------------+
      | varchar(150) |
      +--------------+
    • Unpause the prod_edge_edxapp_deploy_M_D pipeline.  This will trigger the remaining Django 1.11 migrations and run the deployment on edge.
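
The "only 0007 and 0008 are pending" check in the steps above is done by eye, but it can also be scripted. The following is a minimal sketch, not part of the original procedure: the function names pending_migrations and only_expected_pending are hypothetical, and the sketch assumes the listing uses the "[X]" / "[ ]" format shown above.

```shell
# List the migrations that the listing reports as unapplied ("[ ]" rows).
# Reads the listing on stdin and prints one migration name per line.
pending_migrations() {
    sed -n 's/^[[:space:]]*\[ \][[:space:]]*//p'
}

# Succeed only if the pending set is exactly the two expected auth migrations,
# in the order the listing prints them.
only_expected_pending() {
    expected='0007_alter_validators_add_error_messages
0008_alter_user_username_max_length'
    [ "$(pending_migrations)" = "$expected" ]
}

# Example use on the migration instance (command taken from the steps above):
#   DB_MIGRATION_USER=migrate001 /edx/bin/edxapp-migrate-cms --list auth \
#       | only_expected_pending && echo "safe to migrate auth"
```

Lines that are not migration rows (for example an app-name header) are ignored, so the check depends only on the "[ ]" rows.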

Rollback

Migration considerations: DO NOT ROLL BACK THE MIGRATIONS. If the migration auth/0008_alter_user_username_max_length was deployed (username column width changed to 150), do not reverse it, even in the event of a code rollback, because reversing it requires downtime. The third_party_auth/0015_remove_icon_class_image_secondary_fields migration drops columns, but from a table that is currently empty. The remaining migrations are largely no-ops, so they can be left in the database as ghost migrations for a short period of time until a fix is developed.
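
One way to make the "do not reverse auth/0008" rule harder to break by accident is a guard in front of any manual migrate invocation. This is a hedged sketch only: migration_target_allowed is a hypothetical helper, not an existing script, and it is deliberately conservative (it rejects every auth target before 0008, including the "zero" target, regardless of what is currently applied).

```shell
# Hypothetical guard: reject a migrate target that would reverse
# auth/0008_alter_user_username_max_length.
# $1 = app label, $2 = migration target (e.g. 0007_..., 0008_..., zero, or empty).
migration_target_allowed() {
    [ "$1" = "auth" ] || return 0      # only the auth app is guarded here
    [ -n "$2" ] || return 0            # no target means migrate forward
    case "$2" in
        zero) return 1 ;;              # unapplies every auth migration
        000[0-7]*) return 1 ;;         # any target before 0008 would reverse it
        *) return 0 ;;
    esac
}
```

A wrapper could call this before passing the app label and target through to the migrate command, and abort when it returns non-zero.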

Causes for concern/Possible rollback reasons:

  • There was a non-backwards-compatible change to CSRF token generation introduced in Django >1.8 that we attempted to shim using a 3rd party library (https://github.com/edx/edx-platform/pull/16812).
  • Load tests revealed an upward linear trend in memory consumption over the course of 10 hours, rather than a plateau:
    • An additional test was able to push memory consumption up to 11.5GB without plateauing.

Steps:

Adapted from LMS/Studio Rollback#Code.

Rollback edxapp code from Django 1.11 to Django 1.8

Migrations overview

The following migrations would be created by upgrading from Django 1.8 to 1.11:

app              | name                                              | django version | edx-platform table? * | notes
admin            | 0002_logentry_remove_auto_add                     | 1.9            | no                    | This is a DB no-op.
auth             | 0007_alter_validators_add_error_messages          | 1.9            | no                    | Unclear to me if this is a no-op, but it certainly was fast in loadtest.
auth             | 0008_alter_user_username_max_length               | 1.10           | no                    | Super time consuming in loadtest!
sites            | 0002_alter_domain_unique                          | 1.9            | no                    | No-op in prod DB because domains are already unique.
certificates     | 0014_change_eligible_certs_manager                | <= 1.10        | yes                   |
course_modes     | 0011_change_regex_for_comma_separated_ints        | <= 1.10        | yes                   | New CSV validation will pass, and this migration will not modify the DB.
third_party_auth | 0015_remove_icon_class_image_secondary_fields     | <= 1.10        | yes                   | Drops three columns from table "third_party_auth_ltiproviderconfig"; should be super quick since this table is empty in prod.

* = whether this migration is committed to the edx-platform codebase.

The auth/0008_alter_user_username_max_length migration took over 1 hour in loadtest before we terminated it.  This is the SQL corresponding to the migration:

$ ./manage.py lms --settings=aws sqlmigrate auth 0008_alter_user_username_max_length
BEGIN;
--
-- Alter field username on user
--
ALTER TABLE `auth_user` MODIFY `username` varchar(150) NOT NULL;
COMMIT;

This migration rebuilds auth_user by way of a temporary table, and the table stays locked for the duration.  This operation took over an hour in loadtest, presumably because there are so many automatically generated test users.  Running this query against an RDS instance restored from a recent prod snapshot took 22 minutes.

The third_party_auth/0015_remove_icon_class_image_secondary_fields migration has rollback implications because it drops columns from a table.  According to Everything About Database Migrations, we normally avoid dropping columns, but this migration stems from a model derived from Django itself, which we have little control over.  Fortunately, this table is currently EMPTY in prod, so in the case of a code rollback to Django 1.8 this table should be trivial to roll back as well.