CSM - gh-ost Runbook

See PLAT-2437. This is a runbook for running gh-ost to alter the id field of courseware_studenthistory in edx-platform as safely and with as little downtime as possible.


Questions we need to answer before running this:

  1. What changes do we need to make to support both versions of CSMH? None that I can tell.
    1. Removing the FKs will likely mean that we need to use the integer IDs directly instead of using StudentModule objects, for example. We don't need to actually remove the FKs on the Django models, we can just set them to db_constraint=False, db_index=True and remove the actual FKs in the database. Django seems to handle to this just fine.
    2. Also manually handling deletes if they're cascading right now. It seems like Django handles the cascading deletes so nothing needs to be done here.
  2. Before or after we upgrade CSM we will also need to migrate both CSMH's to have an appropriately sized column for CSM fake foreign keys. Does needing to perform maintenance that more than doubles the size of that table cause us to re-evaluate this plan? This is a question for Joe Mulloy (Deactivated).
  3. What do we need to do to ensure that Django migrations continue to work throughout this process? Especially after we switch over to the new table?
    1. Will Django notice that the column type has changed? No.
    2. Can we fake a migration that gets Django's state back in sync with reality? Yes, though it seems like if we have already dropped the FKs it's a SQL no-op anyway.
      1. At what point do we do that for each of the tables that matter (courseware_studentmodule, courseware_studentmodulehistory, courseware_studentmodulehistoryextended)? Adding this to the runbook.


Per-environment runbook (in progress)

  1. Have we pruned coursewarehistoryextended_studentmodulehistoryextended? Should we?
  2. In MySQL make sure we have dropped the FK from courseware_studentmodule to courseware_studentmodulehistory
    1. This is the unused "local" table, not the extended one in it's own database.
    2. It is confirmed to be empty on stage, edge, and prod read replicas. Not sure if they are actually empty on masters.
  3. In MySQL alter courseware_studentmodulehistory to have a BIGINT student_module_id
    1. Again this should be empty and unused
  4. Do test gh-ost runs for courseware_studentmodule
    1. Use the gh-ost cheat sheet for "Connect to replica, migrate on master": https://github.com/github/gh-ost/blob/master/doc/cheatsheet.md#a-connect-to-replica-migrate-on-master
      1. Confirm the environment is set up correctly
        1. Is using Statement Based Replication
        2. Replica we want to use is configured with binary logs enabled (log_bin, log_slave_updates)
        3. binlog_format=ROW (gh-ost can apply the latter for you)
      2. Confirm the RDS environment has enough resources for two copies of the table!
    2. Perform a no-op test run to make sure everything is set up correctly
      1. The "alter" flag we want to run is: "change id bigint unsigned not null auto_increment"

    3. Test run the migration on a replica only
      1. Monitor and see if the throttling variables need to be tweaked
      2. Confirm that the data all looks good and makes sense
  5. Do test gh-ost runs for coursewarehistoryextended_studentmodulehistoryextended (the one in its own database)
    1. This can be done at the same time as the studentmodule tests to save time
    2. The playbook is the same as studentmodule except for the "alter" flag, database, and table to run on):
      1. The "alter" flag we want to run is: "change student_module_id bigint unsigned not null"
  6. Run the courseware_studentmodule migration on the master
    1. Monitor and tweak throttling as needed
    2. We should use the cut over flag file to make sure we are in the office and ready when it comes time to flip the switch: https://github.com/github/gh-ost/blob/master/doc/command-line-flags.md#postpone-cut-over-flag-file
    3. Confirm that the data looks good
    4. Cut over to the new table and monitor
  7. Run the coursewarehistoryextended_studentmodulehistoryextended migration on the master
    1. This can also be done at the same time as the studentmodule migration to save time
    2. Play is the same as studentmodule
  8. Trickle delete the rows from the old coursewar e_studentmodule table, wait some time, then drop it: https://github.com/github/gh-ost/issues/307
    1. We will likely want to create a custom script / management command for the deletes
  9. Trickle delete the rows from the old coursewarehistoryextended_studentmodulehistoryextended table, wait some time, then drop it: https://github.com/github/gh-ost/issues/307
    1. Again this can be done at the same time as the deletes on studentmodule
    2. We will likely want to create a custom script / management command for the deletes
  10. Create Django migrations and fake them to get Django up to date with the changes:
    1. StudentModule model:
      1. ADD
        1. id = UnsignedBigIntAutoField(primary_key=True) # pylint: disable=invalid-name
      2. This replaces the default Django id
      3. This also includes the ALTER for StudentModuleHistory
    2. StudentModuleHistory model:
      1. CHANGE
        1. student_module = models.ForeignKey(StudentModule, db_index=True, db_constraint=False, on_delete=models.CASCADE)
    3. StudentModuleHistoryExtended model:
      1. Does not seem to need / create a migration or changes to the model definition
    4. Once these migrations are merged we should follow the usual procedure for faking migrations. We have one of those, right? It looks like we did some for Ginkgo.
  11. Profit