See PLAT-2437. Courseware's Student Module table is very big (2.1tb) and running out of signed integers for its id field. We've decided to increase the field to an unsigned bigint as a short term fix to allow more room for growth. We've looked at a few possible ways of migrating CSM's id field without causing excessive downtime or data loss:
- Just make a standard Django migration and use the usual process.
- This would likely result in days of downtime while the 2.1tb table was locked and copied over.
- Create a new MySQL cluster from a backup, let replication catch up, perform the alter on the new primary, wait for replication to catch up again, swap clusters.
- This could work, but would be expensive and could cause potential downtime / data loss during the cluster swap.
- Create a ghost table, perform the alter on it, add triggers for row modifications, and copy data to it in the background. When it's caught up, rename the table into place. (ex: pt-online-schema-change)
- Triggers are known to be flaky and can cause some issues in these scenarios: https://github.com/github/gh-ost/blob/master/doc/why-triggerless.md
- GitHub's MySQL schema migration tool: https://github.com/github/gh-ost
- Seems to provide everything we're looking for with a sensible ability to back off the migration under load, and proven under high load
- There are caveats when using RDS and Aurora that may be an issue for us, but seem possible to work around: https://github.com/github/gh-ost/blob/master/doc/rds.md
Given these options we've chosen 4 as the safest, most cost-effective path forward and have stared a runbook for it. We have also tried to catalog all of the places where the table is currently touched in the platform.