CSM Primary Key Problems and Solutions
See PLAT-2437. Courseware's Student Module table is very big (2.1tb) and running out of signed integers for its id field. We've decided to increase the field to an unsigned bigint as a short term fix to allow more room for growth. We've looked at a few possible ways of migrating CSM's id field without causing excessive downtime or data loss:
- Just make a standard Django migration and use the usual process.
- This would likely result in days of downtime while the 2.1tb table was locked and copied over.
- Create a new MySQL cluster from a backup, let replication catch up, perform the alter on the new primary, wait for replication to catch up again, swap clusters.
- This could work, but would be expensive and could cause potential downtime / data loss during the cluster swap.
- Create a ghost table, perform the alter on it, add triggers for row modifications, and copy data to it in the background. When it's caught up, rename the table into place. (ex: pt-online-schema-change)
- Triggers are known to be flaky and can cause some issues in these scenarios: https://github.com/github/gh-ost/blob/master/doc/why-triggerless.md
- GitHub's MySQL schema migration tool: https://github.com/github/gh-ost
- Seems to provide everything we're looking for with a sensible ability to back off the migration under load, and proven under high load
- There are caveats when using RDS and Aurora that may be an issue for us, but seem possible to work around: https://github.com/github/gh-ost/blob/master/doc/rds.md
Given these options we've chosen 4 as the safest, most cost-effective path forward and have stared a runbook for it. We have also tried to catalog all of the places where the table is currently touched in the platform.
For any solution we will probably have to drop the foreign key from the edxapp courseware_studentmodulehistory (not the one in its own database)
, according to the prod database and grepping through code there are no other FKs to / from this table, which seems suspicious. I've confirmed that in the prod, edge, and stage replicas this table is empty so it should be a quick change.