courseware_studentmodule Table Refactoring
Overview
The courseware_studentmodule
database table in edx-platform contains the current state of every student in every piece of stateful content (questions, completion checkboxes, etc.) in every course in the system. On a site the size of edx-platform with the current table design, this adds up to several terabytes of data. Additionally, the table design uses far too many strings as indexed ID fields, which both bloats the table size and hurts index performance. The resulting operational burden has delayed the upgrade of the edx.org LMS to MySQL 8 by a month and counting, and other large site operators may start to encounter similar growing pains. It’s time to refactor this table.
Suggested Improvements
This isn’t yet a complete list, just the most potentially impactful suggestions seriously proposed so far.
At least on edx.org, there’s a redundant index (
courseware_studentmodule_42ff452e
). Remove this.Replace indexed varchar fields with integer foreign keys to new entity tables. (
module_id
andmodule_type
are the top candidates.). This should dramatically improve the table's size and performance, although it would be a difficult migration for large sites and new systems would be needed to keep the entity tables current as course content is added and updated.Partition into multiple tables, perhaps by
module_type
. There wouldn’t need to be a separate table for each possible type (especially as it’s extensible), but splitting out the few most common ones could help a lot.
Complications
The table is massive. Even simple schema changes on an idle replica with no query activity can take hours or days for edx.org.
There is a very high volume of activity on the table at all times, making some schema change operations difficult even if the bulk of the work can be done in the background or on a replica.
The existing indexes are mostly across string fields, and their poor performance makes schema changes even slower. But they can’t be even temporarily dropped on the primary DB without crippling site performance.
There is a related (also large)
courseware_studentmodulehistory
table that will also need to be updated, in a properly coordinated manner.Any changes made will impact thousands of Open edX site operators across the world, but 2U engineers essentially have to drive the implementation as the migration plan will need to be designed around edx.org operational considerations to have any chance of successful deployment on that site. As it is, even fairly trivial schema changes are proving nearly impossible to deploy.
Stakeholders
2U teams:
Arch-BOM - maintainers of edx-platform as a whole, and home of a few of the most senior Open edX developers
Aurora - maintainers of the
courseware
Django application within edx-platformTNL - primary developers of new features and refactoring involving XBlocks, modulestore, and learning content management
SRE - operational responsibilities for the production LMS/Studio databases on edx.org
Axim: custodians of the Open edX project as a whole
Other site operators (OpenCraft, Edly, edunext, etc.): will all need to perform the ultimately implemented migration, and will be very sensitive to any non-standard manual steps
Percona: DBA consultants working with 2U to assist with database upgrades and optimizations
Communications
Any changes with the potential to impact all Open edX site operators should be posted to Open edX discussions with sufficient time for review and feedback. One discussion thread for the project may be sufficient.
Text chats for which the Discourse latency is too high should go in a public channel in the Open edX Slack workspace unless confidential 2U information is involved.
@Dave Ormsbee (Axim) wants to be involved
Aurora and TNL don’t have time to lead the implementation, but want to be kept informed and given opportunities to provide feedback.
References
The table is defined by the model in https://github.com/openedx/edx-platform/blob/master/lms/djangoapps/courseware/models.py . The StudentModuleHistory model is also defined in this file.
The primary interface for accessing the table is defined in https://github.com/openedx/edx-platform/blob/ed56222929eddc64859ccb2037b81a414655fb0d/lms/djangoapps/courseware/user_state_client.py#L63 . It was designed to support different storage backends, but only the MySQL table has ever been used.
Other Notes
“So one note when you create the lookup table for module_id: you'll probably get a lot of errors if you lazy-write the values, since there are going to potentially be many students hitting it and courseware transactions can be very slow. So you might want to have Studio do the writing at the time the content is first created. (There may be some weird edge cases like CCX and other things I'm forgetting, so lazy-fallback-with-logging is probably also a good idea).” - @Dave Ormsbee (Axim)
Site operators can add custom XBlocks, which adds to the possible values of
module_type
. So a unique table for each possible type isn’t possible.“FWIW, module_type is obviously really unevenly distributed. I don't know much about how partitions work in MySQL, but intuitively, it feels like partitioning by some hash of the user_id would give the best distribution of load and also the least number of queries that have to cross partitions.” - @Dave Ormsbee (Axim)
Current Status
Selecting a tech lead to kick off coordination of an implementation plan. Likely to be a member of Arch-BOM.