courseware_studentmodule Table Refactoring

Overview

The courseware_studentmodule database table in edx-platform contains the current state of every student in every piece of stateful content (questions, completion checkboxes, etc.) in every course in the system. On a site the size of edx-platform with the current table design, this adds up to several terabytes of data. Additionally, the table design uses far too many strings as indexed ID fields, which both bloats the table size and hurts index performance. The resulting operational burden has delayed the upgrade of the edx.org LMS to MySQL 8 by a month and counting, and other large site operators may start to encounter similar growing pains. It’s time to refactor this table.

Suggested Improvements

This isn’t yet a complete list, just the most potentially impactful suggestions seriously proposed so far.

  1. At least on edx.org, there’s a redundant index (courseware_studentmodule_42ff452e). Remove this.

  2. Replace indexed varchar fields with integer foreign keys to new entity tables. (module_id and module_type are the top candidates.). This should dramatically improve the table's size and performance, although it would be a difficult migration for large sites and new systems would be needed to keep the entity tables current as course content is added and updated.

  3. Partition into multiple tables, perhaps by module_type. There wouldn’t need to be a separate table for each possible type (especially as it’s extensible), but splitting out the few most common ones could help a lot.

Complications

  • The table is massive. Even simple schema changes on an idle replica with no query activity can take hours or days for edx.org.

  • There is a very high volume of activity on the table at all times, making some schema change operations difficult even if the bulk of the work can be done in the background or on a replica.

  • The existing indexes are mostly across string fields, and their poor performance makes schema changes even slower. But they can’t be even temporarily dropped on the primary DB without crippling site performance.

  • There is a related (also large) courseware_studentmodulehistory table that will also need to be updated, in a properly coordinated manner.

  • Any changes made will impact thousands of Open edX site operators across the world, but 2U engineers essentially have to drive the implementation as the migration plan will need to be designed around edx.org operational considerations to have any chance of successful deployment on that site. As it is, even fairly trivial schema changes are proving nearly impossible to deploy.

Stakeholders

  • 2U teams:

    • Arch-BOM - maintainers of edx-platform as a whole, and home of a few of the most senior Open edX developers

    • Aurora - maintainers of the courseware Django application within edx-platform

    • TNL - primary developers of new features and refactoring involving XBlocks, modulestore, and learning content management

    • SRE - operational responsibilities for the production LMS/Studio databases on edx.org

  • Axim: custodians of the Open edX project as a whole

  • Other site operators (OpenCraft, Edly, edunext, etc.): will all need to perform the ultimately implemented migration, and will be very sensitive to any non-standard manual steps

  • Percona: DBA consultants working with 2U to assist with database upgrades and optimizations

Communications

References

Other Notes

  • “So one note when you create the lookup table for module_id: you'll probably get a lot of errors if you lazy-write the values, since there are going to potentially be many students hitting it and courseware transactions can be very slow. So you might want to have Studio do the writing at the time the content is first created. (There may be some weird edge cases like CCX and other things I'm forgetting, so lazy-fallback-with-logging is probably also a good idea).” - @Dave Ormsbee (Axim)

  • Site operators can add custom XBlocks, which adds to the possible values of module_type. So a unique table for each possible type isn’t possible.

  • “FWIW, module_type is obviously really unevenly distributed. I don't know much about how partitions work in MySQL, but intuitively, it feels like partitioning by some hash of the user_id would give the best distribution of load and also the least number of queries that have to cross partitions.” - @Dave Ormsbee (Axim)

Current Status

Selecting a tech lead to kick off coordination of an implementation plan. Likely to be a member of Arch-BOM.