Migrating Courses to Learning Core
@Kyle McCormick @Dave Ormsbee (Axim)
2024-01-09
[kyle] I would like to get going on a course-Components-in-LC prototype. Anything to know before I dive in?
What are the areas of risk worth focusing on?
Just getting my hands dirty with LC
APIs are pretty low-level; have been adding new convenience methods and would like to add more. Could PR these separately from the POC
Try relying entirely on the new LearningCoreRuntime for verticals and see what breaks
Serving content outside the tree…. ?
[dave] taking a stab at:
intermediate ADR for serialization of library format
bringing courses that already exist in split into content libraries
2024-12-18
Talk is accepted!
Last time: modelling dynamic containers
Student State
CSM is too big to modify
New tables, 1-1 with those, gradually pull data over
K: If we have a new course key or a way of showing a course is entirely built with new data models, could student state go directly into those without double lookup?
Depends on how much backcompat we want
There is technically a student state client
But lots of plugin code pulls in CSM models and just uses those
Analytics too - data package exports
We are likely stuck with CSM for a while, and may even need to push CSM updates up to the new models
We could cut off CSM usage eventually, but it will take time
K: Courses v2 could have a separate place for student state, while v1 courses are still CSM backed
We would want authors to be able to port their courses back and forth between v1 and v2
Do we need to migrate off of CSM?
Moving content to LC is a precursor to moving student state off of CSM
One of the big advantages to LC is that storing state becomes much more efficient
Because of foreign keys to content rather than varchar keys – much smaller indexes
Also ability to reference versions is useful
But as long as we’re maintaining write compatibility (i.e. writing out to CSM as well), we don’t get these benefits.
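A rough sketch of the indexing point above, using SQLite in place of MySQL purely for illustration. The table and column names (component, state_by_key, state_by_fk) and the sample usage key are hypothetical, not the actual CSM or Learning Core schemas. It contrasts a CSM-style row keyed by a long string with a row that references content through an integer foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    -- Hypothetical content table: one row per component.
    CREATE TABLE component (
        id INTEGER PRIMARY KEY,
        usage_key TEXT NOT NULL UNIQUE
    );
    -- CSM-style: every state row repeats the long string key.
    CREATE TABLE state_by_key (
        student_id INTEGER NOT NULL,
        usage_key TEXT NOT NULL,
        state TEXT NOT NULL,
        UNIQUE (student_id, usage_key)
    );
    -- FK-style: every state row stores a small integer instead.
    CREATE TABLE state_by_fk (
        student_id INTEGER NOT NULL,
        component_id INTEGER NOT NULL REFERENCES component(id),
        state TEXT NOT NULL,
        UNIQUE (student_id, component_id)
    );
""")

usage_key = "block-v1:DemoX+Demo+2024+type@problem+block@abc123"
conn.execute("INSERT INTO component (id, usage_key) VALUES (1, ?)", (usage_key,))
conn.execute("INSERT INTO state_by_key VALUES (42, ?, '{}')", (usage_key,))
conn.execute("INSERT INTO state_by_fk VALUES (42, 1, '{}')")

# Bytes of content-key material that each unique index must hold per row.
key_bytes_varchar = len(usage_key.encode("utf-8"))
key_bytes_fk = 8  # assumption: a 64-bit integer key

# The foreign key also rejects state that points at nonexistent content.
try:
    conn.execute("INSERT INTO state_by_fk VALUES (42, 999, '{}')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The per-row difference looks small, but it is multiplied by every student-block pair, which is where the index savings would come from.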
Paths:
0. Keep super-complete compatibility by writing CSM updates up to LC
1. Keep complete compatibility by writing LC updates down to CSM
This would break plugins that write directly to CSM
2. Keep CSM as a read-only fallback if things aren’t found in new table
This would break plugins that read directly from CSM
Maybe ProxyModel to maintain the API interface that plugins expect, without actually storing things in CSM?
3. Kill CSM altogether (long term goal)
Handful of other tables also have a lot of student state references to content, e.g. grades storage, completion.
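A minimal sketch of path 2 above (CSM as a read-only fallback), with plain dicts standing in for the two tables. The StateStore class and its method names are hypothetical and exist only to show the lookup order; nothing here is an existing edx-platform API.

```python
from typing import Optional


class StateStore:
    """Hypothetical two-tier lookup: new table first, CSM as read-only fallback."""

    def __init__(self, new_table: dict, csm_table: dict):
        self.new_table = new_table
        self.csm_table = csm_table

    def get(self, student_id: int, usage_key: str) -> Optional[dict]:
        key = (student_id, usage_key)
        if key in self.new_table:
            return self.new_table[key]
        # Not found in the new table: fall back to the legacy CSM row, if any.
        return self.csm_table.get(key)

    def set(self, student_id: int, usage_key: str, state: dict) -> None:
        # Writes only ever go to the new table; CSM is never modified.
        self.new_table[(student_id, usage_key)] = state


csm = {(1, "problem-a"): {"attempts": 2}}
store = StateStore(new_table={}, csm_table=csm)

before = store.get(1, "problem-a")          # served from CSM
store.set(1, "problem-a", {"attempts": 3})  # lands in the new table only
after = store.get(1, "problem-a")           # served from the new table
```

After the write, the CSM row still says two attempts, which is exactly the stale view a plugin reading CSM directly would get.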
Old Notes
Talk Proposal
Title
TBD
Description (<500 words)
TBD
Type
45 minute Talk
Target Audience
Open edX developers. Assumes familiarity with core Open edX authoring and learning features, including XBlock. Assumes general knowledge of Python, Django, SQL.
Proposal
TBD
Rough Talk Outline
What are ModuleStore, ContentStore? How do we store content in them today? Why did we do it like that?
What are the pitfalls of our current system?
Mongo required → more overhead for site operators, more upgrades for maintainers to do.
Tightly coupled to other edx-platform code → bugs and complexity for core developers.
No relational features like foreign keys, joins, constraints, etc. → fewer correctness guarantees and worse performance.
Assumes everything is a course or part of a course → either need to hack around that (libraries v1) or sacrifice potentially interesting features (standalone units)
Huge structure document → slow reads
Full course traversal required for many computations → limits on what we can efficiently query about any given piece of content
Assumes XBlock at every level → features at any level of the content hierarchy are shoe-horned into the XBlock view of the world, which is in some ways too constraining (assumes Python, assumes in-process, etc.), in other ways not constraining enough (always able to define get_children, access control, etc. using completely arbitrary Python), and often both at the same time.
XBlocks can do anything and everything → variety of problems listed above
How are we working around the pitfalls today?
Block Transformers → More efficient reads at any level of the course
Downside: ModuleStore wackiness * Graph Theory → Sad Developers
Downside: Parallel implementations of content sequencing and access control → Bugs!
Downside: Still no relational guarantees
Learning Sequences → Super efficient reads and relational guarantees for sections and subsections
Downside: Does not go below the subsection level
Downside: Parallel implementations of content sequencing and access control → Bugs!
Downside: Learning-only. Cannot be used for authoring.
Be very very careful → Only a handful of developers trust themselves to work in code that touches ModuleStore and ContentStore. We mostly don’t change it. When we do, we proceed very cautiously and slowly, but still break things sometimes.
What’s Learning Core, how does it address these pitfalls, and how are we using it today?
Ground-up replatforming of Open edX learning content
(Explain more)
(Explain how it handles the pitfalls above)
(Explain Content Libraries V2)
How will we migrate content?
Concept: Slicing the problem Horizontally vs Vertically
Vertical method:
Make a minimal top-to-bottom (course-to-component) working reimplementation of courses in Learning Core. OK if some features are missing.
Allow experimental migration of courses which match that constrained feature set.
Iteratively add features over time, allowing migration of more and more courses
Finally, once the desired features are all migrated, deprecate the remaining features and remove ModuleStore & ContentStore
(This is how we did the Learning MFE migration)
Horizontal method:
Choose a feature of courseware / level of the course tree, and reimplement it fully for all courses.
For example: migrate files & uploads, then components, then units, then sections & subsections, then courses.
Hybrid method (planned as of time of writing):
Use horizontal method for certain low-hanging features (files & uploads, components)
Use vertical method for remaining features (unit and above).
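One way to picture the vertical method's "constrained feature set" gate is a check on the block types a course uses. This is only a sketch: the supported set and the function names below are assumptions for illustration, and a real check would likely look at more than block types.

```python
# Hypothetical feature set covered by the first top-to-bottom slice.
SUPPORTED_BLOCK_TYPES = frozenset(
    {"course", "chapter", "sequential", "vertical", "html", "problem", "video"}
)


def unsupported_block_types(course_block_types, supported=SUPPORTED_BLOCK_TYPES):
    """Return the block types that would prevent migrating this course, sorted."""
    return sorted(set(course_block_types) - set(supported))


def can_migrate(course_block_types, supported=SUPPORTED_BLOCK_TYPES):
    """A course is eligible only if everything it uses is already supported."""
    return not unsupported_block_types(course_block_types, supported)


simple_course = ["course", "chapter", "sequential", "vertical", "html", "problem"]
complex_course = simple_course + ["split_test", "library_content"]

simple_ok = can_migrate(simple_course)
complex_ok = can_migrate(complex_course)
blockers = unsupported_block_types(complex_course)
```

Iteratively adding features then amounts to growing the supported set, which makes more courses eligible over time.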
Additional Notes
TBD
2024-11-20
Sumac
Learning Core is getting rolled out to prod for the first time
LC Components, Collections, Assets for Libraries v2
Teak
LC Units, Sections, and Subsections for Libraries v2
Migration path for Libraries v1->v2
Ulmo
Remove Libraries v1
Standalone assets (files and uploads) for Libraries v2
Standalone assets (files and uploads) for Courses
LearningPackage for each course
LearningCoreContentstore
Slow, major migration (~1TB for 2U?)
Rollout questions… all-at-once, or mixed mode?
Benefits of asset conversion would be:
Cheaper storage
Faster/easier querying (particularly if we hook it up to a search backend).
May cause higher latency if we maintain the same URLs
Verawood
LC Components for Libraries
Components with children?
Some in edx-platform
a/b test, randomize, etc
Each of these is a Selector
Selected content is a Variant (some list of PubEnts)
selector.get_or_create_variant_for_user(args) -> Variant
Is this a generalization of learning_sequences?
Kinda, but not really. LSeqs is built as a pipeline of processors which can hide or remove content. The intersection of the resulting sets is the user's outline.
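A minimal sketch of the Selector / Variant idea, assuming a "randomize" selector that picks a fixed number of children per user. Only get_or_create_variant_for_user comes from the notes; the class names, fields, and the choice of user_id as the sole argument are hypothetical, and an in-memory dict stands in for whatever table would persist variants.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """The content selected for one user: an ordered list of entity keys."""
    entity_keys: tuple


@dataclass
class RandomizeSelector:
    """Hypothetical selector that shows each user `count` of the candidates."""
    candidates: list
    count: int
    _variants: dict = field(default_factory=dict, init=False, repr=False)

    def get_or_create_variant_for_user(self, user_id):
        # Create the variant once, then always return the stored one so the
        # user keeps seeing the same selection on later visits.
        if user_id not in self._variants:
            picked = random.sample(self.candidates, self.count)
            self._variants[user_id] = Variant(tuple(picked))
        return self._variants[user_id]


selector = RandomizeSelector(candidates=["p1", "p2", "p3", "p4"], count=2)
first = selector.get_or_create_variant_for_user(user_id=7)
second = selector.get_or_create_variant_for_user(user_id=7)
```

An A/B test selector would have the same interface but choose among whole groups of entities rather than sampling individual ones.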
Some outside? Do we support? Or deprecate customization of get_children?
Do not break callers of get_children
Do deprecate the ability to customize it, though
End result is that get_children is the responsibility of the runtime / edx-platform:
class XBlock:
    def get_children(...):
        yield from self.runtime.get_children(self.usage_key)
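A self-contained sketch of that delegation, assuming the runtime owns the parent-to-children mapping. Block and DictRuntime are stand-ins written for this example; they are not the real XBlock or runtime classes.

```python
class DictRuntime:
    """Hypothetical runtime that decides what each block's children are."""

    def __init__(self, tree):
        self.tree = tree  # maps a parent usage key to its child usage keys

    def get_children(self, usage_key):
        for child_key in self.tree.get(usage_key, []):
            yield Block(self, child_key)


class Block:
    """Stand-in for XBlock: it cannot customize how children are computed."""

    def __init__(self, runtime, usage_key):
        self.runtime = runtime
        self.usage_key = usage_key

    def get_children(self):
        # Callers keep working, but the answer always comes from the runtime.
        yield from self.runtime.get_children(self.usage_key)


runtime = DictRuntime({"unit1": ["html1", "problem1"]})
unit = Block(runtime, "unit1")
child_keys = [child.usage_key for child in unit.get_children()]
leaf_children = list(Block(runtime, "html1").get_children())
```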
Ideas
Just migrate leaf components
get_item(leaf_block) → LC
get_item(parent_block) → splitmongo
Implement Unit and below at LC level
Thing to watch out for: how to juggle the two different runtimes used for field data persistence.
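The first idea above (leaf components from LC, parents from splitmongo) could be sketched as a routing layer in front of two backends. Everything here is hypothetical: the HybridStore class, the leaf type list, and the dict-backed stores are placeholders, not modulestore code.

```python
# Assumption: which block types count as leaf components.
LEAF_BLOCK_TYPES = frozenset({"html", "problem", "video"})


class HybridStore:
    """Hypothetical get_item that routes by block type."""

    def __init__(self, lc_store: dict, split_store: dict):
        self.lc_store = lc_store        # stands in for Learning Core
        self.split_store = split_store  # stands in for splitmongo

    def backend_for(self, block_type: str) -> str:
        return "lc" if block_type in LEAF_BLOCK_TYPES else "split"

    def get_item(self, block_type: str, block_id: str):
        if self.backend_for(block_type) == "lc":
            return self.lc_store[(block_type, block_id)]
        return self.split_store[(block_type, block_id)]


store = HybridStore(
    lc_store={("problem", "p1"): {"source": "learning_core"}},
    split_store={("vertical", "u1"): {"source": "splitmongo"}},
)
leaf = store.get_item("problem", "p1")
parent = store.get_item("vertical", "u1")
```

The sketch sidesteps the concern noted above: each backend would bring its own runtime for field data persistence, and a single request may touch both.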
Misc
Eventually
Leaf blocks can define views with arbitrary python
Parent blocks (containers?) are declarative, an external system looks at the rules they declare to determine the course tree
2024-11-13
Libraries prototype: https://github.com/openedx/edx-platform/pull/35758
Background assumption:
Remaining shims?
Some layer of basic shimming
LearningCoreModuleStore - thinnish shim layer
80/20
At least one overlap release
Progressive (course-at-a-time) cutover
Long term: No Mongo
Remove Mongo first? (via SplitDjangoModuleStore)
Issues: Latency of S3 is much less predictable than Mongo
Issues: Length of data migration
Because V1 content libraries query the structure document at different versions, we’d need to move a ton of structure docs over
But, v1 content libraries will be gone by Ulmo
Issue: We are still reading CourseBlocks (the root ones) from Old Mongo
Current state
Active Versions is read from and written to both Mongo and MySQL
MySQL is backfilled
Pruning would need to be ported over
Latency is a worry here. We could do more caching, but it would increase the memory footprint.
Structure docs are in Mongo
Definition docs are in Mongo
Or remove Mongo along with ModuleStore removal?
Standalone items:
Files and uploads
We can emulate these API promises fairly easily.
Vertical or horizontal migration?
Vertical: Top-down, one entire course at a time
Horizontal: All components at once, then all units at once, etc. up the tree
Example: Components become backed by Learning Core. modulestore still exists, but get_item delegates to LC when it’s a component.
What can be broken? Talk to Jenna. For example:
inheriting defaults
FBE
No breakage > Intentional breakage with DEPR > unintentional breakage
Multiple course runs in the same learning package?
This would be ideal.
Prototype components in Learning Core
Transactions.
Mongo commits everything immediately
MySQL commits it at the end of the request
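A toy simulation of why the commit-timing difference matters when one request writes to both stores. The two store classes and handle_request are invented for this sketch; they only mimic "commit immediately" versus "commit at the end of the request" and do not talk to Mongo or MySQL.

```python
class ImmediateStore:
    """Mongo-like: every write is visible as soon as it is made."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


class RequestScopedStore:
    """MySQL-like: writes are buffered until the request commits."""

    def __init__(self):
        self.data = {}
        self._pending = {}

    def write(self, key, value):
        self._pending[key] = value

    def commit(self):
        self.data.update(self._pending)
        self._pending.clear()

    def rollback(self):
        self._pending.clear()


def handle_request(mongo, mysql, fail_midway):
    """Write to both stores; optionally fail before the request finishes."""
    try:
        mongo.write("structure", "v2")
        mysql.write("active_version", "v2")
        if fail_midway:
            raise RuntimeError("request failed after both writes")
        mysql.commit()
    except RuntimeError:
        mysql.rollback()


mongo, mysql = ImmediateStore(), RequestScopedStore()
handle_request(mongo, mysql, fail_midway=True)
mongo_after_failure = dict(mongo.data)  # the Mongo-like write survived
mysql_after_failure = dict(mysql.data)  # the MySQL-like write was rolled back
```

Moving content into the relational store would put both writes under the same transaction, so a failed request would leave neither behind.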
CourseOverviews, Block Transformers, Learning Sequences