Migrating Courses to Learning Core
@Kyle McCormick @Dave Ormsbee (Axim)
2025-05-15
@Braden MacDonald opened a PR for OutlineRoot
feat: OutlineRoot model by bradenmacdonald · Pull Request #316 · openedx/openedx-learning
Why did we want to put CourseRun in edx-platform?
Dave: dependencies we’d need to pull into learning core to have CourseRun there…
cohorts
grading policy
etc
Dave: could see more things moving into learning core over time
Braden: What about a core CourseRun model in learning core, which edx-platform and other plugins could hang things off of (e.g. days_early_for_beta)
Braden: Learning Packages w.r.t. re-runs
In theory, a LP could have several re-runs
Hypothetically… we have a LP with a course run
And we copy the outline root
But we probably want to share the sections, subsections, etc
But if I want to edit some content in the re-run, then how does that not modify the original course?
Do we need to deepcopy the entire outline immediately upon rerun?
Dave: An LP makes sense when it’s the same set of authors
Kyle: thoughts about the outline
Option 1: Do a re-run, copy the whole outline immediately.
Braden: Right now each component has one pointer to a current Draft and a current Published. Could store for each branch, each representing a different run (draft_run1 pointer, publish_run1 pointer; draft_run2 pointer, published_run2 pointer).
Kyle: Should there be an fkey to PE that is the branch?
instead of version_num being unique, (version_num, branch) is unique
this makes cloning/reruns very cheap
or we could hopscotch and mix the history, but that requires more changes
Question: Do we want to have a separate namespace per course-run? This is not the case if we do branch encoding in the version.
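A minimal sketch of the "(version_num, branch) is unique" idea, with hypothetical Branch and EntityVersion models standing in for the real openedx-learning schema:

    from django.db import models

    class Branch(models.Model):
        # Hypothetical: one branch per course run within a learning package.
        learning_package_uuid = models.UUIDField()
        name = models.CharField(max_length=255)  # e.g. "run-2024"

    class EntityVersion(models.Model):
        # Hypothetical stand-in for PublishableEntityVersion, with branch awareness.
        entity_uuid = models.UUIDField()
        branch = models.ForeignKey(Branch, on_delete=models.CASCADE)
        version_num = models.PositiveIntegerField()

        class Meta:
            constraints = [
                # Uniqueness is per (entity, branch, version_num) rather than per
                # (entity, version_num), so a re-run gets its own version sequence
                # without copying any rows up front.
                models.UniqueConstraint(
                    fields=["entity_uuid", "branch", "version_num"],
                    name="uniq_entity_branch_version",
                ),
            ]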
Braden & Dave: Would be nice to have a working prototype where we can cobble this stuff together:
Kyle: should have a sandbox
Later:
Alternate proposal from @Braden MacDonald :
What if we try for a more git-like model where our *Version models (e.g. ComponentVersion, PublishableEntityVersion) no longer point back to their unversioned model (Component, PublishableEntity), but instead just point to their previous *Version. Then, we can have multiple instances of a Component (with different keys) pointing at the same data (same *Version). They would have the same history, but if you modify one it would “fork” the versions, and diverge from the other.
Example:
Component Text1 exists, pointing to ComponentVersion “anonv1”. You edit it, creating ComponentVersion “anonv2”, previous version “anonv1”. Component Text1’s Draft pointer points to “anonv2” and published points to nothing.
Now you duplicate it (or re-run the course or make a CCX variant, or whatever). Component Text2 now exists (different key, same learning package) and its Draft pointer also points to “anonv2”. Now both Text1 and Text2 share the same data and same version history, but they will diverge if you edit them.
When we create a new course run from an existing course run, we would have to duplicate all the Components and Containers, but leave the *Version models alone. This makes it a much more efficient operation, because most of the data is in the *Version models, and it also means the full history is preserved.
As for keys, let’s say our CourseRun model has a “key_prefix” (or branch). Then if we have a library, course A run 2023, and course A run 2024 all in the same learning package, the library’s keys can be unprefixed, and the two course runs can have prefixes like “A2023:” and “A2024:” so that there are no collisions among the library and course runs - each has its own namespace without affecting the data model. (Of course you can just as easily make a namespace string field / db column on PublishableEntity if you want this to be more formal.) Example: opaque key “course-v2:org+A+2023:block:problem:p1” maps to PublishableEntity key “A2023:problemp1” and the exact same component in the other run with opaque key “course-v2:org+A+2024:block:problem:p1” maps to “A2024:problemp1”
Upsides of this approach: very fast, has nice copy-on-write semantics, separate namespaces per run, much simpler/cleaner history.
Downsides: pretty significant change to the data model, ComponentVersions would not be deleted when the Component is deleted (occasional cleanup process required if you want to delete them), version numbers would be incrementing in jumps based on the highest version used in the learning package, not the highest version used for that specific component.
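A rough sketch of the copy-on-write idea above, with hypothetical names and fields (not the current openedx-learning schema): each *Version row points only at its previous version, and any number of Components can point at the same version until one of them is edited.

    from django.db import models

    class ComponentVersion(models.Model):
        # Versions form a chain via previous_version instead of pointing back at a
        # single owning Component.
        previous_version = models.ForeignKey(
            "self", null=True, blank=True, on_delete=models.RESTRICT,
            related_name="next_versions",
        )
        data = models.JSONField(default=dict)

    class Component(models.Model):
        # Two Components (e.g. the same problem in run 2023 and run 2024) can share
        # pointers into the same version chain and thus the same history.
        key = models.CharField(max_length=500, unique=True)  # e.g. "A2024:problemp1"
        draft_version = models.ForeignKey(
            ComponentVersion, null=True, related_name="+", on_delete=models.RESTRICT)
        published_version = models.ForeignKey(
            ComponentVersion, null=True, related_name="+", on_delete=models.RESTRICT)

    def edit_component(component, new_data):
        # Copy-on-write: the old version row is left untouched (still referenced by
        # the other run), and only this Component's draft pointer moves, so the two
        # runs diverge from here on.
        new_version = ComponentVersion.objects.create(
            previous_version=component.draft_version, data=new_data)
        component.draft_version = new_version
        component.save(update_fields=["draft_version"])
        return new_version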
2025-05-06
Dave: update
implicit compile step for representing components and saving xblock fields
LC runtime doesn’t support parents and children
tie these realized fields to PEVersion rather than ComponentVersion. That way, anything with field data can use it
split modulestore does allow a “backend” other than Mongo; Dave is trying that
one row per course per published version of structure document
when we publish again, row gets replaced
(rather than pruning old structure docs)
do we need a draft version of the structure doc?
yes, for preview, but not necessarily for POC
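A minimal sketch of the “one row per course per published version” idea (hypothetical model and helper, not Dave’s actual experiment):

    from django.db import models

    class PublishedCourseStructure(models.Model):
        # Hypothetical: only the latest published structure document per course is
        # kept; republishing replaces the row, so old docs never pile up and never
        # need pruning.
        course_key = models.CharField(max_length=255, unique=True)
        version = models.CharField(max_length=64)  # e.g. the structure document id
        structure = models.JSONField()

    def save_published_structure(course_key, version, structure):
        # update_or_create replaces the existing row rather than adding a new one.
        obj, _created = PublishedCourseStructure.objects.update_or_create(
            course_key=str(course_key),
            defaults={"version": version, "structure": structure},
        )
        return obj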
Kyle: Data modelling
class Selector(PubEnt)
class SelectorVersion(PubEntVer)
    partition: Partition
    variants (reverse relation from Variant.selector_version)
class SplitTestVersion(SelectorVersion)
class ItemBankVersion(SelectorVersion)
    count: int
class GradeGateVersion(SelectorVersion)
    a hypothetical custom selector
    selects a child based on the user’s current overall grade
class Variant(Model) [ alt name: class Selection(Model) ? ]
    entity_list: EntityList
    selector_version: SelectorVersion
    group: Group|null
a variant is valid iff every non-null field of [selector_version, group] matches the query
otherwise, need to re-invoke the selection process to determine the Variant
the old matching Variant factors in when determining a new Variant
a new Variant may need to be generated if it does not exist
Partition
Group
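A sketch of the variant lookup rules above, assuming the Selector/SelectorVersion/Variant models as outlined and hypothetical helpers (get_user_group, find_previous_variant, select_entities):

    from django.db.models import Q

    def get_or_create_variant(selector_version, user):
        # Hypothetical helper: the user's group within the selector's partition,
        # or None if the partition does not apply.
        group = get_user_group(user, selector_version.partition)

        # A Variant matches iff every non-null field of (selector_version, group)
        # matches the query.
        variant = (
            selector_version.variants
            .filter(Q(group__isnull=True) | Q(group=group))
            .first()
        )
        if variant is not None:
            return variant

        # Otherwise, re-invoke the selection process. The previously matching
        # Variant factors into the new selection so learners keep seeing
        # consistent content across edits.
        previous = find_previous_variant(selector_version, user)         # hypothetical
        entity_list = select_entities(selector_version, user, previous)  # hypothetical
        return Variant.objects.create(
            selector_version=selector_version,
            group=group,
            entity_list=entity_list,
        )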
2025-04-29
What’s the most minimal MVP we can do for getting course content into learning core
Learning core backend for SplitModuleStore
class ComponentVersionXBlockData(Model):
    cv = OneToOneField(ComponentVersion)
    content_fields = JSONField()
    settings_fields = JSONField()
Generating this would need us to instantiate the xblock runtime
Up front as a part of migration?
Braden: Would this be a temp thing?
Dave: Field data split is long term thing.
Braden: How do we handle containers?
Kyle: Simple/dumb data hanging off the containers
Braden: Switch on a per-course basis?
Yes, course waffle flags maybe?
But it would be bad to do this on a per-user basis lol
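For the per-course switch, a minimal sketch assuming edx-platform’s CourseWaffleFlag utility (the flag name below is made up):

    from openedx.core.djangoapps.waffle_utils import CourseWaffleFlag

    # Hypothetical flag name; evaluated per course via waffle course overrides,
    # never per user.
    USE_LEARNING_CORE_BACKEND = CourseWaffleFlag(
        "contentstore.use_learning_core_backend", __name__
    )

    def uses_learning_core(course_key):
        return USE_LEARNING_CORE_BACKEND.is_enabled(course_key)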
Braden: Is the split shim readonly or read/write?
Dave: likes readonly
consensus on split-shim being read-only
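A sketch of what the read-only shim could look like (class and method names are illustrative, not an agreed interface):

    class ReadOnlyShimError(Exception):
        """Raised when legacy code tries to write through the shim."""

    class LearningCoreReadOnlyModuleStore:
        # Hypothetical: serves reads for courses converted to Learning Core and
        # refuses writes, so code paths that still try to mutate split fail loudly.

        def __init__(self, learning_core_api):
            self.lc = learning_core_api  # assumed Learning Core read API

        def get_item(self, usage_key, **kwargs):
            return self.lc.get_block(usage_key)  # hypothetical call

        def update_item(self, xblock, user_id, **kwargs):
            raise ReadOnlyShimError(f"refusing to write {xblock.location}: shim is read-only")

        def delete_item(self, usage_key, user_id, **kwargs):
            raise ReadOnlyShimError(f"refusing to delete {usage_key}: shim is read-only")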
Management command to convert.
Static assets?
Dave: not as important, but should be much simpler. Would like to get this in for MVP
Kyle: Question of what data models for course looks like?
the container that is the course outline, “supersection”
does this enforce 3-level hierarchy, or is it more flexible?
Braden: lean towards not enforcing the three-level hierarchy
Consensus on this.
the thing that covers all the other course stuff–policy, textbooks, assets, etc. Different models, combine to form the Learning Context.
Could have a Course model with ^ and a linked OutlineRoot model that has child Sections (or Subsections etc if not a 3-layer course)
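A rough sketch of how those two models might relate, with placeholder fields (not the schema in Braden’s PR):

    from django.db import models

    class OutlineRoot(models.Model):
        # openedx-learning side: the container that is the course outline; its
        # children can be Sections, or something shallower/deeper, since we are
        # not enforcing a three-level hierarchy.
        publishable_entity_uuid = models.UUIDField(unique=True)

    class CourseRun(models.Model):
        # edx-platform side: run-specific policy-ish data that plugins can hang
        # things off of (e.g. days_early_for_beta, grading policy).
        course_key = models.CharField(max_length=255, unique=True)
        title = models.CharField(max_length=255)
        start = models.DateTimeField(null=True, blank=True)
        end = models.DateTimeField(null=True, blank=True)
        outline_root_uuid = models.UUIDField()  # points at the OutlineRoot in the learning package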
Braden: We have punted on randomized containers
Let’s get Split Test Block working
What/how we want to present
The MVP
Mgmt command to convert Split course into LC
Readonly shim of Split for LMS (@Dave Ormsbee (Axim) )
OutlineRoot (openedx-learning) and CourseRun (edx-platform) models (@Braden MacDonald )
Static files
SplitTestBlock (@Kyle McCormick)
(developer mode)
2025-04-02
Pruning
We want to be able to prune a PEV. What is the implication on DraftChangeLogRecord and DraftSideEffect?
we must copy all data from pruned PEVs to the change records
give up on pruning. need to sketch out numbers, but the worst offender in modulestore was rewriting of structure docs and static assets
cost of static assets will be much less
hybrid pruning: leave the PEV, but delete the things that hang off of it. publishing would need to be able to have a flag for pruning
deeper pruning:
PublishLog has to stick around, but DraftChangeLog gets less useful over time.
Dave: Proposal for field data storage as a separate model that hangs off of ComponentVersion (maybe PublishableEntityVersion) and has separate fields for settings vs. definition scopes.
class ComponentVersionXBlockData(Model):
    cv = OneToOneField(ComponentVersion)
    content_fields = JSONField()
    settings_fields = JSONField()

# OR

class ComponentVersionXBlockData(Model):
    cv = OneToOneField(ComponentVersion)
    fields = JSONField()

# OR

class ComponentVersionContentData(Model):
    cv = OneToOneField(ComponentVersion)
    fields = JSONField()

class ComponentVersionSettingsData(Model):
    cv = OneToOneField(ComponentVersion)
    fields = JSONField()
More optimized for pulling an entire component’s data and/or pulling many components' data
Dave and Kyle agree on the desirability of this one, given that we want block-specific granular models to be built out
Kyle (counter-proposal):
class ComponentVersionXBlockData(Model):
    cv = FK(ComponentVersion)
    field_name = CharField()
    field_value = JSONField()
    is_content_scope = BooleanField()

    class Meta:
        unique_together = (("cv", "field_name"),)

More optimized for granular access
encroaches on a territory we’d like to see for specific xblocks
class ProblemBlockData(XBlockDataModelBase):
    cv = FK(ComponentVersion)
    # ... etc

class ProblemBlockResponse(Model):
    problem_block = FK(ProblemBlockData)
    ...
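To make the trade-off concrete, roughly how reads would differ under the two shapes (assuming the sketches above):

    # Dave's shape: one row per ComponentVersion, fields in two JSON blobs.
    # Pulling a whole component (or many components) is one indexed row each.
    data = ComponentVersionXBlockData.objects.get(cv=component_version)
    display_name = data.settings_fields.get("display_name")
    problem_xml = data.content_fields.get("data")

    # Kyle's shape: one row per (component version, field). A single field is a
    # cheap lookup, but assembling a whole component means many rows.
    rows = ComponentVersionXBlockData.objects.filter(cv=component_version)
    fields = {row.field_name: row.field_value for row in rows}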
2025-03-05
[kyle] Current thinking: after containers land in libraries, open a prototype branch where we try to get some course content, getting a hacky course running under Learning Core, e.g. single section → subsection → unit.
Still not clear where the switching layer would be. Current thought: Ignore modulestore for now.
Tentative prototype:
new table in the Django admin interface: course run, title, start date, end date, supersection fkey; the content comes from a learning context, with initial editing in the library
for backwards compatibility, just use existing keys, i.e. “course-v1”
should the new course run table point to Container or Supersection? (kyle: suggestion for a legacy course run shim that would inject the expected hierarchy)
collection for other associated content that’s not hanging off the root?
[kyle] Should record decisions that deprecate XBlock capabilities, pave way to remove student_view of sequential
2025-02-05
[kyle] start on prototype as mentioned below.
2025-01-09
[kyle] I would like to get going on a course-Components-in-LC prototype. Anything to know before I dive in?
What are the areas of risk worth focusing on?
Just getting my hands dirty with LC
APIs are pretty low-level, have been adding new convenience methods, would like to add more of those. Could PR these in separately from the POC
Trying entirely using the new LearningCoreRuntime for verticals and see what breaks
Serving content outside the tree…. ?
[dave] taking a stab at:
intermediate ADR for serialization of library format
bringing courses that already exist in split into content libraries
2024-12-18
Talk is accepted!
Last time: modelling dynamic containers
Student State
CSM is too big to modify
New tables, 1-1 with those, gradually pull data over
K: If we have a new course key or a way of showing a course is entirely built with new data models, could student state go directly into those without double lookup?
Depends on how much backcompat we want
There is technically a student state client
But lots of plugin code pulls in CSM models and just uses those
Analytics too - data package exports
We are likely stuck with CSM for a while, and may need to even push CSM updates up to the new models
We could cut off CSM usage eventually, but it will take time
K: Courses v2 could have a separate place for student state, while v1 courses are still CSM backed
We would want authors to be able to port their courses forward and back and forth v1<->v2
Do we need to migrate off of CSM?
Moving content to LC is a precursor to moving student state off of CSM
One of the big advantages to LC is that storing state becomes much more efficient
Because of foreign keys to content rather than varchar keys – much smaller indexes
Also ability to reference versions is useful
But as long as we’re maintaining write compatibility (i.e. writing out to CSM as well), we don’t get these benefits.
Paths:
0. Keep super-complete compatibility by writing CSM updates up to LC
1. Keep complete compatibility by writing LC updates down to CSM
This would break plugins that write directly to CSM
2. Keep CSM as a read-only fallback if things aren’t found in new table
This would break plugins that read directly from CSM APIs
Maybe ProxyModel to maintain the API interface that plugins expect, without actually storing things in CSM?
3. Kill CSM altogether (long term goal)
Handful of other tables also have a lot of student state references to content, e.g. grades storage, completion.
Old Notes
Talk Proposal
Title
TBD
Description (<500 words)
TBD
Type
45 minute Talk
Target Audience
Open edX developers. Assumes familiarity with core Open edX authoring and learning features, including XBlock. Assumes general knowledge of Python, Django, SQL.
Proposal
TBD
Rough Talk Outline
What are ModuleStore, ContentStore? How do we store content in them today? Why did we do it like that?
What are the pitfalls of our current system?
Mongo required → more overhead for site operators, more upgrades for maintainers to do.
Tightly coupled to other edx-platform code → bugs and complexity for core developers.
No relational features like foreign keys, joins, constraints, etc. → fewer correctness guarantees and worse performance.
Assumes everything is a course or part of a course → either need to hack around that (libraries v1) or sacrifice potentially interesting features (standalone units)
Huge structure document → slow reads
Full course traversal required for many computations → limits on what we can efficiently query about any given piece of content
Assumes XBlock at every level → Features at any level of the content hierarchy are shoe-horned into the XBlock view of the world, which in some ways is too constraining (assumes Python, assumes in-process, etc), sometimes not constraining enough (always able to define get_children, access control, etc. using completely arbitrary Python), and often both at the same time.
XBlocks can do anything and everything → variety of problems listed above
How are we working around the pitfalls today?
Block Transformers → More efficient reads at any level of the course
Downside: ModuleStore wackiness * Graph Theory → Sad Developers
Downside: Parallel implementations of content sequencing and access control → Bugs!
Downside: Still no relational guarantees
Learning Sequences → Super efficient reads and relational guarantees for sections and subsections
Downside: Does not go below the subsection level
Downside: Parallel implementations of content sequencing and access control → Bugs!
Downside: Learning-only. Cannot be used for authoring.
Be very very careful → Only handful of developers trust themselves to work in code that touches ModuleStore and ContentStore. We mostly don’t change it. When we do, we proceed very cautiously and slowly, but still break things sometimes.
What’s Learning Core, how does it address these pitfalls, and how are we using it today?
Ground-up replatforming of Open edX learning content
(Explain more)
(Explain how handles the pitfalls above)
(Explain Content Libraries V2)
How will we migrate content?
Concept: Slicing the problem Horizontally vs Vertically
Vertical method:
Make a minimal top-to-bottom (course-to-component) working reimplementation of courses in Learning Core. OK if some features are missing.
Allow experimental migration of courses which match that constrained feature set.
Iteratively add features over time, allowing migration of more and more courses
Finally, once the desired features are all migrated, deprecate the remaining features and remove ModuleStore & ContentStore
(This is how we did the Learning MFE migration)
Horizontal method:
Choose a feature of courseware / level of the course tree, and reimplement it fully for all courses.
For example: migrate files & uploads, then components, then units, then sections & subsections, then courses.
Hybrid method (planned as of time of writing):
Use horizontal method for certain low-hanging features (files & uploads, components)
Use vertical method for remaining features (unit and above).
Additional Notes
TBD
2024-11-20
Sumac
Learning Core is getting rolled out to prod for the first time
LC Components, Collections, Assets for Libraries v2
Teak
LC Units, Sections in Subsections for Libraries v2
Migration path for Libraries v1->v2
Ulmo
Remove Libraries v1
Standalone assets (files and uploads) for Libraries v2
Standalone assets (files and uploads) for Courses
LearningPackage for each course
LearningCoreContentstore
Slow, major migration (~1TB for 2U?)
Rollout questions… all-at-once, or mixed mode?
Benefits of asset conversion would be:
Cheaper storage
Faster/easier querying (particularly if we hook it up to a search backend).
May cause higher latency if we maintain the same URLs
Verawood
LC Components for Libraries
Components with children?
Some in edx-platform
a/b test, randomize, etc
Each of these is a Selector
Selected content is a Variant (some list of PubEnts)
selector.get_or_create_variant_for_user(args) -> Variant
Is this a generalization of learning_sequences?
Kinda but not really. LSeqs is built as a pipeline of processors which can hide or remove content. Intersection of the resulting sets is the user’s outline
Some outside? Do we support? Or deprecate customization of get_children?
Do not break callers of get_children
Do deprecate the ability to customize it, though
End result is that get_children is the responsibility of the runtime / edx-platform:
class XBlock:
    def get_children(self):
        yield from self.runtime.get_children(self.usage_key)
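And a sketch of the runtime side of that contract, with a hypothetical lookup of published container contents (not an existing edx-platform API):

    class LearningCoreRuntimeSketch:
        # Hypothetical: the runtime, not the block, decides what the children are,
        # by reading the published container contents from Learning Core.
        def get_children(self, usage_key):
            for child_key in self._get_published_child_keys(usage_key):  # hypothetical
                yield self.get_block(child_key)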
Ideas
Just migrate leaf components
get_item(leaf_block) → LC
get_item(parent_block) → splitmongo
Implement Unit and below at LC level
Thing to watch out for: how to juggle the two different runtimes used for field data persistence.
Misc
Eventually
Leaf blocks can define views with arbitrary python
Parent blocks (containers?) are declarative, an external system looks at the rules they declare to determine the course tree
2024-11-13
Libraries prototype: https://github.com/openedx/edx-platform/pull/35758
Background assumption:
Remaining shims?
Some layer of basic shimming
LearningCoreModuleStore - thinnish shim layer
80/20
At least one overlap release
Progressive (course-at-a-time) cutover
Long term: No Mongo
Remove Mongo first? (via SplitDjangoModuleStore)
Issues: Latency of S3 is much less predictable than Mongo
Issues: Length of data migration
Because V1 content libraries queries the structure document at different versions, we’d need to move a ton of structure docs over
But, v1 content libraries will be gone by Ulmo
Issue: We are still reading CourseBlocks (the root ones) from Old Mongo
Current state
Active Versions is read from and written to both Mongo and MySQL
MySQL is backfilled
Pruning would need to be ported over
Latency is a worry here. We could do more caching, but it would increase the memory footprint.
Structure docs are in Mongo
Definition docs are in Mongo
Or remove Mongo along with ModuleStore removal?
Standalone items:
Files and uploads
We can emulate these API promises fairly easily.
Vertical or horizontal migration?
Vertical: Top-down, one entire course at a time
Horizontal: Components all at a time, units all at a time, etc. up the tree
Example: Components become backed by Learning Core. modulestore still exists, but get_item delegates to LC when it’s a component.
What can be broken? Talk to Jenna. For example:
inheriting defaults
FBE
No breakage > Intentional breakage with DEPR > unintentional breakage
Multiple course runs in the same learning package?
This would be ideal.
Prototype components in Learning Core
Transactions.
Mongo commits everything immediately
MySQL commits it at the end of the request
CourseOverviews, Block Transformers, Learning Sequences