Migrating Courses to Learning Core

@Kyle McCormick @Dave Ormsbee (Axim)

2025-05-15

  • @Braden MacDonald opened a PR for OutlineRoot

    • feat: OutlineRoot model by bradenmacdonald · Pull Request #316 · openedx/openedx-learning

    • Why did we want to put CourseRun in edx-platform?

      • Dave: dependencies we’d need to pull into learning core to have CourseRun there…

        • cohorts

        • grading policy

        • etc

      • Dave: could see more things moving into learning core over time

      • Braden: What about a core CourseRun model in learning core, which edx-platform and other plugins could hang things off of (e.g. days_early_for_beta)

    • Braden: Learning Packages w.r.t. re-runs

      • In theory, a LP could have several re-runs

        • Hypothetically… we have a LP with a course run

        • And we copy the outline root

        • But we probably want to share the sections, subsections, etc

        • But if I want to edit some content in the re-run, then how does that not modify the original course?

        • Do we need to deepcopy the entire outline immediately upon rerun?

      • Dave: An LP makes sense when it’s the same set of authors

    • Kyle: thoughts about the outline

      • Option 1: Do a re-run, copy the whole outline immediately.

        • Braden: Right now each component has one pointer to a current Draft and a current Published. Could store a pair of pointers for each branch, with each branch representing a different run (draft_run1 pointer, published_run1 pointer; draft_run2 pointer, published_run2 pointer).

        • Kyle: Should there be an fkey to PE that is the branch?

          • instead of version_num being unique, (version_num, branch) is unique

            • this makes cloning/reruns very cheap

          • or could hopscotch and mix the history, which requires more changes

    • Question: Do we want to have a separate namespace per course-run? This is not the case if we do branch encoding in the version.

    • Braden & Dave: Would be nice to have a working prototype where we can cobble this stuff together:

      • Kyle: should have a sandbox
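
A minimal sketch of the branch idea discussed above, using plain Python dataclasses rather than the real Learning Core models. `VersionStore`, `add_version`, and `rerun` are hypothetical names, and the in-memory dicts stand in for database tables. It only illustrates why making (version_num, branch) unique per entity could make a re-run cheap: the new branch starts as a copy of draft pointers, not of content.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    entity_key: str
    branch: str
    version_num: int
    data: str


@dataclass
class VersionStore:
    # Uniqueness is on (entity_key, branch, version_num), not version_num alone.
    versions: dict = field(default_factory=dict)
    # One draft pointer per (entity_key, branch).
    drafts: dict = field(default_factory=dict)

    def add_version(self, entity_key, branch, data):
        head = self.drafts.get((entity_key, branch))
        next_num = head.version_num + 1 if head else 1
        key = (entity_key, branch, next_num)
        if key in self.versions:
            raise ValueError(f"duplicate version {key}")
        version = Version(entity_key, branch, next_num, data)
        self.versions[key] = version
        self.drafts[(entity_key, branch)] = version
        return version

    def rerun(self, source_branch, new_branch):
        """Create a new branch by copying draft pointers only."""
        for (entity_key, branch), version in list(self.drafts.items()):
            if branch == source_branch:
                self.drafts[(entity_key, new_branch)] = version
```

After `rerun("run1", "run2")`, both runs point at the same version rows. An edit in either run adds a row under that run's branch, so the same version_num can exist once per branch without colliding.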

 

Later:

  • Alternate proposal from @Braden MacDonald :

    • What if we try for a more git-like model where our *Version models (e.g. ComponentVersion, PublishableEntityVersion) no longer point back to their unversioned model (Component, PublishableEntity), but instead just point to their previous *Version. Then, we can have multiple instances of a Component (with different keys) pointing at the same data (same *Version). They would have the same history, but if you modify one it would “fork” the versions, and diverge from the other.

      • Example:

        • Component Text1 exists, pointing to ComponentVersion “anonv1”. You edit it, creating ComponentVersion “anonv2”, previous version “anonv1”. Component Text1’s Draft pointer points to “anonv2” and published points to nothing.

        • Now you duplicate it (or re-run the course or make a CCX variant, or whatever). Component Text2 now exists (different key, same learning package) and its Draft pointer also points to “anonv2”. Now both Text1 and Text2 share the same data and same version history, but they will diverge if you edit them.

    • When we create a new course run from an existing course run, we would have to duplicate all the Components and Containers, but leave the *Version models alone. This makes it a much more efficient operation, because most of the data is in the *Version models, and it also means the full history is preserved.

    • As for keys, let’s say our CourseRun model has a “key_prefix” (or branch). Then if we have a library, course A run 2023, and course A run 2024 all in the same learning package, the library’s keys can be unprefixed, and the two course runs can have prefixes like “A2023:” and “A2024:” so that there are no collisions among the library and course runs - each has its own namespace without affecting the data model. (Of course you can just as easily make a namespace string field / db column on PublishableEntity if you want this to be more formal.) Example: opaque key “course-v2:org+A+2023:block:problem:p1” maps to PublishableEntity key “A2023:problemp1” and the exact same component in the other run with opaque key “course-v2:org+A+2024:block:problem:p1” maps to “A2024:problemp1”

    • Upsides of this approach: very fast, has nice copy-on-write semantics, separate namespaces per run, much simpler/cleaner history.

    • Downsides: pretty significant change to the data model, ComponentVersions would not be deleted when the Component is deleted (occasional cleanup process required if you want to delete them), version numbers would increment in jumps, based on the highest version used in the learning package rather than the highest version used for that specific component.
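
A rough sketch of the git-like proposal above, in plain Python rather than Django. The class names mirror the notes, but the fields and the methods `edit`, `duplicate`, and `history` are assumptions made for illustration, not the openedx-learning API. Versions point only at their predecessor, so duplicating a Component copies pointers and leaves the version chain shared until one copy is edited.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentVersion:
    """A version points at its predecessor, not back at a Component."""
    data: str
    previous: Optional["ComponentVersion"] = None


@dataclass
class Component:
    key: str
    draft: Optional[ComponentVersion] = None
    published: Optional[ComponentVersion] = None

    def edit(self, data: str) -> None:
        # Chain a new version onto whatever the draft pointer holds now.
        self.draft = ComponentVersion(data, previous=self.draft)

    def duplicate(self, new_key: str) -> "Component":
        # Copy the pointers only; the version objects are shared.
        return Component(new_key, draft=self.draft, published=self.published)

    def history(self) -> list:
        # Walk the chain from the draft back to the first version.
        items, version = [], self.draft
        while version is not None:
            items.append(version.data)
            version = version.previous
        return items
```

For example, `Component("A2023:text1")` edited twice and then duplicated as `"A2024:text1"` yields two components with the same history. A later edit to the copy forks the chain while the original keeps its draft.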

2025-05-06

  • Dave: update

    • implicit compile step for representing components and saving xblock fields

    • LC runtime doesn’t support parents and children

    • tie these realized fields to PEVersion rather than ComponentVersion. That way, anything with field data can use it

    • split modulestore does allow another “backend” than mongo, dave is trying that

    • one row per course per published version of structure document

      • when we publish again, row gets replaced

      • (rather than pruning old structure docs)

      • do we need a draft version of the structure doc?

        • yes, for preview, but not necessarily for POC

  • Kyle: Data modelling

    • class Selector(PubEnt)

    • class SelectorVersion(PubEntVer)

      • partition: Partition

      • variants (reverse relation from Variant.selector_version)

    • class SplitTestVersion(SelectorVersion)

    • class ItemBankVersion(SelectorVersion)

      • count: int

    • class GradeGateVersion(SelectorVersion)

      • a hypothetical custom selector

      • selects a child based on the user’s current overall grade

    • class Variant(Model) [ alt name: class Selection(Model) ? ]

      • entity_list: EntityList

      • selector_version: SelectorVersion

      • group: Group|null

      • a variant is valid iff every non-null field among [selector_version, group] matches the query

        • otherwise, need to re-invoke selection process to determine Variant

        • old matching Variant factors in when determining a new Variant

        • new Variant may need to be generated if it does not exist

    • Partition

    • Group
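
The validity rule for Variant above can be sketched in plain Python. This is an illustration under assumptions, not the planned models: `selector_version` is reduced to an integer id, `group` to a string, and `get_or_create_variant` plus its `choose` callback are hypothetical stand-ins for the selection process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    entity_list: tuple
    selector_version: Optional[int] = None
    group: Optional[str] = None

    def is_valid_for(self, selector_version: int, group: Optional[str]) -> bool:
        """Valid iff every non-null field matches the query."""
        if self.selector_version is not None:
            if self.selector_version != selector_version:
                return False
        if self.group is not None and self.group != group:
            return False
        return True


def get_or_create_variant(variants, selector_version, group, choose):
    """Reuse a valid variant; otherwise re-run selection via `choose`."""
    for variant in variants:
        if variant.is_valid_for(selector_version, group):
            return variant
    # The most recent old variant is passed in so it can inform the new one.
    old = variants[-1] if variants else None
    new = choose(selector_version, group, old)
    variants.append(new)
    return new
```

A null `selector_version` or `group` acts as a wildcard here, which is one reading of "all non-null" in the notes.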

2025-04-29

  • What’s the most minimal MVP we can do for getting course content into learning core

    • Learning core backend for SplitModuleStore

    • class ComponentVersionXBlockData(Model):
          cv = OneToOneField(ComponentVersion)
          content_fields = JSONField()
          settings_fields = JSONField()
    • Generating this would need us to instantiate the xblock runtime

      • Up front as a part of migration?

    • Braden: Would this be a temp thing?

      • Dave: Field data split is long term thing.

    • Braden: How do we handle containers?

      • Kyle: Simple/dumb data hanging off the containers

    • Braden: Switch on a per-course basis?

      • Yes, course waffle flags maybe?

        • But it would be bad to do this on a per-user basis lol

    • Braden: Is the split shim readonly or read/write?

      • Dave: likes readonly

      • consensus on split-shim being read-only

    • Management command to convert.

    • Static assets?

      • Dave: not as important, but should be much simpler. Would like to get this in for MVP

    • Kyle: What do the data models for a course look like?

      • the container that is the course outline, “supersection”

        • does this enforce 3-level hierarchy, or is it more flexible?

          • Braden: lean towards not enforcing the three-level hierarchy

            • Consensus on this.

      • the thing that covers all the other course stuff–policy, textbooks, assets, etc. Different models, combine to form the Learning Context.

      • Could have a Course model with ^ and a linked OutlineRoot model that has child Sections (or Subsections etc if not a 3-layer course)

    • Braden: We have punted on randomized containers

      • Let’s get Split Test Block working

  • What/how we want to present

    • The MVP

      • Mgmt command to convert Split course into LC

      • Readonly shim of Split for LMS (@Dave Ormsbee (Axim) )

      • OutlineRoot (openedx-learning) and CourseRun (edx-platform) models (@Braden MacDonald )

      • Static files

      • SplitTestBlock (@Kyle McCormick)

      • (developer mode)

2025-04-02

  • Pruning

    • We want to be able to prune a PEV. What is the implication on DraftChangeLogRecord and DraftSideEffect?

      1. we must copy all data from pruned PEVs to the change records

      2. give up on pruning. need to sketch out numbers, but the worst offender in modulestore was rewriting of structure docs and static assets

        1. cost of static assets will be much less

      3. hybrid pruning: leave the PEV, but delete the things that hang off of it. publishing would need to be able to have a flag for pruning

      4. deeper pruning:

        1. PublishLog has to stick around, but DraftChangeLog gets less useful over time.

  • Dave: Proposal for field data storage as a separate model that hangs off of ComponentVersion (maybe PublishableEntityVersion) and has separate fields for settings vs. definition scopes.

  • class ComponentVersionXBlockData(Model):
        cv = OneToOneField(ComponentVersion)
        content_fields = JSONField()
        settings_fields = JSONField()

    # OR

    class ComponentVersionXBlockData(Model):
        cv = OneToOneField(ComponentVersion)
        fields = JSONField()

    # OR

    class ComponentVersionContentData(Model):
        cv = OneToOneField(ComponentVersion)
        fields = JSONField()

    class ComponentVersionSettingsData(Model):
        cv = OneToOneField(ComponentVersion)
        fields = JSONField()

    • More optimized for pulling an entire component’s data and/or pulling many components' data

    • Dave and Kyle agree on the desirability of this one, given that we want block-specific granular models to be built out

  • Kyle (counter-proposal):

  • class ComponentVersionXBlockData(Model):
        cv = FK(ComponentVersion)
        field_name = CharField()
        field_value = JSONField()
        is_content_scope = BooleanField()

        class Meta:
            unique_together = (("cv", "field_name"),)

    • More optimized for granular access

    • encroaches on a territory we’d like to see for specific xblocks

      • class ProblemBlockData(XBlockDataModelBase):
            cv = FK(ComponentVersion)
            # ... etc

        class ProblemBlockResponse(Model):
            problem_block = FK(ProblemBlockData)
            ...
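
To compare the two storage shapes discussed above, here is a small sketch that converts between them using plain dicts in place of Django rows. The helper names `to_rows` and `from_rows` are hypothetical. The one-row layout keeps a component's fields in two JSON blobs, while the row-per-field layout stores one record per (cv, field_name).

```python
def to_rows(cv_id, content_fields, settings_fields):
    """Explode the one-row layout into one record per field."""
    rows = []
    for is_content, fields in ((True, content_fields), (False, settings_fields)):
        for name, value in fields.items():
            rows.append({
                "cv": cv_id,
                "field_name": name,
                "field_value": value,
                "is_content_scope": is_content,
            })
    return rows


def from_rows(rows):
    """Collapse per-field records back into (content_fields, settings_fields)."""
    content, settings = {}, {}
    for row in rows:
        target = content if row["is_content_scope"] else settings
        target[row["field_name"]] = row["field_value"]
    return content, settings
```

Reading a whole component costs one record in the first layout and one record per field in the second, which matches the trade-off noted above: the first favors bulk reads, the second favors granular access.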

2025-03-05

  • [kyle] Current thinking: after containers land in libraries, open a prototype branch where we try to get some course content, getting a hacky course running under Learning Core, e.g. single section → subsection → unit.

    • Still not clear where the switching layer would be. Current thought: Ignore modulestore for now.

    • Tentative prototype:

      • new table in the Django admin interface: course run, title, start date, end date, supersection fkey (and that’s from a learning context; initial editing in the library)

      • for backwards compatibility, just use existing keys, i.e. “course-v1”

      • new course run table point to Container or Supersection? (kyle: suggestion for a legacy course run shim that would inject the hierarchy expected)

      • collection for other associated content that’s not hanging off the root?

  • [kyle] Should record decisions that deprecate XBlock capabilities, pave way to remove student_view of sequential

2025-02-05

  • [kyle] start on prototype as mentioned below.

2025-01-09

  • [kyle] I would like to get going on a course-Components-in-LC prototype. Anything to know before I dive in?

    • What are the areas of risk worth focusing on?

      • Just getting my hands dirty with LC

      • APIs are pretty low-level, have been adding new convenience methods, would like to add more of those. Could PR these in separate from the POC

    • Trying entirely using the new LearningCoreRuntime for verticals and see what breaks

    • Serving content outside the tree…. ?

  • [dave] taking a stab at:

    • intermediate ADR for serialization of library format

    • bringing courses that already exist in split into content libraries



2024-12-18

  • Talk is accepted!

  • Last time: modelling dynamic containers

  • Student State

    • CSM is too big to modify

    • New tables, 1-1 with those, gradually pull data over

    • K: If we have a new course key or a way of showing a course is entirely built with new data models, could student state go directly into those without double lookup?

      • Depends on how much backcompat we want

      • There is technically a student state client

      • But lots of plugin code pulls in CSM models and just uses those

      • Analytics too - data package exports

    • We are likely stuck with CSM for a while, and may even need to push CSM updates up to the new models

      • We could cut off CSM usage eventually, but it will take time

    • K: Courses v2 could have a separate place for student state, while v1 courses are still CSM backed

      • We would want authors to be able to port their courses back and forth between v1 and v2

    • Do we need to migrate off of CSM?

      • Moving content to LC is a precursor to moving student state off of CSM

      • One of the big advantages to LC is that storing state becomes much more efficient

        • Because of foreign keys to content rather than varchar keys – much smaller indexes

        • Also ability to reference versions is useful

        • But as long as we’re maintaining write compatibility (i.e. writing out to CSM as well), we don’t get these benefits.

    • Paths:

      • 0. Keep super-complete compatibility by writing CSM updates up to LC

      • 1. Keep complete compatibility by writing LC updates down to CSM

        • This would break plugins that write directly to CSM

      • 2. Keep CSM as a read-only fallback if things aren’t found in new table

        • This would break plugins that read directly from CSM

        • Maybe ProxyModel to maintain the API interface that plugins expect, without actually storing things in CSM?

      • 3. Kill CSM altogether (long term goal)

    • Handful of other tables also have a lot of student state references to content, e.g. grades storage, completion.
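
Path 2 above (CSM as a read-only fallback) could look roughly like this. It is a sketch under assumptions: both stores are plain dicts keyed by (user_id, usage_key), and `StudentStateStore` is a hypothetical name, not an existing edx-platform class.

```python
class StudentStateStore:
    """Reads prefer the new table; CSM is consulted only as a fallback."""

    def __init__(self, new_table, csm):
        self.new_table = new_table  # dict standing in for the new models
        self.csm = csm              # dict standing in for CSM; never written

    def get(self, user_id, usage_key):
        key = (user_id, usage_key)
        if key in self.new_table:
            return self.new_table[key]
        # Not migrated yet: fall back to the legacy record, if any.
        return self.csm.get(key)

    def set(self, user_id, usage_key, state):
        # Writes go only to the new table, so CSM slowly goes stale.
        self.new_table[(user_id, usage_key)] = state
```

Because writes never reach CSM in this sketch, anything that reads CSM directly would see stale data, which is the plugin breakage called out for this path.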

 

Old Notes

Talk Proposal

Title

TBD

Description (<500 words)

TBD

Type

45 minute Talk

Target Audience

Open edX developers. Assumes familiarity with core Open edX authoring and learning features, including XBlock. Assumes general knowledge of Python, Django, SQL.

Proposal

TBD

Rough Talk Outline

  • What are ModuleStore, ContentStore? How do we store content in them today? Why did we do it like that?

  • What are the pitfalls of our current system?

    • Mongo required → more overhead for site operators, more upgrades for maintainers to do.

    • Tightly coupled to other edx-platform code → bugs and complexity for core developers.

    • No relational features like foreign keys, joins, constraints, etc. → fewer correctness guarantees and worse performance.

    • Assumes everything is a course or part of a course → either need to hack around that (libraries v1) or sacrifice potentially interesting features (standalone units)

    • Huge structure document → slow reads

    • Full course traversal required for many computations → limits on what we can efficiently query about any given piece of content

    • Assumes XBlock at every level → Features at any level of the content hierarchy are shoe-horned into the XBlock view of the world, which in some ways is too constraining (assumes Python, assumes in-process, etc), sometimes not constraining enough (always able to define get_children, access control, etc. using completely arbitrary Python), and often both at the same time.

    • XBlocks can do anything and everything → variety of problems listed above

  • How are we working around the pitfalls today?

    • Block Transformers → More efficient reads at any level of the course

      • Downside: ModuleStore wackiness * Graph Theory → Sad Developers

      • Downside: Parallel implementations of content sequencing and access control → Bugs!

      • Downside: Still no relational guarantees

    • Learning Sequences → Super efficient reads and relational guarantees for sections and subsections

      • Downside: Does not go below the subsection level

      • Downside: Parallel implementations of content sequencing and access control → Bugs!

      • Downside: Learning-only. Cannot be used for authoring.

    • Be very very careful → Only a handful of developers trust themselves to work in code that touches ModuleStore and ContentStore. We mostly don’t change it. When we do, we proceed very cautiously and slowly, but still break things sometimes.

  • What’s Learning Core, how does it address these pitfalls, and how are we using it today?

    • Ground-up replatforming of Open edX learning content

    • (Explain more)

    • (Explain how it handles the pitfalls above)

    • (Explain Content Libraries V2)

  • How will we migrate content?

    • Concept: Slicing the problem Horizontally vs Vertically

    • Vertical method:

      • Make a minimal top-to-bottom (course-to-component) working reimplementation of courses in Learning Core. OK if some features are missing.

      • Allow experimental migration of courses which match that constrained feature set.

      • Iteratively add features over time, allowing migration of more and more courses

      • Finally, once the desired features are all migrated, deprecate the remaining features and remove ModuleStore & ContentStore

      • (This is how we did the Learning MFE migration)

    • Horizontal method:

      • Choose a feature of courseware / level of the course tree, and reimplement it fully for all courses.

      • For example: migrate files & uploads, then components, then units, then sections & subsections, then courses.

    • Hybrid method (planned as of time of writing):

      • Use horizontal method for certain low-hanging features (files & uploads, components)

      • Use vertical method for remaining features (unit and above).

Additional Notes

TBD

2024-11-20

  • Sumac

    • Learning Core is getting rolled out to prod for the first time

    • LC Components, Collections, Assets for Libraries v2

  • Teak

    • LC Units, Sections, and Subsections for Libraries v2

    • Migration path for Libraries v1->v2

  • Ulmo

    • Remove Libraries v1

    • Standalone assets (files and uploads) for Libraries v2

    • Standalone assets (files and uploads) for Courses

      • LearningPackage for each course

      • LearningCoreContentstore

      • Slow, major migration (~1TB for 2U?)

      • Rollout questions… all-at-once, or mixed mode?

      • Benefits of asset conversion would be:

        • Cheaper storage

        • Faster/easier querying (particularly if we hook it up to a search backend).

        • May cause higher latency if we maintain the same URLs

  • Verawood

    • LC Components for Libraries

      • Components with children?

        • Some in edx-platform

          • a/b test, randomize, etc

          • Each of these is a Selector

            • Selected content is a Variant (some list of PubEnts)

            • selector.get_or_create_variant_for_user(args) -> Variant

            • Is this a generalization of learning_sequences?

            • Kinda but not really. LSeqs is built as a pipeline of processors which can hide or remove content. Intersection of the resulting sets is the user’s outline

        • Some outside? Do we support? Or deprecate customization of get_children ?

          • Do not break callers of get_children

          • Do deprecate the ability to customize it, though

          • End result is that get_children is the responsibility of the runtime / edx-platform:

          • class XBlock:
                def get_children(self):
                    yield from self.runtime.get_children(self.usage_key)
      • Ideas

        • Just migrate leaf components

          • get_item(leaf_block) → LC
            get_item(parent_block) → splitmongo

        • Implement Unit and below at LC level

        • Thing to watch out for: how to juggle the two different runtimes used for field data persistence.

  • Misc

    • Eventually

      • Leaf blocks can define views with arbitrary python

      • Parent blocks (containers?) are declarative, an external system looks at the rules they declare to determine the course tree
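
The idea above that get_children becomes the runtime's responsibility can be sketched as follows. `XBlockSketch` and `RuntimeSketch` are stand-ins invented for this example, not the real XBlock or runtime classes, and the parent-to-children mapping is a plain dict where the real system would consult declared container rules.

```python
class RuntimeSketch:
    """Hypothetical runtime that owns the parent -> children mapping."""

    def __init__(self, children_by_key):
        self.children_by_key = children_by_key

    def get_children(self, usage_key):
        # The tree comes from data the runtime holds, not from block code.
        yield from self.children_by_key.get(usage_key, [])


class XBlockSketch:
    """Stand-in base class: callers keep get_children, blocks stop customizing it."""

    def __init__(self, runtime, usage_key):
        self.runtime = runtime
        self.usage_key = usage_key

    def get_children(self):
        yield from self.runtime.get_children(self.usage_key)
```

Existing callers of `get_children` keep working, while the answer is decided outside the block, in line with "do not break callers" and "deprecate the ability to customize it".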

2024-11-13

Background assumption:

  • Remaining shims?

    • Some layer of basic shimming

      • LearningCoreModuleStore - thinnish shim layer

        • 80/20

  • At least one overlap release

    • Progressive (course-at-a-time) cutover

  • Long term: No Mongo

    • Remove Mongo first? (via SplitDjangoModuleStore)

      • Issues: Latency of S3 is much less predictable than Mongo

      • Issues: Length of data migration

        • Because V1 content libraries query the structure document at different versions, we’d need to move a ton of structure docs over

          • But, v1 content libraries will be gone by Ulmo

      • Issue: We are still reading CourseBlocks (the root ones) from Old Mongo

    • Current state

      • Active Versions is read from and written to both Mongo and MySQL

        • MySQL is backfilled

        • Pruning would need to be ported over

          • Latency is a worry here. We could do more caching, but it would increase the memory footprint.

      • Structure docs are in Mongo

      • Definition docs are in Mongo

    • Or remove Mongo along with ModuleStore removal?

  • Standalone items:

    • Files and uploads

      • We can emulate these API promises fairly easily.

    • Vertical or horizontal migration?

      • Vertical: Top-down, one entire course at a time

      • Horizontal: All components at once, then all units at once, etc. up the tree

        • Example: Components become backed by Learning Core. modulestore still exists, but get_item delegates to LC when it’s a component.

  • What can be broken? Talk to Jenna. For example:

    • inheriting defaults

    • FBE

    • No breakage > Intentional breakage with DEPR > unintentional breakage

  • Multiple course runs in the same learning package?

    • This would be ideal.

  • Prototype components in Learning Core

  • Transactions.

    • Mongo commits everything immediately

    • MySQL commits it at the end of the request

  • CourseOverviews, Block Transformers, Learning Sequences
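
The horizontal example above, where get_item delegates to Learning Core for components while modulestore still serves everything else, might be sketched like this. `RoutingStore` is a hypothetical name, usage keys are simplified to (block_type, block_id) tuples, and both backends are dicts rather than real stores.

```python
class RoutingStore:
    """Hypothetical get_item shim: leaf components go to Learning Core."""

    def __init__(self, learning_core, modulestore, leaf_types):
        self.learning_core = learning_core  # dict standing in for LC
        self.modulestore = modulestore      # dict standing in for split mongo
        self.leaf_types = frozenset(leaf_types)

    def get_item(self, usage_key):
        block_type, _block_id = usage_key
        if block_type in self.leaf_types:
            backend = self.learning_core
        else:
            backend = self.modulestore
        return backend[usage_key]
```

The routing decision depends only on the block type in this sketch. A real shim would also need to handle the two runtimes used for field data persistence, which the notes flag as something to watch out for.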
