Migrating Courses to Learning Core

@Kyle McCormick @Dave Ormsbee (Axim)

2024-12-18

  • Talk is accepted!

  • Last time: modelling dynamic containers

  • Student State

    • CSM is too big to modify

    • New tables, 1-1 with those, gradually pull data over

    • K: If we have a new course key or a way of showing a course is entirely built with new data models, could student state go directly into those without double lookup?

      • Depends on how much backcompat we want

      • There is technically a student state client

      • But lots of plugin code pulls in CSM models and just uses those

      • Analytics too - data package exports

    • We are likely stuck with CSM for a while, and may even need to push CSM updates up to the new models

      • We could cut off CSM usage eventually, but it will take time

    • K: Courses v2 could have a separate place for student state, while v1 courses are still CSM backed

      • We would want authors to be able to port their courses back and forth between v1 and v2

    • Do we need to migrate off of CSM?

      • Moving content to LC is a precursor to moving student state off of CSM

      • One of the big advantages to LC is that storing state becomes much more efficient

        • Because of foreign keys to content rather than varchar keys – much smaller indexes

        • Also ability to reference versions is useful

        • But as long as we’re maintaining write compatibility (i.e. writing out to CSM as well), we don’t get these benefits.

    • Paths:

      • 0. Keep super-complete compatibility by writing CSM updates up to LC

      • 1. Keep complete compatibility by writing LC updates down to CSM

        • This would break plugins that write directly to CSM

      • 2. Keep CSM as a read-only fallback if things aren’t found in new table

        • This would break plugins that read directly from CSM

        • Maybe ProxyModel to maintain the API interface that plugins expect, without actually storing things in CSM?

      • 3. Kill CSM altogether (long term goal)

    • Handful of other tables also have a lot of student state references to content, e.g. grades storage, completion.
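
Path 2 above (CSM as a read-only fallback) could be sketched roughly as follows. This is a minimal illustration using in-memory dicts, not the actual student state client: `StateStore`, `get_state`, `set_state`, and the (user id, usage key) tuple are hypothetical, and real CSM rows are Django models with more fields.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical key shape: (user_id, usage_key). Real CSM keys differ.
StateKey = tuple[int, str]


@dataclass
class StateStore:
    """Sketch of Path 2: new table first, legacy CSM as a read-only fallback."""

    new_table: dict[StateKey, dict] = field(default_factory=dict)
    legacy_csm: dict[StateKey, dict] = field(default_factory=dict)

    def get_state(self, user_id: int, usage_key: str) -> Optional[dict]:
        key = (user_id, usage_key)
        if key in self.new_table:
            return self.new_table[key]
        # Fallback: this row has not been pulled over to the new table yet.
        return self.legacy_csm.get(key)

    def set_state(self, user_id: int, usage_key: str, state: dict) -> None:
        # Writes only ever go to the new table; CSM is never modified.
        self.new_table[(user_id, usage_key)] = state


store = StateStore(legacy_csm={(1, "problem-1"): {"attempts": 2}})
legacy_read = store.get_state(1, "problem-1")   # served from the CSM fallback
store.set_state(1, "problem-1", {"attempts": 3})
new_read = store.get_state(1, "problem-1")      # served from the new table
```

After the write, the legacy row still holds the old value, which is why code that reads CSM directly would see stale data under this path.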

 

Old Notes

Talk Proposal

Title

TBD

Description (<500 words)

TBD

Type

45 minute Talk

Target Audience

Open edX developers. Assumes familiarity with core Open edX authoring and learning features, including XBlock. Assumes general knowledge of Python, Django, SQL.

Proposal

TBD

Rough Talk Outline

  • What are ModuleStore, ContentStore? How do we store content in them today? Why did we do it like that?

  • What are the pitfalls of our current system?

    • Mongo required → more overhead for site operators, more upgrades for maintainers to do.

    • Tightly coupled to other edx-platform code → bugs and complexity for core developers.

    • No relational features like foreign keys, joins, constraints, etc. → fewer correctness guarantees and worse performance.

    • Assumes everything is a course or part of a course → either need to hack around that (libraries v1) or sacrifice potentially interesting features (standalone units)

    • Huge structure document → slow reads

    • Full course traversal required for many computations → limits on what we can efficiently query about any given piece of content

    • Assumes XBlock at every level → Features at any level of the content hierarchy are shoe-horned into the XBlock view of the world, which in some ways is too constraining (assumes Python, assumes in-process, etc.), sometimes not constraining enough (always able to define get_children, access control, etc. using completely arbitrary Python), and often both at the same time.

    • XBlocks can do anything and everything → variety of problems listed above

  • How are we working around the pitfalls today?

    • Block Transformers → More efficient reads at any level of the course

      • Downside: ModuleStore wackiness * Graph Theory → Sad Developers

      • Downside: Parallel implementations of content sequencing and access control → Bugs!

      • Downside: Still no relational guarantees

    • Learning Sequences → Super efficient reads and relational guarantees for sections and subsections

      • Downside: Does not go below the subsection level

      • Downside: Parallel implementations of content sequencing and access control → Bugs!

      • Downside: Learning-only. Cannot be used for authoring.

    • Be very very careful → Only a handful of developers trust themselves to work in code that touches ModuleStore and ContentStore. We mostly don’t change it. When we do, we proceed very cautiously and slowly, but still break things sometimes.

  • What’s Learning Core, how does it address these pitfalls, and how are we using it today?

    • Ground-up replatforming of Open edX learning content

    • (Explain more)

    • (Explain how it handles the pitfalls above)

    • (Explain Content Libraries V2)

  • How will we migrate content?

    • Concept: Slicing the problem Horizontally vs Vertically

    • Vertical method:

      • Make a minimal top-to-bottom (course-to-component) working reimplementation of courses in Learning Core. OK if some features are missing.

      • Allow experimental migration of courses which match that constrained feature set.

      • Iteratively add features over time, allowing migration of more and more courses

      • Finally, once the desired features are all migrated, deprecate the remaining features and remove ModuleStore & ContentStore

      • (This is how we did the Learning MFE migration)

    • Horizontal method:

      • Choose a feature of courseware / level of the course tree, and reimplement it fully for all courses.

      • For example: migrate files & uploads, then components, then units, then sections & subsections, then courses.

    • Hybrid method (planned as of time of writing):

      • Use horizontal method for certain low-hanging features (files & uploads, components)

      • Use vertical method for remaining features (unit and above).

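The vertical method's "allow experimental migration of courses which match that constrained feature set" step could be gated by a check along these lines. This is only a sketch: `SUPPORTED_BLOCK_TYPES`, `unsupported_block_types`, and `can_migrate` are hypothetical names, and the set of supported types is an assumption chosen for illustration.

```python
# Assumption: block types the Learning Core course implementation handles so far.
SUPPORTED_BLOCK_TYPES = frozenset(
    {"course", "chapter", "sequential", "vertical", "html", "problem"}
)


def unsupported_block_types(course_block_types, supported=SUPPORTED_BLOCK_TYPES):
    """Return the block types used by a course that are not yet supported."""
    return set(course_block_types) - set(supported)


def can_migrate(course_block_types, supported=SUPPORTED_BLOCK_TYPES):
    """A course is eligible only if every block type it uses is supported."""
    return not unsupported_block_types(course_block_types, supported)


simple_course = {"course", "chapter", "sequential", "vertical", "html"}
complex_course = simple_course | {"split_test"}

simple_ok = can_migrate(simple_course)
complex_ok = can_migrate(complex_course)
blockers = unsupported_block_types(complex_course)
```

As features are added iteratively, the supported set grows and more courses pass the check.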

Additional Notes

TBD

2024-11-20

  • Sumac

    • Learning Core is getting rolled out to prod for the first time

    • LC Components, Collections, Assets for Libraries v2

  • Teak

    • LC Units, Sections, and Subsections for Libraries v2

    • Migration path for Libraries v1->v2

  • Ulmo

    • Remove Libraries v1

    • Standalone assets (files and uploads) for Libraries v2

    • Standalone assets (files and uploads) for Courses

      • LearningPackage for each course

      • LearningCoreContentstore

      • Slow, major migration (~1TB for 2U?)

      • Rollout questions… all-at-once, or mixed mode?

      • Benefits of asset conversion would be:

        • Cheaper storage

        • Faster/easier querying (particularly if we hook it up to a search backend).

        • May cause higher latency if we maintain the same URLs

  • Verawood

    • LC Components for Libraries

      • Components with children?

        • Some in edx-platform

          • a/b test, randomize, etc

          • Each of these is a Selector

            • Selected content is a Variant (some list of PubEnts)

            • selector.get_or_create_variant_for_user(args) -> Variant

            • Is this a generalization of learning_sequences?

            • Kinda but not really. LSeqs is built as a pipeline of processors which can hide or remove content. The intersection of the resulting sets is the user's outline

        • Some outside? Do we support? Or deprecate customization of get_children ?

          • Do not break callers of get_children

          • Do deprecate the ability to customize it, though

          • End result is that get_children is the responsibility of the runtime / edx-platform:

          • class XBlock:
                def get_children(self):
                    yield from self.runtime.get_children(self.usage_key)
      • Ideas

        • Just migrate leaf components

          • get_item(leaf_block) → LC
            get_item(parent_block) → splitmongo

        • Implement Unit and below at LC level

        • Thing to watch out for: how to juggle the two different runtimes used for field data persistence.

  • Misc

    • Eventually

      • Leaf blocks can define views with arbitrary python

      • Parent blocks (containers?) are declarative, an external system looks at the rules they declare to determine the course tree
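
The Selector/Variant idea and the runtime-owned get_children from the Verawood notes could fit together roughly like this. Only the names `Selector`, `Variant`, `get_or_create_variant_for_user`, and `runtime.get_children` come from the notes. The `RandomizeSelector` and `Runtime` classes, their fields, and the per-user seeding are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """The content selected for one user: a list of published entity keys."""
    entity_keys: tuple[str, ...]


@dataclass
class RandomizeSelector:
    """Hypothetical Selector that picks `count` of its children per user."""
    children: list[str]
    count: int
    _variants: dict[int, Variant] = field(default_factory=dict, repr=False)

    def get_or_create_variant_for_user(self, user_id: int) -> Variant:
        if user_id not in self._variants:
            # Seeding by user id is an assumption; it just keeps the demo stable.
            rng = random.Random(user_id)
            picked = rng.sample(self.children, self.count)
            self._variants[user_id] = Variant(tuple(picked))
        return self._variants[user_id]


@dataclass
class Runtime:
    """Hypothetical runtime that owns child lookup instead of the block."""
    user_id: int
    selectors: dict[str, RandomizeSelector]
    static_children: dict[str, list[str]]

    def get_children(self, usage_key: str):
        selector = self.selectors.get(usage_key)
        if selector is not None:
            variant = selector.get_or_create_variant_for_user(self.user_id)
            yield from variant.entity_keys
        else:
            yield from self.static_children.get(usage_key, [])


selector = RandomizeSelector(children=["p1", "p2", "p3", "p4"], count=2)
runtime = Runtime(
    user_id=7,
    selectors={"randomized": selector},
    static_children={"unit": ["html1", "p9"]},
)
first = list(runtime.get_children("randomized"))
second = list(runtime.get_children("randomized"))
unit_children = list(runtime.get_children("unit"))
```

Because the variant is stored on first use, repeated calls for the same user return the same children, and blocks no longer need their own get_children logic.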

2024-11-13

Background assumption:

  • Remaining shims?

    • Some layer of basic shimming

      • LearningCoreModuleStore - thinnish shim layer

        • 80/20

  • At least one overlap release

    • Progressive (course-at-a-time) cutover

  • Long term: No Mongo

    • Remove Mongo first? (via SplitDjangoModuleStore)

      • Issues: Latency of S3 is much less predictable than Mongo

      • Issues: Length of data migration

        • Because V1 content libraries query the structure document at different versions, we’d need to move a ton of structure docs over

          • But, v1 content libraries will be gone by Ulmo

      • Issue: We are still reading CourseBlocks (the root ones) from Old Mongo

    • Current state

      • Active Versions is read from and written to both Mongo and MySQL

        • MySQL is backfilled

        • Pruning would need to be ported over

          • Latency is a worry here. We could do more caching, but it would increase the memory footprint.

      • Structure docs are in Mongo

      • Definition docs are in Mongo

    • Or remove Mongo along with ModuleStore removal?

  • Standalone items:

    • Files and uploads

      • We can emulate these API promises fairly easily.

    • Vertical or horizontal migration?

      • Vertical: Top-down, one entire course at a time

      • Horizontal: All components at once, then all units at once, etc. up the tree

        • Example: Components become backed by Learning Core. modulestore still exists, but get_item delegates to LC when it’s a component.

  • What can be broken? Talk to Jenna. For example:

    • inheriting defaults

    • FBE

    • No breakage > Intentional breakage with DEPR > unintentional breakage

  • Multiple course runs in the same learning package?

    • This would be ideal.

  • Prototype components in Learning Core

  • Transactions.

    • Mongo commits everything immediately

    • MySQL commits it at the end of the request

  • CourseOverviews, Block Transformers, Learning Sequences
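
The horizontal example in the notes above (modulestore still exists, but get_item delegates to LC when the block is a component) might look roughly like the following. This is a sketch with in-memory stand-ins: `DelegatingStore`, `LEAF_BLOCK_TYPES`, and both backend dicts are hypothetical, and the real modulestore `get_item` takes a usage key object rather than a (type, id) pair.

```python
# Assumption: which block types count as leaf components.
LEAF_BLOCK_TYPES = frozenset({"problem", "html", "video"})


class DelegatingStore:
    """Routes leaf components to Learning Core and everything else to split mongo."""

    def __init__(self, learning_core: dict, split_mongo: dict):
        self.learning_core = learning_core
        self.split_mongo = split_mongo

    def backend_name(self, block_type: str) -> str:
        return "learning_core" if block_type in LEAF_BLOCK_TYPES else "split_mongo"

    def get_item(self, block_type: str, block_id: str):
        # Callers keep using get_item; only the storage behind it changes.
        backend = getattr(self, self.backend_name(block_type))
        return backend[(block_type, block_id)]


store = DelegatingStore(
    learning_core={("problem", "q1"): {"title": "Quiz question"}},
    split_mongo={("vertical", "u1"): {"children": [("problem", "q1")]}},
)
leaf = store.get_item("problem", "q1")      # read from the Learning Core stand-in
parent = store.get_item("vertical", "u1")   # read from the split mongo stand-in
```

The caller-facing signature stays the same in both cases, which is the point of shimming at this layer.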