Open edX Core - Arch Sync Notes

Team: @Kyle McCormick @Dave Ormsbee (Axim) @Braden MacDonald

Epic: https://github.com/openedx/openedx-learning/issues/353

Jun 9, 2026

May 5, 2026

  • Next steps between now and verawood.1

    • [Dave] Course import into a library is painfully slow. The current pattern is to add and publish components one by one, meaning that even if events are processed on a PublishLog basis, we’re doing hundreds to thousands of them per import. Can we draft all the changes first and then publish all at once?

      • [Braden] PR: https://github.com/openedx/openedx-platform/pull/38508 Does that speed it up?

        • [Dave] It speeds up the actual database part of it, but it’s the meilisearch re-indexing that’s glacial. It looks like each block gets its own task. I guess ideally, we’d want to fire off the search updates in bulk based on the DraftChangeLog/PublishLog events.

          • [Braden] Thought that the modulestore side was slower than the meili side

          • [Dave] Not sure

        • [Dave] Also, this reminds me that we should have a .drafts property on DraftChangeLog so we can do something like publish_from_drafts(lp.id, change_log.drafts). “Publish the DraftChangeLog I just made” has to be a really common use case. Maybe its own function entirely.

        • [Dave] This is the top priority bugfix

        • [Braden] Will take a look as priority

    • [Braden] https://github.com/openedx/openedx-platform/issues/38507 - fix for Verawood?

      • We should do a targeted fix for this

    • [Kyle] PRs that didn’t make it:

      • https://github.com/openedx/openedx-core/pull/580 (“breaking change” but honestly it’s a bugfix)

        • Try to fix this after modulestore_migrator is resolved

        • Backport it, time permitting

        • master PR is all ready to go

        • Braden will do this one

      • https://github.com/openedx/openedx-core/pull/573 (breaking change, although it guards against several buggy ways to call the API)

        • Not this one - punt to Willow

        • Good for DevX and for avoiding buggy usage of the API, but too breaking to backport right now

      • https://github.com/openedx/openedx-core/pull/564 (non-breaking change, just nice to have)

        • Not this one - punt to Willow

        • Very nice both for DevX and for performance but not critical

      • Options for each one:

        • Backport into 1.x branch and install that branch into verawood

        • Put it aside for now, come back to it after the conference for Willow

  • [Dave] Generally, we should be taking a look at performance

  • What’s the release date?

    • Jun 23, 2026

    • Performance and docs stuff will mostly land post-conference

  • [Dave] Do we want to cut a 1.x branch? I’m concerned about the unintended breakage that can ripple out to platform, particularly if we do anything that might require data migration. Even going from 0.47 to 0.48 to 1.0 caused unexpected problems.

    • [Kyle] I think this is a great idea

    • [Kyle] Created verawood-backports, to be versioned 1.0.2, 1.0.3, etc.

      • main will start with 1.1.0 / 2.0.0 and continue from there

  • [Dave] Can we use some of Braden’s remaining hours to explore Courses-on-Core work?

    • [Kyle] Yes 100%

    • Braden will use remaining time on this

  • Kyle/Dave will talk about resourcing for Courses-In-Core

  • [Dave] I’ve been doing some docs work with Claude in the background

    • So nice

    • rst formatting in docstrings

    • Currently in draft, big

    • Docstrings need updates - will do separate

    • Will do a similar one on the API side later

  • timeline

    • Kyle mostly focused on conference for next two weeks

    • Braden away during conference and until June 9th ish

    • Kyle away after conference until June 9th ish, then will be focused on docs and performance
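
Sketch for the course-import item above (draft everything first, then publish once). This is a toy in-memory model, not the real openedx-core API: `ToyLearningPackage`, `draft()`, `publish_from_drafts()` and the `.drafts` list on `DraftChangeLog` are hypothetical stand-ins for the idea Dave described. It only illustrates that publishing one DraftChangeLog in one go yields a single publish log, so downstream consumers such as search indexing could react once per import instead of once per component.

```python
from dataclasses import dataclass, field


@dataclass
class DraftChangeLog:
    """Hypothetical stand-in: the entity keys drafted during one bulk operation."""
    drafts: list = field(default_factory=list)


@dataclass
class ToyLearningPackage:
    """Toy in-memory package; NOT the real openedx-core model."""
    draft_versions: dict = field(default_factory=dict)
    published_versions: dict = field(default_factory=dict)
    publish_logs: list = field(default_factory=list)

    def draft(self, key, change_log):
        # Bump the draft version and remember the key in the change log.
        self.draft_versions[key] = self.draft_versions.get(key, 0) + 1
        change_log.drafts.append(key)

    def publish_from_drafts(self, keys):
        # Publish all given drafts together: exactly one publish log per call.
        for key in keys:
            self.published_versions[key] = self.draft_versions[key]
        self.publish_logs.append(list(keys))


def import_one_by_one(lp, keys):
    """Current pattern: add and publish each component separately."""
    for key in keys:
        log = DraftChangeLog()
        lp.draft(key, log)
        lp.publish_from_drafts(log.drafts)


def import_in_bulk(lp, keys):
    """Proposed pattern: draft everything, then publish the change log once."""
    log = DraftChangeLog()
    for key in keys:
        lp.draft(key, log)
    lp.publish_from_drafts(log.drafts)


keys = [f"problem-{i}" for i in range(300)]
slow, fast = ToyLearningPackage(), ToyLearningPackage()
import_one_by_one(slow, keys)
import_in_bulk(fast, keys)
print(len(slow.publish_logs), len(fast.publish_logs))  # 300 1
```

Both packages end up with the same published versions; only the number of publish logs (and therefore the number of events a handler would see) differs.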

Apr 28, 2026

Apr 21, 2026

  • Release week

  • Epic: https://github.com/openedx/openedx-core/issues/353

  • @Kyle McCormick List of PRs to land before the release

  • Learning package delete event?

    • there’s a library delete event, but you can delete a library w/o deleting the learning package

    • restore package makes a bunch of learning packages which are “orphaned” until hooked up to a library. don’t think we ever wrote code to delete the ones that don’t get promoted into being libraries

    • we definitely want to be able to delete learning packages

    • yes to LEARNING_PACKAGE_DELETED event

    • does that ^ trigger COLLECTION_DELETED and ENTITY_DELETED ? ENTITY_REMOVED_FROM_COLLECTION ?

      • Dave: we should probably just have LEARNING_PACKAGE_DELETED, not triggering the others

      • Braden: we’d need to figure out which entities and collections to delete from meili based on LEARNING_PACKAGE_DELETED

      • Kyle: does COLLECTION_DELETED trigger ENTITY_REMOVED_FROM_COLLECTION ?

        • Braden: COLLECTION_UPDATED handles all of these

    • should be fine to just raise LEARNING_PACKAGE_DELETED , and not individual ENTITY_DELETED events, since it’s not undoable, and handling it just means wiping out all associated data

    • Braden will need to figure out relationship between LEARNING_PACKAGE_DELETED and CONTENT_LIBRARY_DELETED

  • https://github.com/openedx/openedx-core/pull/554

    • This PR is just restore, not backup

    • Does not break backup_restore python API used by platform

    • New set of tests

    • Currently

      • extract from archive ← schema exists.

      • validation ← just using ootb pydantic validation. error reporting not working.

      • load into learning packages ←

    • Reviewable EOD tomorrow?

    • Current philosophy is “don’t be permissive”

      • but it’s more permissive in that you can be missing data/fields, or you can add extra fields

      • it is not permissive in that if you have a broken entity import, it blows up the whole import

        • i.e. no partially successful import

          • would be nice to have a list of what went wrong

      • regardless, error messages will be linked back to source toml file

    • Is this a public API?

      • no

      • Only difference is that if Willow+ archives add fields, this would be more tolerant

    • does it allow multiple entities in one toml file?

      • no

      • would be reluctant to without declaring a v2
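
Sketch of the restore philosophy discussed above for PR 554: tolerant of extra or missing optional fields, but all-or-nothing if any entity is broken, with every error linked back to its source toml file. This is plain Python over already-parsed dicts, not the PR's Pydantic models; the field names (`key`, `title`) and the `restore()` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RestoreError:
    source_file: str  # the toml file the problem came from
    message: str


def validate_entity(source_file, data):
    """Return the errors for one parsed file; unknown extra fields are ignored."""
    errors = []
    # Assumption for this toy: "key" is the only hard requirement.
    if not isinstance(data.get("key"), str):
        errors.append(RestoreError(source_file, "missing or non-string 'key'"))
    # "title" is optional, but must be a string when present.
    if "title" in data and not isinstance(data["title"], str):
        errors.append(RestoreError(source_file, "'title' must be a string"))
    return errors


def restore(parsed_files):
    """Return (loaded_entities, errors). Any error means nothing is loaded."""
    errors = []
    for source_file, data in parsed_files.items():
        errors.extend(validate_entity(source_file, data))
    if errors:
        return {}, errors  # no partially successful import
    return {data["key"]: data for data in parsed_files.values()}, []


files = {
    "entities/a.toml": {"key": "a", "title": "A", "future_field": 1},  # extra field: fine
    "entities/b.toml": {"key": "b"},                                   # missing title: fine
    "entities/c.toml": {"title": "no key"},                            # broken entity
}
loaded, errors = restore(files)
for error in errors:
    print(f"{error.source_file}: {error.message}")
```

Because the errors are collected before anything is loaded, the caller gets the full list of what went wrong in one pass, which is the "nice to have" mentioned above.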


Apr 14, 2026

  • ~1.5 weeks out from Verawood cutoff

  • @Dave Ormsbee (Axim) Proposed high level structure and abstractions for validation/restore flow:

    • (Zip Archive | Local Path | GitHub | etc.) → Storage Filesystem (fsspec)

    • Extract TOML from fs and compile into UnvalidatedCompletePackageInput

      • Rationale: Allow a way to have input formats that are better tailored to specific use cases, e.g. editing entire sections in the same file.

      • Contents

        • Unvalidated Learning Package dict (simple Python types)

          • Includes all TOML content (entities, versions, containers, collections), but does not include media files like block.xml or static assets.

          • This is meant to be a minimal transformation of the input TOML.

          • Structured to try to prevent certain types of logical errors, e.g. refs are used as keys in a dict, so there’s no way to represent duplicate definition of entities.

        • Errors

          • Mostly consistency errors, e.g. redefining the same entity multiple times.

          • Not the actual JSON Schema validation, but just errors when converting the source TOML into the compiled dict.

        • Resources (where to find media data for later)

    • UnvalidatedCompletePackageInput → CompletePackageInput + errors

      • This is done using Pydantic models.

      • Input and output models will be kept separate (output models are much stricter about requiring certain fields).

      • JSON Schema is generated from the input model.

      • Two levels of errors:

        • Ones that JSON Schema can handle, e.g. missing fields, regex not matching, wrong types, etc.

        • Deeper ones that JSON Schema can’t deal with, like referential integrity (pointers to versions that don’t exist, containers referencing children that don’t exist, etc.)

        • Missing resources

      • Strict mode?

    • CompletePackageInput → LearningPackage

    • Jesper: Having documentation of the end-to-end restore pipeline will be good

      • @Jesper Hodge Did I get that right ^ ?

    • Braden: We have many things: tar.gzs, xml, zips, json, we want to have a sqlite format…

      • Data researchers would probably want to have an ability to import openedx archives with a specific library - do we want this as a separate openedx_data library ? Probably not necessary…

    • Kyle: could this be modified to be update instead of create?

      • Dave: trickiest thing is version numbering

        • imagine that the archive specified v2 of a component, but you also have a v2

        • Jesper: when importing a version, it should always become the highest (newest) version

      • Kyle: concerned about the idea of having separate formats for full restore vs. partial import

    • Braden: stagedcontent

      • should this new format be used to represent stagedcontent

      • would be great if we could represent everything in a library as a file, just like we can with OLX today, enabling things like copy-paste and drag-and-drop UIs
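
Sketch of the first two stages of the validation/restore flow Dave proposed above: compiling per-file TOML data into one dict keyed by ref (so duplicate definitions cannot be represented and are recorded as errors instead), followed by a referential-integrity pass of the kind JSON Schema cannot express. The class and field names (`UnvalidatedPackage`, `ref`, `children`) are guesses for illustration, the input is already-parsed dicts rather than real TOML files, and the actual implementation is planned around Pydantic models rather than this hand-rolled code.

```python
from dataclasses import dataclass, field


@dataclass
class UnvalidatedPackage:
    """Toy analogue of UnvalidatedCompletePackageInput (names are assumptions)."""
    entities: dict = field(default_factory=dict)  # ref -> raw entity dict
    errors: list = field(default_factory=list)    # consistency errors from compiling


def compile_package(parsed_files):
    """Merge (path, dict) pairs into one dict keyed by ref, recording redefinitions."""
    package = UnvalidatedPackage()
    for path, data in parsed_files:
        ref = data.get("ref")
        if ref is None:
            package.errors.append(f"{path}: no ref")
        elif ref in package.entities:
            package.errors.append(f"{path}: entity '{ref}' is already defined")
        else:
            package.entities[ref] = data
    return package


def check_references(package):
    """Deeper check: every child listed by a container must exist in the package."""
    errors = []
    for ref, data in package.entities.items():
        for child in data.get("children", []):
            if child not in package.entities:
                errors.append(f"'{ref}' references missing child '{child}'")
    return errors


files = [
    ("unit1.toml", {"ref": "unit1", "children": ["text1", "video9"]}),
    ("text1.toml", {"ref": "text1"}),
    ("text1_copy.toml", {"ref": "text1"}),  # redefinition of text1
]
pkg = compile_package(files)
print(pkg.errors)             # ["text1_copy.toml: entity 'text1' is already defined"]
print(check_references(pkg))  # ["'unit1' references missing child 'video9'"]
```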


Apr 7, 2026

  • ~2.5 weeks out from Verawood cutoff

  • @Kyle McCormick Whiteboarding - “north star” architecture https://excalidraw.com/#room=9e33ec3e3ebf9175de2b,nAqtkyhx59SUYlEl9a-XiQ


  • @Kyle McCormick @Braden MacDonald - Confirm we want LEARNING_PACKAGES_* events (https://github.com/openedx/openedx-core/issues/462#issuecomment-4193595258 )

  • @Kyle McCormick @Dave Ormsbee (Axim) Met with MITx physics course author teams, who use OLX heavily.

    • They have several repos, each holds 1 or more related courses

      • possible argument for learning packages holding multiple courses / course runs

    • Automated workflow: merge to master triggers the XML to import into their staging env

      • Last-minute tweaks may be made in studio, but are wiped out upon next course update from git. Ad hoc process for remembering to make fixes back in XML.

    • Each section is an XML file, holds structure down to unit or component level

      • toml does not support multiple levels in one file. seems like it’d be easy to support that, though?

    • Many units are authored in .tex files and converted in unit XML via latex2edx

      • does this imply that units should be able to hold assets?

  • Decision: For Verawood, just use pydantic to validate and document the current format. Worry about pluggability later, as we’re considering not even sticking with TOML (sqlite? more OLX?) long term.

  • @Braden MacDonald opened a PR to make type annotations for primary keys.

Mar 31, 2026

  • ~3.5 weeks out from Verawood cutoff

  • @Dave Ormsbee (Axim) : Does enrollment get its own top level app in openedx-core? (In the context of LearningPathway enrollments.)

  • @Kyle McCormick Sample plugin ideas:

    • Feanil and I are giving a conference workshop on how to build multiple kinds of plugins into one unified omni-plugin https://github.com/openedx/sample-plugin . Would like to get some openedx-core “plugin” representation in there. We’re thinking,

      • Course card archival

      • “Reviewed by ___”

        • model: ReviewedStatus(TimeStamped)

          • PE

          • DraftChangeLogEntry

          • User

        • rest api for marking as reviewed

        • (new?) Sidebar slot

        • new Filter: EntityPrePublish (or model pre-save signal)

          • be careful with PublishLog

          • should it remove things from the publish list, or cancel the whole publish?

            • just abort it - removing things would be full of footguns

            • removing things would have to happen at an earlier layer in order to be safer

        • ambitious: PublishReviewedItems

          • get_entities_with_unpublished_drafts

            • no dependencies - just actually reviewed things
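
Sketch of the "just abort it" behavior discussed above for the proposed EntityPrePublish filter in the “Reviewed by ___” sample plugin. This is plain Python, not the openedx-filters API: `PublishAborted`, `require_reviewed()` and `publish()` are hypothetical names, and the reviewed state is a simple set rather than a ReviewedStatus model.

```python
class PublishAborted(Exception):
    """Raised by a pre-publish filter step to cancel the whole publish."""


def require_reviewed(entity_keys, reviewed_keys):
    """Hypothetical filter step: abort unless every entity has been reviewed."""
    unreviewed = [key for key in entity_keys if key not in reviewed_keys]
    if unreviewed:
        # Abort everything rather than silently dropping items from the list.
        raise PublishAborted("not reviewed: " + ", ".join(unreviewed))
    return entity_keys


def publish(entity_keys, reviewed_keys, published):
    """Run the filter first; `published` is only touched if the whole set passes."""
    approved = require_reviewed(entity_keys, reviewed_keys)
    published.update(approved)


published = set()
try:
    publish(["text1", "problem2"], reviewed_keys={"text1"}, published=published)
except PublishAborted as exc:
    print(exc)        # not reviewed: problem2
print(published)      # set()
```

Nothing is published in the failing case, including the already-reviewed `text1`, which is the point of aborting instead of removing items from the publish list.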

Mar 24, 2026

  • ~4.5 weeks out from Verawood cutoff

Mar 17, 2026

  • @Kyle McCormick Key Coherency for openedx-core v1.0

  • @Kyle McCormick I’d like to revisit Braden’s Version branching proposal one more time, and decide if there’s anything we’d like to do before minting v1.0.

    • Alternate proposal from @Braden MacDonald :

      • What if we try for a more git-like model where our *Version models (e.g. ComponentVersion, PublishableEntityVersion) no longer point back to their unversioned model (Component, PublishableEntity), but instead just point to their previous *Version. Then, we can have multiple instances of a Component (with different keys) pointing at the same data (same *Version). They would have the same history, but if you modify one it would “fork” the versions, and diverge from the other.

        • Example:

          • Component Text1 exists, pointing to ComponentVersion “anonv1”. You edit it, creating ComponentVersion “anonv2”, previous version “anonv1”. Component Text1’s Draft pointer points to “anonv2” and published points to nothing.

          • Now you duplicate it (or re-run the course or make a CCX variant, or whatever). Component Text2 now exists (different key, same learning package) and its Draft pointer also points to “anonv2”. Now both Text1 and Text2 share the same data and same version history, but they will diverge if you edit them.

      • When we create a new course run from an existing course run, we would have to duplicate all the Components and Containers, but leave the *Version models alone. This makes it a much more efficient operation, because most of the data is in the *Version models, and it also means the full history is preserved.

      • As for keys, let’s say our CourseRun model has a “key_prefix” (or branch). Then if we have a library, course A run 2023, and course A run 2024 all in the same learning package, the library’s keys can be unprefixed, and the two course runs can have prefixes like “A2023:” and “A2024:” so that there are no collisions among the library and course runs - each has its own namespace without affecting the data model. (Of course you can just as easily make a namespace string field / db column on PublishableEntity if you want this to be more formal.) Example: opaque key “course-v2:org+A+2023:block:problem:p1” maps to PublishableEntity key “A2023:problemp1” and the exact same component in the other run with opaque key “course-v2:org+A+2024:block:problem:p1” maps to “A2024:problemp1”

      • Upsides of this approach: very fast, has nice copy-on-write semantics, separate namespaces per run, much simpler/cleaner history.

      • Downsides: pretty significant change to the data model, ComponentVersions would not be deleted when the Component is deleted (occasional cleanup process required if you want to delete them), version numbers would be incrementing in jumps based on the highest version used in the learning package, not the highest version used for that specific component.
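
Sketch of Braden's git-like proposal above, using plain dataclasses rather than Django models. The class and function names are illustrative assumptions; the only things taken from the notes are that a *Version points to its previous version instead of back to its Component, that duplicating copies pointers only, and that course runs get distinct key prefixes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentVersion:
    """Points only at its predecessor, never back at a Component."""
    label: str
    data: str
    previous: Optional["ComponentVersion"] = None


@dataclass
class Component:
    key: str
    draft: Optional[ComponentVersion] = None
    published: Optional[ComponentVersion] = None

    def edit(self, label, data):
        # A new version chains onto whatever this component's draft points at.
        self.draft = ComponentVersion(label, data, previous=self.draft)

    def history(self):
        # Walk the previous-pointers from the draft, newest first.
        labels, version = [], self.draft
        while version is not None:
            labels.append(version.label)
            version = version.previous
        return labels


def duplicate(source, new_key):
    """Copy only the pointers; the version objects themselves are shared."""
    return Component(new_key, draft=source.draft, published=source.published)


def entity_key(key_prefix, local_key):
    """Namespace a key per course run, e.g. prefix 'A2023:'."""
    return f"{key_prefix}{local_key}"


text1 = Component(entity_key("A2023:", "text1"))
text1.edit("anonv1", "Hello")
text1.edit("anonv2", "Hello, world")

text2 = duplicate(text1, entity_key("A2024:", "text1"))
print(text2.draft is text1.draft)  # True: same data, same history

text2.edit("anonv3", "Hello, 2024")  # editing one forks it from the other
print(text1.history())  # ['anonv2', 'anonv1']
print(text2.history())  # ['anonv3', 'anonv2', 'anonv1']
```

The duplicate costs one new Component object regardless of how long the history is, which is the copy-on-write upside listed above; the sketch also makes the cleanup downside visible, since nothing here removes a version when its last Component goes away.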

Mar 11, 2026

  • @Braden MacDonald - containers

    • Most (all?) settings we have today for containers are really course policies.
