Content Libraries v1 Behavior / Storage Implementation

Context: This was written as a response to https://openedx.atlassian.net/browse/TNL-7454, in which v1-library-sourced block settings were accidentally lost due to a change in their import handling.

The implementation of the original version of content libraries makes the most sense if you frame the development effort as, “What is the fastest way we can develop this, given the code that has already been written for courses?” After all, courses already stored, imported, exported, and rendered XBlocks. Content libraries tried to add just enough on top of those systems to get the intended effect. As a result, under the covers, content libraries look very much like really small courses.

Library and Course Storage Format

Content Libraries and Courses both store all settings-scoped XBlock fields in structure documents in MongoDB, with each document holding all settings-scoped fields for the entire library or course run. These are stored in the modulestore.structures collection. Structure documents are immutable: each one represents a different version. Courses have two branches of structure documents: one for drafts and one for the published versions (used by Studio and the LMS, respectively). Libraries have only one branch, called “library”. Old historical structure documents that are no longer used by Studio or the LMS get periodically removed to save storage space.

In addition to storing the value of all settings-scoped fields, the structure documents also store the IDs for definition documents for the content of each individual problem. The XBlock fields in question are:

Whenever a field is defined with scope=Scope.settings, it ends up in the giant structure document. When a field is defined with scope=Scope.content (e.g. problem text, list of inputs, correct answers, all of which are stored under the data field in ProblemBlock), it gets stored on a per-block/module basis in definition documents (the modulestore.definitions collection in MongoDB). The original split between content and settings fields was intended to facilitate exactly this kind of re-use, where a piece of content is stored separately from the settings associated with its use in a particular course. Unfortunately, it’s not this clean in practice: some fields that were added later and probably should have been content-scoped ended up being settings-scoped instead (e.g. markdown, source_code, use_latex_compiler).
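To make the scope-to-document mapping concrete, here is a minimal sketch in plain Python. The FIELD_SCOPES table and the split_by_scope helper are hypothetical names made up for illustration; the real modulestore derives scopes from the XBlock field declarations rather than from a lookup table like this.

```python
# Hypothetical scope table for a few ProblemBlock fields mentioned above.
# Real XBlocks declare the scope on each field; this is only an illustration.
FIELD_SCOPES = {
    "display_name": "settings",
    "weight": "settings",
    "markdown": "settings",  # settings-scoped, even though it behaves like content
    "data": "content",
}


def split_by_scope(field_values):
    """Partition field values into (structure_fields, definition_fields)."""
    structure_fields = {}
    definition_fields = {}
    for name, value in field_values.items():
        if FIELD_SCOPES[name] == "content":
            # Content scope: stored per block in a definition document.
            definition_fields[name] = value
        else:
            # Settings scope: stored in the big structure document.
            structure_fields[name] = value
    return structure_fields, definition_fields


structure_fields, definition_fields = split_by_scope({
    "display_name": "Library Title for Problem",
    "weight": 10,
    "markdown": "( ) an incorrect answer\n(x) the correct answer\n",
    "data": "<problem>...</problem>",
})
```

Note that markdown lands in the structure side of the split here, which is exactly the awkward case that comes up again further down.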

What happens when you add a LibraryContentBlock to a Course?

Say we have a Library with exactly one problem in it. When we add a LibraryContentBlock pointing to that library into a Course for the first time, it creates a new structure document in the draft branch of that Course. This new structure document has two new entries in its list of blocks: one for the library_content module itself, and one for the problem (ProblemBlock) from the Library.

The library_content module in the Course’s structure document looks like this:

{
  "block_type" : "library_content",
  "edit_info" : {
    "original_usage_version" : null,
    "previous_version" : ObjectId("5f651b216cb7f7a0c2122b4f"),
    "update_version" : ObjectId("5f651b2e6cb7f7a0c2122b50"),
    "edited_by" : 2,
    "source_version" : ObjectId("5f651b216cb7f7a0c2122b4f"),
    "original_usage" : null,
    "edited_on" : ISODate("2020-09-18T20:40:14.230Z")
  },
  "definition" : ObjectId("5f651b106cb7f7a0c2122b4d"),
  "asides" : { },
  "fields" : {
    "source_library_id" : "library-v1:DaveX+2020-09-18",
    "children" : [ [ "problem", "e9e719868536e9418816" ] ],
    "source_library_version" : "5f651a466cb7f7a0c2122b34"
  },
  "block_id" : "82dea980b8c443abbd99f7b588f769c5",
  "defaults" : { }
}

The block_id maps to the last part of the UsageKey for the LibraryContentBlock. So this entry would have a UsageKey of block-v1:DaveX+LibraryTesting+2020-09-18+type@library_content+block@82dea980b8c443abbd99f7b588f769c5 (deriving the first part from the course key).
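As a rough illustration of that derivation, here is a string-only sketch. The usage_key_from_course_key helper is a hypothetical name, and it assumes the course key has the form course-v1:DaveX+LibraryTesting+2020-09-18; production code would go through the opaque-keys library rather than formatting strings by hand.

```python
def usage_key_from_course_key(course_key, block_type, block_id):
    """Build a block-v1 usage key string from a course-v1 course key string.

    Hypothetical helper for illustration only.
    """
    prefix, _, org_course_run = course_key.partition(":")
    if prefix != "course-v1" or not org_course_run:
        raise ValueError(f"Not a course-v1 key: {course_key!r}")
    return f"block-v1:{org_course_run}+type@{block_type}+block@{block_id}"


usage_key = usage_key_from_course_key(
    "course-v1:DaveX+LibraryTesting+2020-09-18",
    "library_content",
    "82dea980b8c443abbd99f7b588f769c5",
)
```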

The definition doesn’t have anything actually interesting in it, since LibraryContentBlock stores all its fields in the settings scope (again: only content-scoped fields end up in the definition documents).

The interesting bits here are the fields, which map to the settings-scoped fields source_library_id and source_library_version; these are also attributes we see in the OLX for this block. The source_library_version is the ObjectId (MongoDB identifier) for the structure document that represents the exact version of the Library that this LibraryContentBlock is referencing.

The children field has a list of all the problems that we’re using from the library. This maps to the OLX we see in the Course’s export:

<library_content source_library_id="library-v1:DaveX+2020-09-18" source_library_version="5f651a466cb7f7a0c2122b34">
  <problem url_name="e9e719868536e9418816"/>
</library_content>

The LibraryContentBlock is using the same sort of container-like mechanisms that Units (VerticalBlock) use to render themselves. When the LibraryContentBlock is asked to render its student_view to display to students in the LMS, it’s going to check its settings and the user’s state and decide to render some number of its child problem blocks based on that.
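The selection step can be sketched roughly as follows. This is not the actual LibraryContentBlock code: select_children, max_count, and previously_selected are names assumed for illustration, standing in for "a setting that says how many children to show" and "what this user was already assigned".

```python
import random


def select_children(children, max_count, previously_selected, rng=None):
    """Decide which child blocks one student gets to see.

    Keeps any earlier picks that are still valid children, then tops up
    with random picks from the rest until max_count is reached.
    """
    rng = rng or random.Random()
    kept = [child for child in previously_selected if child in children][:max_count]
    pool = [child for child in children if child not in kept]
    needed = min(max_count - len(kept), len(pool))
    return kept + rng.sample(pool, needed)


# With the single-problem library from this example there is nothing to choose.
children = [("problem", "e9e719868536e9418816")]
selected = select_children(children, max_count=1, previously_selected=[])
```

Keeping earlier picks stable matters because a student who reloads the page should see the same problems they were already working on.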

The XBlock runtime doesn’t know how to arrange it so that a block in one structure/course has child elements from other courses. Doing so would be a performance headache because of how large structure documents are, and a security headache because so many of our permissions are course/library based. So the problem block ID that this LibraryContentBlock is referencing is not the problem as it is stored in the source Library (where it has a block_id of 93120d90875545ff87da76fc2484f209). The LibraryContentBlock is making a child reference to the Course’s copy of the Library’s problem.

The problem data in the Library’s structure document looks like this:

{
  "block_type" : "problem",
  "edit_info" : {
    "original_usage_version" : null,
    "previous_version" : ObjectId("5f6516636cb7f7a0c2122b30"),
    "update_version" : ObjectId("5f6516926cb7f7a0c2122b31"),
    "edited_by" : 2,
    "source_version" : null,
    "original_usage" : null,
    "edited_on" : ISODate("2020-09-18T20:20:34.123Z")
  },
  "definition" : ObjectId("5f6516636cb7f7a0c2122b2f"),
  "asides" : { },
  "fields" : {
    "attempts_before_showanswer_button" : 2,
    "weight" : 10,
    "showanswer" : "always",
    "display_name" : "Library Title for Problem",
    "markdown" : "You can use this template as a guide to the simple editor markdown and OLX markup to use for multiple choice problems. Edit this component to replace this template with your own assessment.\n\n>>Add the question text, or prompt, here. This text is required.||You can add an optional tip or note related to the prompt like this. <<\n\n( ) an incorrect answer\n(x) the correct answer\n( ) an incorrect answer\n",
    "max_attempts" : 5,
    "rerandomize" : "always"
  },
  "block_id" : "93120d90875545ff87da76fc2484f209",
  "defaults" : { }
}

The same problem data in the Course’s structure document looks like this:

I won’t go into all the details, but some things worth noting:

Shared Definitions

The text of the problem, the input types, response types, etc. are not in either of these structure documents. They’re stored in the definition document, and the structure block entries for this problem in the Library and in the Course both point to the same definition: ObjectId("5f6516636cb7f7a0c2122b2f").

Block ID (and UsageKey) Generation

The two block_id entries for this Problem are:

  • e9e719868536e9418816 – Course

  • 93120d90875545ff87da76fc2484f209 – Library

If they’re both machine generated, why is the Library’s Problem block_id longer? It looks like the Course version of the Problem gets its block_id set by hashing the Library’s source block_id and the Course’s destination LibraryContentBlock.
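A sketch of what that kind of derivation could look like is below. The exact inputs, separator, and hash function used by the modulestore are assumptions on my part (SHA-1 truncated to 20 hex characters is a guess that matches the observed length), so this is not expected to reproduce e9e719868536e9418816. The derive_child_block_id name is hypothetical.

```python
import hashlib


def derive_child_block_id(dest_parent_block_id, source_block_id):
    """Derive a deterministic 20-character block_id for a copied library block.

    Assumption: the real implementation may hash different inputs.
    """
    unique_data = f"{dest_parent_block_id}:{source_block_id}"
    return hashlib.sha1(unique_data.encode("utf-8")).hexdigest()[:20]


course_block_id = derive_child_block_id(
    "82dea980b8c443abbd99f7b588f769c5",  # the Course's LibraryContentBlock
    "93120d90875545ff87da76fc2484f209",  # the Library's problem
)
```

The useful property of a hash here is determinism: refreshing from the library yields the same Course-side block_id for the same source problem and the same destination LibraryContentBlock, while the same problem used under a different LibraryContentBlock gets a different ID.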

Fields vs. Overrides

The Library stores the settings-scoped fields like display_name and weight in its fields. The Course’s problem has nothing currently listed in its fields, aside from the empty children (ProblemBlock can never have children, it’s just that all XBlocks store that field, even if it’s always empty). Instead, the Course’s version of the problem copies all those Library-set fields into the Course’s problem’s defaults dictionary. The fields dictionary in the Course version of the problem is reserved for overrides.

Overall, this is good news, because that means that this settings data exists in a way that’s associated with the course. As I mentioned before, this makes things much more predictable from a security and performance standpoint. Making this copy also means that the Course is better insulated from changes made to the Library, like it being deleted or having older versions pruned to save space.

Having a clear separation of things that were specified in the Library vs. overrides set by the Course also makes things much cleaner from a tracking/reasoning point of view in the data. So that’s great on that point as well.
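The lookup order this implies can be sketched as follows. The course_problem dictionary is an illustrative stand-in built from the description above (Library-set values in defaults, overrides in fields), not a dump of a real document, and effective_value and revert_to_library are hypothetical helper names.

```python
def effective_value(block_entry, name, code_default=None):
    """Resolve a setting: Course override, then Library default, then code default."""
    if name in block_entry["fields"]:
        return block_entry["fields"][name]
    if name in block_entry["defaults"]:
        return block_entry["defaults"][name]
    return code_default


def revert_to_library(block_entry, name):
    """Drop the Course override so the Library-set default shows through again."""
    block_entry["fields"].pop(name, None)


# Illustrative Course-side entry: Library values live in defaults.
course_problem = {
    "fields": {"children": []},
    "defaults": {"weight": 10, "display_name": "Library Title for Problem"},
}

weight_before = effective_value(course_problem, "weight")   # Library default
course_problem["fields"]["weight"] = 5                      # Course override
weight_overridden = effective_value(course_problem, "weight")
revert_to_library(course_problem, "weight")
weight_after = effective_value(course_problem, "weight")    # back to the default
```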

There are a couple of important caveats though:

  1. Missing Markdown
    It can’t copy over the markdown for a problem because, while that field is declared in the settings scope, it should really be content-scoped. In particular, editing it will try to update the value of the problem’s XML, which is stored in the content scope. So you can edit it, but you can’t meaningfully override the setting without also editing the content, and remember, the Library and Course are still pointing to the same definition document. Making it so that the definitions are separable is a lot more work, so the hack was to always strip the value for markdown when copying into a course.

  2. Missing in OLX
    The default values in the Course’s version of the problem do not export in OLX. Only the overridden field values get exported. This is part of the reason why the import process was modified to refresh those values from the library. If you exported from one course and imported into another, there would be no other way to get at the Library-set defaults, because the OLX wouldn’t carry over that data. Unfortunately, this was kind of a giant hammer that always grabs the latest version from the library for that metadata. On the bright side, I believe that the giant hammer-that-ignores-versions means that if you copy the library to the new instance first, things should “work”.
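Both caveats can be seen in a small sketch. The copy_library_block_to_course and export_olx functions are hypothetical simplifications, not the modulestore or export code: the first strips markdown while copying Library fields into defaults, and the second writes only the overrides in fields as OLX attributes.

```python
import xml.etree.ElementTree as ET


def copy_library_block_to_course(library_entry, new_block_id):
    """Copy a Library block entry into a Course, stripping markdown."""
    defaults = {
        name: value
        for name, value in library_entry["fields"].items()
        if name != "markdown"  # caveat 1: markdown is never copied
    }
    return {
        "block_type": library_entry["block_type"],
        "block_id": new_block_id,
        "definition": library_entry["definition"],  # still shared with the Library
        "fields": {},  # reserved for Course overrides
        "defaults": defaults,
    }


def export_olx(course_entry):
    """Serialize a block entry to OLX, writing overrides only (caveat 2)."""
    element = ET.Element(
        course_entry["block_type"], {"url_name": course_entry["block_id"]}
    )
    for name, value in course_entry["fields"].items():
        element.set(name, str(value))
    return ET.tostring(element, encoding="unicode")


library_entry = {
    "block_type": "problem",
    "block_id": "93120d90875545ff87da76fc2484f209",
    "definition": "5f6516636cb7f7a0c2122b2f",
    "fields": {
        "weight": 10,
        "showanswer": "always",
        "markdown": "( ) an incorrect answer\n(x) the correct answer\n",
    },
}
course_entry = copy_library_block_to_course(library_entry, "e9e719868536e9418816")
olx_without_override = export_olx(course_entry)  # no weight, no showanswer
course_entry["fields"]["weight"] = 5
olx_with_override = export_olx(course_entry)     # weight appears, showanswer does not
```

In this sketch the Library-set showanswer never reaches the OLX, which is the data loss that the refresh-on-import behavior was papering over.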

Could we just always export the defaults as fields, to carry that data across different instances of Open edX? Maybe. I’ll write more about that in the “where do we go from here” part of this multi-post. But the complicating issues are that:

  • There is no native distinction in OLX between defaults and overrides.

  • There are import/export roundtrip scenarios that might lead to surprising issues/ambiguities (e.g. we may lose the ability to revert to library settings, which is currently a feature of LibraryContentBlock).

  • There’s a weird interplay between defaults and inherited values. XBlocks have a notion of default values vs. what’s set on the XBlock explicitly, but the default values are almost always either inherited or coming from whatever the field default is declared to be in code. We special-case InheritingFieldData to make the default come straight from the defaults part of the course’s structure document entry for any block contained inside a LibraryContentBlock, instead of by the usual mechanisms. This works, but it violates a layer of abstraction, as the storage mechanism is now using special handling for one specific XBlock type.

    • To illustrate the issue more concretely: We currently don’t export default values for XBlocks because the user never specified their values. If you don’t set an attribute in your OLX, it shouldn’t appear there when you export. If you only set something on a sequential, it shouldn’t be echoed down into every module underneath it on export. But because this defaulting mechanism was overloaded to pull in values from the Content Library’s version of the problem (by copying it to the defaults for the Course’s copy of the problem), we can’t just start exporting defaults without exporting a whole bunch of garbage that isn’t there today. So we’d have to special case it again on the export as well.

These things could, and maybe should, be done. But anything that changes import/export serialization is not simple, and we have to be especially careful about compatibility. Also, the goal here is not to completely fix the current implementation of Content Libraries. The two goals are to fix it for the two recent bugs that have come up, and to make design notes so that we get this less wrong in the next version of the feature that is under development.