Blockstore Design
This is the most current Blockstore design document, but many details continue to be refined in conversations on the Issues page of the Blockstore repo. Some of the topics still under debate are:
- How granular is a Bundle in different use cases (i.e. single problem, entire unit, outline of entire course, etc.)?
- Exactly what files get placed where inside a Bundle?
- What does the import/export look like for courses and content libraries?
This is the design document for Blockstore, a system for authoring, discovering, and reusing educational content. Development is being funded by Harvard LabXchange and the Amgen Foundation, with significant in-kind contributions from edX.
Abstract
All lesson content in the Open edX platform is currently stored in the modulestore, which requires that all content is organized into “courses” that are each a directed acyclic graph (DAG) of XBlocks/XModules (or in “libraries” which are implemented in the same way as courses, but which have a shallower graph and support a limited set of content types).
This proposal outlines a design for a new service that stores content for the Open edX platform, called “Blockstore.” Blockstore is meant to be a lower-level service than the modulestore, and it is designed around the concept of storing small, reusable pieces of content, rather than large, fixed content structures such as courses. In other systems and academic contexts, these are often called “learning objects,” and Blockstore is thus a type of Learning Object Repository (LOR). For Open edX, Blockstore is designed to facilitate a much greater level of content re-use than is currently possible, enable new adaptive learning features, and enable delivery of learning content in new ways (not just large traditional courses).
Motivation
At its heart, edx-platform's current modulestore works with large, static course structures. Various dynamic courseware features such as A/B tests, cohorts, and randomized problem banks work around this by copying every piece of content that might be displayed to any user and then selectively showing a subset of that using permission access checks. When you use a randomized problem bank in a sequence, the system is in fact copying the entire content library into that sequence.
This poses a number of problems:
- It creates very large data structures, degrading courseware performance. Many common courseware interactions noticeably slow down as the amount of content in a course increases.
- The underlying structure is static, so the ordering of elements is fixed, making adaptive learning sequences extremely cumbersome to implement. Course teams have heroically worked around this using LTI hacks, using Open edX as both an LTI provider and consumer in chained LTI launches (sequences with one unit that acts as an LTI consumer to an adaptive engine interface that then becomes an LTI consumer for individual problems in the original course).
- Course content is largely duplicated for every run, making it cumbersome to manage across multiple runs, especially if those runs are on different instances of Open edX as is the case with some partners.
- Trying to work around these limitations and maintain performance has significantly complicated the codebase and slowed feature development. Content Libraries are far less powerful than they were intended to be because of the large infrastructure changes that would have been required to execute the original vision.
General Themes / Concepts
The high level ideas that ground this proposal are:
- Blockstore stores data in Content Bundles, which are a local grouping of files that Blockstore knows little about.
Blockstore doesn't understand much about the things inside of it. There is no special data structure within the core of Blockstore for Sequences vs. Units vs. anything else. OLX content, smaller assets like images, and larger assets like videos are all stored as files in Content Bundles, using conventions and groupings that make sense to the client application. A separate plugin layer will be able to listen to and take action for particular types of Content Bundles.
- Blockstore is a lower level storage abstraction that XBlocks (and other clients) build upon.
We will compose Blockstore primitives in various ways to store content, but there isn't a 1:1 mapping of concepts. For instance, a Collection is not equivalent to a Course or a Library. A Collection might in fact store multiple Course Runs and multiple Library equivalents. A Content Bundle might be used to store a Sequence, an individual Problem, or the outline of a Course Run. The concrete primitives that Blockstore offers are versioned storage and the ability to access files in other Bundles using Links. This gives us a lot of flexibility, but requires us to be disciplined about how we use it.
- Blockstore represents author intent and grouping. It favors author-friendliness even if it makes certain bookkeeping harder.
A Content Bundle in Blockstore is something an author wants to edit, version, import, and export as a single thing. That means a Bundle can be a single problem or an entire sequence. Things stored in Blockstore are not read-optimized, and are not the data structure that students interact with in the end. The definitions of a mostly static learning sequence and a learning sequence with an adaptive component might look completely different when stored in Blockstore, even if the Learner eventually experiences them in a similar way. The imported and exported bundle that is a Content Bundle should be as author-friendly as possible – assets are grouped together with where they're used, and as few Blockstore concepts as possible should leak into how the content is written.
- Versioned content is the core of Blockstore, and plugin extensibility is focused around annotating that content.
Things that create, transform, update, or execute content live outside Blockstore. Plugins know when content has been changed, but they don't modify content. Plugins maintain their own data and APIs. Plugin data changes can happen outside of the lifecycle of the content itself. This means that an export of the same version of a Content Bundle will always yield the same authored content, but may yield different plugin metadata (example: new tags that were added). Also, versions are meaningful, and not every edit of every file spawns a new version. A version is like a "commit" in that sense.
Layers
Name | Responsibilities |
---|---|
Core | This is the storage of the content itself, and essential mechanisms for updating it. This layer doesn't understand anything about the actual files in the Content Bundle. |
Persistence / BundleDataStore | Low level swappable piece that determines how we store the files for bundles. |
Plugin | This is the more extensible layer that manages the constellation of metadata about a Bundle. It should be easy to add these over time, possibly in a separate repo. This layer actually does understand the contents, and might subscribe to events for particular content types. |
Execution (external process) | The XBlock runtime that actually executes content, probably at a Unit level. This lives in a separate process and needs to know how to grab content from Blockstore for the purpose of preview and authoring. |
Blockstore Core Layer Concepts
Term | Definition |
---|---|
Content Bundle | A group of files that are versioned together and can be accessed from other Content Bundles. |
Bundle Version | An immutable snapshot of a Content Bundle. |
Link | The method by which one Content Bundle references and uses content from another. Since a Content Bundle is a collection of files, this is conceptually a symlink between one Content Bundle Version and another, via a special named folder (e.g. `.blockstore/links/want_tutorial/videos/wand_demo.mp4`). |
Version Range | A lot of annotation around a Content Bundle (e.g. tagging around teaching standards) is going to be about content that may change over time as new Versions are created. Associating them with the Content Bundle as a whole might be inaccurate. Associating them with specific Versions might be wasteful, particularly when the changes are relatively minor. Treating a Version Range as a first class concept would help to simplify data modeling in other parts of the system. |
Draft | A mutable space for changes to be made before they are committed to a Bundle Version. |
Collection | 1:M grouping of Content Bundles. Questions: |
Signal | We'll emit named Django signals for life cycle events around the Core layer, including: |
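As a rough illustration of how the Draft and Bundle Version concepts relate, here is a minimal in-memory sketch. The class and field names (`bundle_uuid`, `version_num`, `files`, `links`, `base_version`, `commit`) are assumptions made for this example and are not part of the design; the real models would live in the Django ORM.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BundleVersion:
    """A snapshot of a Bundle; frozen so its attributes cannot be reassigned."""
    bundle_uuid: uuid.UUID
    version_num: int
    files: dict[str, str]  # path -> content hash (hypothetical layout)
    links: dict[str, str] = field(default_factory=dict)  # alias -> "bundle_id@version"


@dataclass
class Draft:
    """A mutable staging area for edits made on top of an existing version."""
    bundle_uuid: uuid.UUID
    base_version: int
    files: dict[str, str] = field(default_factory=dict)

    def commit(self) -> BundleVersion:
        # Copy the file mapping so later Draft edits don't leak into the snapshot.
        return BundleVersion(self.bundle_uuid, self.base_version + 1, dict(self.files))
```

Many edits can accumulate in a Draft before a single `commit()`, which matches the idea that a version behaves like a "commit" rather than a record of every file save.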
Higher Level Concepts
These concepts exist for the systems that author content in Blockstore, but Blockstore itself is unaware of them.
Term | Definition |
---|---|
Learning Context | A grouping against which student state is stored. Open Issues: |
Block | The smallest piece of content that can be authored, this maps to a leaf-node XBlock ("Component" in Studio) in edx-platform today. |
Unit | An ordered list of Blocks that is the smallest chunk in which content can be consumed. |
Sequence | Sequences are a linear group of Units. |
Why Files?
This proposal leans on the file system metaphor more strongly than the initial proposal. Some advantages to this:
- We can unify content storage and lifecycle updates.
Studio currently stores authored course content in three different ways. Data that manifests as content and settings scoped XBlock field data are stored in the ModuleStore, as sort-of versioned documents in MongoDB. Smaller binary assets like images and PDF files are stored without versioning via the ContentStore interface, which writes to GridFS. For performance reasons, self-hosted videos are typically put in S3. Treating these in a more uniform way will help when building add-on functionality like search, tagging, and licensing. It should also lower cost and operational complexity.
- It simplifies the overall design.
Blockstore offers the capability to store, version, and reuse content. The more it knows about the internals of that content, the more coupled it will be, making it hard to adapt and make changes. For example, adding a major feature like internationalization of OLX should require no changes to the core of Blockstore. Files also help with the partial sharing use cases (more on that in the Content Reuse section).
- Assets can get surprisingly sophisticated.
Splitting the world into sophisticated XBlock-like data and simple binary blob assets is intuitive, but many assets are more sophisticated than they appear. A Video can be represented by a single mp4 file, or it can be a set of HLS files with different quality video encodings, multiple audio tracks, and subtitles for many languages. An ebook may come in three different formats, that are grouped, licensed, and updated in sync with each other. Even simple assets can become more complex when internationalization comes into play.
- It leads to more author-friendly grouping for reuse.
OLX and static assets can live side by side in the same versioned Bundle, without requiring any external links. If there are precursor files like source LaTeX files that compile out to OLX, those can be stored in the same Bundle. They would be ignored by the XBlock runtime, but still be very valuable to store, version, and share.
- We need a serialization format for import and export anyway.
Import and export are going to be a critical part of the supported workflow. This lets Blockstore be agnostic to OLX conventions that are handled at the XBlock runtime layer. The only file conventions it imposes are its own simple notions of Bundle metadata and Links.
Using Blockstore to Model Courseware
So what would the Content Bundles look like?
Content Bundle File Conventions
- Blockstore exposes Bundle-level metadata as a .blockstore directory.
- This folder is a fully virtual folder (nothing actually exists there on S3, and it is optionally materialized on export).
- Metadata is exported as JSON files.
- All Links to other Content Bundles will be of the form links/{alias}.
- Link mapping is stored in the .blockstore/info.json file on export.
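One way to read the links/{alias} convention is as a lookup from an alias to a specific Bundle Version. The sketch below assumes the mapping is a dictionary of alias to "bundle_id@version" strings, borrowing the format used in the summary-file example later in this document. The function name `resolve_link` is hypothetical.

```python
def resolve_link(path: str, links: dict[str, str]) -> tuple[str, str, str]:
    """Split 'links/{alias}/rest/of/path' into (bundle_id, version, rest)."""
    parts = path.split("/", 2)
    if len(parts) < 3 or parts[0] != "links":
        raise ValueError(f"not a link path: {path!r}")
    _, alias, rest = parts
    # Assumed mapping format: "bundle_id@version".
    bundle_id, version = links[alias].rsplit("@", 1)
    return bundle_id, version, rest
```

A client resolving `links/week_001/sequence.xml` would then fetch `sequence.xml` from the pinned version of the linked Bundle.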
Standalone Problem (leaf-level XBlock, appropriate for problem banks)
    quicksort_complexity/
        # Actual definition of the OLX for the problem.
        problem.xml
        # Convention: Files in static/ will be available to the browser during execution.
        static/
            diagram.png
        # Blockstore metadata
        .blockstore/
            info.json  # UUID, version, title, links, dependencies
Video (centrally managed, probably in a separate Collection)
    the_perfect_egg/
        # VideoModule OLX
        video.xml
        static/
            hls/
                playlist.m3u8
                # Dozens of alternate languages and encodings would
                # be in the following directories:
                audio/
                subs/
                video/
        # Blockstore metadata
        .blockstore/
            info.json  # UUID, version, title, links, dependencies
Static Sequence (either standalone pathway or part of a Course)
    wand_tutorial/
        sequence.xml
        links/
            beginner_wand/video.xml  # Link to another Bundle (mapping is in info.json)
            wand_safety/video.xml
        static/
            ollivander.jpeg
        units/
            construction.xml
            maintenance.xml
            materials.xml
            ownership.xml
        # Blockstore metadata
        .blockstore/
            info.json  # UUID, version, title, links, dependencies
Course Run Example
The core part of this would be some sort of document that has navigation information and provides pointers to all the different sequences that make up a course. For example, you could have a Content Bundle that defined an XML file that looks something like:
    <course-navigation>
      <chapter name="Magic Basics">
        <!-- Policy related information like deadlines should live in a
             separate file, to make reuse easier. -->
        <!-- It's possible that we could treat the Course as its own
             Context, but it might be nice to be able to manually specify
             what context each sequence should be considered a part of. -->
        <sequence src="links/magic_source/sequence.xml"/>
        <exam src="links/magic_ethics_exam/exam.xml"/>
      </chapter>
      <chapter name="Magic History">
        <!-- etc. -->
      </chapter>
    </course-navigation>
Some opinions that this approach has:
- Sequences are separate entities, Chapters (and other hierarchy between Course and Sequences) are details of Course.
In this particular mindset, "chapters" don't really exist as a separate entity, but only as a navigational convenience of Course. Everything is sequences and things that point to sequences. If a Course wants a flat list of sequences or a hierarchy an extra level deep – that's purely a Course Navigation concern. Unlike today, you wouldn't expect to be able to call a student_view on a chapter (that doesn't really work well in practice, but the framework currently allows for it).
- Course navigation isn't an XBlock, and might not even be OLX.
For content compatibility reasons, it's really important that the individual Units be OLX. Maybe the Sequences as well, though I think there's a decent case to be made either way there (and even if it is OLX, it doesn't necessarily have to be backed by an XBlock runtime). But the things-that-point-to-sequences don't necessarily need to be OLX. The data model for courses/chapters is a combination of a simple container hierarchy that can be translated to in a straightforward way, and a mess of global policy attributes that need to be moved out. It should have some reasonably human-readable serialized format, but whether that's XML, JSON, YAML, or something else that better fits our "X + overrides" use cases is still an open question.
- This is not the LMS data representation.
The way sequences are referenced here would be terribly inefficient if we had to make read calls to each sequence to get the titles to display. When we actually get to the representation in the LMS, we might want a uniform way to specify sequences that enables much more dynamic behavior by relying on queries of sequence relationships. But Blockstore is for authoring, and surfacing whatever is simplest and most intuitive for that.
Content Reuse
Content Bundles are created with the intended granularity of reuse. If you intend to have a bunch of problems for a problem bank, a Bundle is an individual problem. If you have a sequence that can be reused in multiple contexts, then the Bundle is your entire sequence. To use either of these, you would make a Link to it and reference the appropriate linked OLX file from a file in your own Bundle (e.g. a CourseRun definition referencing the above Sequence).
Reusing Content Bundles in their entirety is intended to be the common case, but it's possible that someone will want to reuse a small part of an existing Bundle. In that case, you can still use Links and reference a file directly, such as an individual Unit or image that's part of a larger sequence.
Reuse in the ways the original author designed for is done by referencing whole Bundles. Reuse in ways that are permitted, but that the original author did not design for, is done by referencing specific files within Bundles.
Implementation Details
A non-exhaustive stab at some of the key implementation issues.
Storage
We have to be able to handle files that run from very small to extremely large. The metadata for Bundles will use the Django ORM, but the storage interface for Bundle files will be a pluggable backend that supports at least S3 and the local filesystem. Using django-storages would be nice, but I'm not entirely sure it supports all the features we'd want (e.g. "Content-Disposition" header support so we could reuse the same blob of data with different names).
Git or Mercurial have come up as possible backing stores as well. The main reason I avoid them is unknown operational complexity. At the granularity of the separate repositories, even a straightforward port of edx.org content could yield something on the order of a million repositories, with over 50 TB of video data. We already manage that in S3, but we have no experience running git repositories at this scale. AWS's EFS (their hosted NFS solution) performs horribly for git workloads, so we'd have to manage it ourselves. Our usage of video means that git-lfs is likely a requirement. None of this is insurmountable, but it raises the uncertainty and costs, and we likely won't take advantage of sufficient features to justify it.
Storage in an S3 store could look like:
- /{bundle_uuid}/data/{file named after hash}
All source data files are stored by hash, to allow for cheap renames across versions. URLs can be sent with custom content-disposition headers to enable browsers to download them with sensible filenames. Using the UUID as the start of the name makes us less likely to run into per-partition performance throttling from S3.
- /{bundle_uuid}/versions/{version}/mapping.json
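The key layout above can be sketched in a few lines. The choice of SHA-256 is an assumption (the design does not name a hash algorithm), and the helper names `data_key`, `mapping_key`, and `content_disposition` are hypothetical.

```python
import hashlib


def data_key(bundle_uuid: str, content: bytes) -> str:
    """Key for a file's bytes; depends only on content, never on its name."""
    digest = hashlib.sha256(content).hexdigest()  # hash algorithm is an assumption
    return f"/{bundle_uuid}/data/{digest}"


def mapping_key(bundle_uuid: str, version: int) -> str:
    """Key for the per-version file that maps paths to content hashes."""
    return f"/{bundle_uuid}/versions/{version}/mapping.json"


def content_disposition(filename: str) -> str:
    """Header value that lets a hash-named blob download under a readable name."""
    return f'attachment; filename="{filename}"'
```

Because the data key ignores the filename, renaming `diagram.png` to `figure1.png` in a new version only changes that version's mapping file; no bytes are copied.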
Q: What should be done in a situation where an asset was marked as public in an earlier version but private in a later version (or vice versa)?
Import/Export
Simple import and export of a single Bundle can be done with a tar.gz or zip file. But import and export that involves multiple Bundles at a time would benefit from a command line tool that could talk to our API – particularly if it wants to export a whole Course's worth of Bundles and preserve Links to other Bundles (sometimes even multiple versions of the same Bundles).
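For the single-Bundle case, a tar.gz round trip needs nothing beyond the Python standard library. This is only a sketch: the in-memory `files` dictionary stands in for whatever the storage backend returns, and the function names are hypothetical.

```python
import io
import tarfile


def export_bundle(name: str, files: dict[str, bytes]) -> bytes:
    """Pack a Bundle's files into a tar.gz archive under a top-level folder."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(files):
            data = files[path]
            info = tarfile.TarInfo(name=f"{name}/{path}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def import_bundle(archive: bytes) -> dict[str, bytes]:
    """Read every regular file in the archive back into a path -> bytes dict."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                out[member.name] = tar.extractfile(member).read()
    return out
```

Preserving Links across several Bundles is exactly the part this does not handle, which is why a command line tool talking to the API is suggested for that case.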
Scale and Schema
Current edx.org has approximately:
- ~7K courses
- ~5K content libraries
- ~40 TB of content data
- The vast majority of that is video at various encodings (including the raws)
- ~2 TB of non-video static assets
- ~400 GB of versioned XBlock content data
Our design goal would be to support a 100X increase in this, ideally without requiring partitioning, since that's not compatible with foreign key constraints. We've had operational experience with tables over 1B rows, but we probably don't want to push our design beyond that if possible.
Model | Rows for edx.org | Rows at 100X |
---|---|---|
Collection | ~10K | ~1M |
Bundle | ~1M (~20 per course, more for content libraries, plus each video) | ~100M |
BundleVersion | ~10M | ~1B |
Limits we're going to set for Links and Files:
- Max 100 Files per BundleVersion
- Max 2,000 total dependencies for a BundleVersion (including dependencies of dependencies)
- Chosen to accommodate the largest known courses.
If we assume these limits, then naively creating rows on a per-BundleVersion basis will quickly explode the tables for Links and Files beyond what we want. One approach around this is to more smartly collapse redundant information across Bundle Versions, and another is to take the data out of the database entirely and into the file store.
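The arithmetic behind that concern, as a worst-case upper bound that assumes every BundleVersion hits both limits (real usage would be lower):

```python
BUNDLE_VERSIONS_TODAY = 10_000_000   # ~10M rows for edx.org
SCALE = 100                          # design goal: 100X
MAX_FILES = 100                      # per BundleVersion
MAX_DEPENDENCIES = 2_000             # per BundleVersion, transitive
ROW_BUDGET = 1_000_000_000           # ~1B rows is the comfortable ceiling

bundle_versions = BUNDLE_VERSIONS_TODAY * SCALE        # 1e9
naive_file_rows = bundle_versions * MAX_FILES          # 1e11
naive_dep_rows = bundle_versions * MAX_DEPENDENCIES    # 2e12

print(naive_file_rows // ROW_BUDGET)  # 100x over budget
print(naive_dep_rows // ROW_BUDGET)   # 2000x over budget
```

Even if typical BundleVersions use a small fraction of the limits, a per-BundleVersion row for every file or dependency lands well past the 1B-row ceiling.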
BundleVersionFiles
Access patterns:
- Common:
- Get URL for single file in a BundleVersion.
- Get names/URLs for all files for a given BundleVersion.
- Less frequent:
- What files changed in this BundleVersion?
- What is the history of this file across all BundleVersions?
The files for a given BundleVersion will be tracked using a summary JSON file per BundleVersion, stored by the file store interface (i.e. in S3). A drawback here is that it's not easy to track things at a per-file level. On the bright side, it's really simple to implement and understand.
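To illustrate how the access patterns map onto a summary file, here is a sketch that assumes the "files" layout from the summary-file example later in this document (groups such as "private" and "public", each mapping a path to a hash). The function names are hypothetical.

```python
def flatten_files(summary: dict) -> dict[str, str]:
    """Merge all file groups into one path -> hash mapping."""
    flat: dict[str, str] = {}
    for group in summary.get("files", {}).values():
        flat.update(group)
    return flat


def changed_files(old: dict, new: dict) -> dict[str, set[str]]:
    """Answer 'what files changed in this BundleVersion?' from two summaries."""
    a, b = flatten_files(old), flatten_files(new)
    return {
        "added": set(b) - set(a),
        "removed": set(a) - set(b),
        "modified": {p for p in a.keys() & b.keys() if a[p] != b[p]},
    }
```

The common lookups need one summary file and the version diff needs two. The history of a single file is the expensive case, since it means reading the summary for every BundleVersion.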
Links
Link relationships are more complicated, because we expect to be able to query them in various ways:
- Common:
- What are all the Links that a given BundleVersion is using?
- What Linked Bundles have been updated (have newer Bundle Versions)?
- Can I add this Link without forming a cycle (needed anytime we add a Link)?
- Reporting/Notifications:
- What Bundles are using my Bundle?
- What Versions of my Bundle are being used?
- How many Bundles use my Bundle?
Avoiding Link Cycles
There are a few problems with cycles:
- Infinite recursion when following links. This can be worked around by keeping a followed list and being mindful of the possibility, but as an unusual edge case, people would probably not account for it.
- Infinite version bumping. Say there is a cycle between A and B. Then when B is updated, A will have the option of updating its Link to the new version of B. But doing that will bump the version of A, and B will be prompted to update its Link to the new version of A.
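The "followed list" workaround for the first problem looks like the sketch below. It shows the defensive code every client would need if cycles were allowed; the `links_of` adjacency dictionary and the function name are hypothetical.

```python
def walk_links(start: str, links_of: dict[str, list[str]]) -> list[str]:
    """Visit every Bundle reachable from start, tolerating cycles."""
    seen = {start}
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        order.append(node)
        for target in links_of.get(node, []):
            if target not in seen:  # the followed list that prevents looping
                seen.add(target)
                stack.append(target)
    return order
```

Without the `seen` set, a cycle between A and B would keep this loop running forever, which is the edge case callers are likely to forget.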
Data Model
So with those constraints, the proposed design:
- per-BundleVersion file describing the entire set of Link dependencies (including dependency-of-dependencies)
- We'd probably want to cap it at some number of total dependencies, say 2,000 (some courses have over 500 videos).
- When one BundleVersion adds another, this file is the only thing that needs to be inspected, since we've captured all transitive dependencies.
- A table for Bundle Link relationships that has just enough information encoded in it to track basic usage and send notifications.
- (Borrowing Bundle ID, Latest Is Using, Lending Bundle ID)
- Does not encode the transitive dependencies.
- "Latest Is Using" means that the latest version of borrowing bundle is still using the lending bundle.
- So if I want to see what Bundles are using mine to see what needs notification, I query this table for Lending Bundle = My Bundle and LatestIsUsing == True.
- Notifications and queryable usage at the Bundle level are in the relational database.
- Cycle prevention and full dependency expansion happen in a file at the BundleVersion level (stored alongside other BundleVersion data).
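A sketch of the Bundle Link table and the notification query, using sqlite3 as a stand-in for the Django ORM so the example is self-contained. The table and column names are assumptions derived from the (Borrowing Bundle ID, Latest Is Using, Lending Bundle ID) tuple above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bundle_link (
        borrowing_bundle_id TEXT NOT NULL,
        lending_bundle_id   TEXT NOT NULL,
        latest_is_using     INTEGER NOT NULL,
        PRIMARY KEY (borrowing_bundle_id, lending_bundle_id)
    )
""")
conn.executemany(
    "INSERT INTO bundle_link VALUES (?, ?, ?)",
    [
        ("course_run_a", "wand_tutorial", 1),
        ("course_run_b", "wand_tutorial", 0),  # older versions used it, latest does not
        ("course_run_c", "wand_tutorial", 1),
        ("course_run_a", "the_perfect_egg", 1),
    ],
)


def bundles_to_notify(conn: sqlite3.Connection, lending_bundle_id: str) -> list[str]:
    """Bundles whose latest version still uses the given lending Bundle."""
    rows = conn.execute(
        "SELECT borrowing_bundle_id FROM bundle_link "
        "WHERE lending_bundle_id = ? AND latest_is_using = 1 "
        "ORDER BY borrowing_bundle_id",
        (lending_bundle_id,),
    )
    return [row[0] for row in rows]
```

One row per (borrowing, lending) pair keeps this table proportional to the number of Bundle relationships instead of the number of BundleVersions.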
Example of what a summary file for a BundleVersion might look like:
    {
      // Metafile format version
      "_meta": {
        "version": 1
      },
      // Information about the BundleVersion
      "info": {
        "id": "HqzGTIKeTyWcFn0hWpXKBA",
        "version": "1",
        "title": ""
      },
      // Map paths to source files
      "files": {
        "private": {
          "course.xml": "d41d8cd98f00b204e9800998ecf8427e"
        },
        "public": {
          "syllabus.pdf": "94287380b7700b204e9800998ecf8421"
        }
      },
      // Links to the Sequences we're using (after @ is version). In
      // the Bundle, these end up as sub-dirs of links/
      "links": {
        "week_001": "47skF26fRayxQ73j48oFmA@1",
        "week_002": "rSTYnCDSSAi_uaxuDKfWYw@1",
        "week_003": "TWoW-EYESzK5bJlEiRx2yQ@1",
        "week_004": "kY6RjNJuSWuia9wUa3D1zg@1",
        "week_005": "MFHKfOP0Qeuuj-Y96Koi8g@1",
        "week_006": "Jd9_H8UVSbeDOTlWsFYgAA@1",
        "week_007": "mz_g-lJXQCCTLlm7ous-LA@1",
        "week_008": "1U0Wymd4TUSS4lGUgTPWCw@1",
        "week_009": "i7oIlRWRTGqY07f81Xbdpw@1",
        "week_010": "urcqDvH6QXqAhD_j-XwbwQ@1"
      },
      // Each Link has associated dependencies for all the things it
      // depends on (including dependencies of dependencies). Our own
      // dependencies are the union of all our Link dependencies.
      //   * It's ok if different Links require different Versions of
      //     the same Bundle.
      //   * No dependency can be added if any version of this Bundle
      //     is listed as one of its dependencies.
      //
      // The goal is to make dependency calculation and cycle
      // detection very fast when trying to add or update a Link.
      "dependencies": {
        "47skF26fRayxQ73j48oFmA@1": [
          "ruSUc2xjQESeyq0fcu0QRw@2",
          "rcHOkLaoSH-VmTspp1xahA@7",
          "xcdnuGWcT_WoN0ueyULCkA@10"
        ]
        // + a lot more
      }
    }
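Given a summary in this shape, the cycle check and the dependency union described in its comments could be sketched as follows. It assumes the summary has already been parsed into a dictionary with the comments stripped; the function names are hypothetical.

```python
def bundle_id(ref: str) -> str:
    """Strip the '@version' suffix from a 'bundle_id@version' reference."""
    return ref.rsplit("@", 1)[0]


def can_add_link(own_id: str, target_ref: str, target_deps: list[str]) -> bool:
    """Reject a Link if any version of our own Bundle is the target or one of
    the target's (already transitive) dependencies."""
    involved = {bundle_id(ref) for ref in [target_ref, *target_deps]}
    return own_id not in involved


def all_dependencies(summary: dict) -> set[str]:
    """Our own dependencies: every Link target plus each Link's dependency list."""
    deps = set(summary.get("links", {}).values())
    for transitive in summary.get("dependencies", {}).values():
        deps.update(transitive)
    return deps
```

Because each summary already stores the full transitive list, the check is a set-membership test over at most the dependency cap, with no recursive fetches of other Bundles' summaries.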
I'm a bit surprised by the choice of files for storing content. My 2 cents:
The data structure should be chosen by taking into consideration what kind of read/write access will be required. For instance, file systems are not good at answering the question "what is the most recent file in this folder" (they don't have an index on dates). And, it seems to me that we are frequently going to have to make such queries, for instance to get the latest version of a content element.
Filesystems are bad at searching: will we have to rely on a grep-like tool (i.e., slow) when searching for content?
Separating data in two different storage systems (filesystem and SQL db) requires some synchronization, which is a hard problem.
Is it going to be a requirement for large Open edX platforms to have an object storage platform like S3 to store assets? Filesystems are bad at storing many files per folder; does it mean that object storage is the only way to go?
Files don't have schema: one of the current major issues with xblocks is that they are extremely difficult to migrate, whenever their definition changes. Backward compatibility becomes very hard to maintain. Files have the same problem.
(I am very new to the blockstore discussion, so it's quite possible my comments are completely irrelevant.)