This is a rough, speculative proposal for a direction the Blockstore implementation can take, and builds on the Harvard LabXchange Blockstore Proposal. There are going to be a lot of statements about what Blockstore is and isn't to establish the outline of this approach, but no part of this proposal has been accepted in any way.
General Themes
The high level ideas that ground this proposal (in no particular order) are:
...
Blockstore is a lower-level system that enables the storage, versioning, tagging, and re-use of content. Much of the intelligence required to work with that content lives outside of Blockstore itself. So for instance, Blockstore has no concept of Units, Sequences, or XBlocks (side note: the name Blockstore makes a little less sense in this proposal), beyond the notion that there's a "type" called "olx/unit".
Layers
Name | Responsibilities |
---|---|
Core | This is the storage of the content itself, and essential mechanisms for updating it. This layer doesn't understand anything about the actual files in the Content Bundle. |
Persistence | Low level swappable piece that determines how we store the files for bundles. |
Plugin | This is the more extensible layer that manages the constellation of metadata about a Bundle. It should be easy to add these over time, possibly in a separate repo. This layer actually does understand the contents, and might subscribe to events for particular content types. |
Execution (external process) | The XBlock runtime that actually executes content, probably at a Unit level. This lives in a separate process and needs to know how to grab content from Blockstore for the purpose of preview and authoring. |
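As a rough illustration of how a Plugin might subscribe to events for particular content types, here is a minimal, dependency-free sketch. The `PluginRegistry` class, its method names, and the handler signature are all hypothetical; the proposal intends Django signals for this, and the registry below is only a stand-in to show the routing idea. The `olx/unit` type string is the one mentioned earlier in this document.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for the event mechanism; the proposal itself
# intends named Django signals, which are not used here.
Handler = Callable[[str, str], None]  # (bundle_id, path) -> None


class PluginRegistry:
    """Routes Core-layer events to plugins by content type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, content_type: str, handler: Handler) -> None:
        self._handlers[content_type].append(handler)

    def publish(self, content_type: str, bundle_id: str, path: str) -> int:
        """Call every handler for content_type; return how many ran."""
        handlers = self._handlers.get(content_type, [])
        for handler in handlers:
            handler(bundle_id, path)
        return len(handlers)


registry = PluginRegistry()
indexed: List[str] = []

# A hypothetical search plugin that only cares about "olx/unit" content.
registry.subscribe("olx/unit", lambda bundle_id, path: indexed.append(f"{bundle_id}:{path}"))

registry.publish("olx/unit", "bundle-1", "unit.xml")       # handled by the plugin
registry.publish("video/mp4", "bundle-1", "static/a.mp4")  # no subscribers
```

The Core layer only needs to know the type string; everything that interprets the content stays in the plugin.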
Blockstore Core Layer Concepts
Term | Definition |
---|---|
Content Bundle | A group of files that are versioned together and can be accessed from other Content Bundles. |
Bundle Version | An immutable snapshot of a Content Bundle. |
Link | The method by which one Content Bundle references and uses content from another. Since a Content Bundle is a collection of files, this is conceptually a symlink between one Content Bundle Version and another, via a special named folder (e.g. `.blockstore/links/wand_tutorial/videos/wand_demo.mp4`). |
Version Range | A lot of annotation around a Content Bundle (e.g. tagging around teaching standards) is going to be about content that may change over time as new Versions are created. Associating them with the Content Bundle as a whole might be inaccurate. Associating them with specific Versions might be wasteful, particularly when the changes are relatively minor. Treating a Version Range as a first-class concept would help to simplify data modeling in other parts of the system. |
Draft | A mutable space for changes to be made before they are committed to a Bundle Version. |
Collection | 1:M grouping of Content Bundles.
Questions:
|
Signal | We'll emit named Django signals for life cycle events around the Core layer, including:
|
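To make the relationship between Draft, Bundle Version, and Version Range more concrete, here is a minimal in-memory sketch. All class, field, and method names are hypothetical, and real Bundles would keep metadata in the Django ORM and files in a storage backend rather than in dictionaries.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Mapping, Optional


@dataclass(frozen=True)
class BundleVersion:
    """Immutable snapshot: a version number plus a read-only path -> bytes map."""
    version_num: int
    files: Mapping[str, bytes]


@dataclass
class Draft:
    """Mutable staging area for changes that are not yet committed."""
    files: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class ContentBundle:
    versions: List[BundleVersion] = field(default_factory=list)
    draft: Draft = field(default_factory=Draft)

    def commit_draft(self) -> BundleVersion:
        """Freeze the current draft contents into a new BundleVersion."""
        snapshot = MappingProxyType(dict(self.draft.files))  # copy, then make read-only
        version = BundleVersion(len(self.versions) + 1, snapshot)
        self.versions.append(version)
        return version


@dataclass(frozen=True)
class VersionRange:
    """Inclusive range of version numbers; last=None means the range is still open."""
    first: int
    last: Optional[int] = None

    def contains(self, version_num: int) -> bool:
        return version_num >= self.first and (self.last is None or version_num <= self.last)


bundle = ContentBundle()
bundle.draft.files["problem.xml"] = b"<problem/>"
v1 = bundle.commit_draft()
bundle.draft.files["problem.xml"] = b"<problem display_name='Quicksort'/>"
v2 = bundle.commit_draft()

# A tag attached once to versions 1..2 rather than to each version individually.
standards_tag_range = VersionRange(first=1, last=2)
```

Editing the draft after a commit leaves earlier versions untouched, which is the property the "immutable snapshot" definition asks for.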
Higher Level Concepts
These concepts exist for the systems that author content in Blockstore, but Blockstore itself is unaware of them.
Term | Definition |
---|---|
Learning Context | A grouping against which student state is stored.
Open Issues:
|
Block | The smallest piece of content that can be authored, this maps to a leaf-node XBlock ("Component" in Studio) in edx-platform today. |
Unit | An ordered list of Blocks that is the smallest chunk in which content can be consumed. |
Sequence | Sequences are a linear group of Units. |
Why Files?
This proposal leans on the file system metaphor more strongly than the initial proposal. Some advantages to this:
- We can unify content storage and lifecycle updates.
  Studio currently stores authored course content in three different ways. Data that manifests as content- and settings-scoped XBlock field data is stored in the ModuleStore, as sort-of versioned documents in MongoDB. Smaller binary assets like images and PDF files are stored without versioning via the ContentStore interface, which writes to GridFS. For performance reasons, self-hosted videos are typically put in S3. Treating these in a more uniform way will help when building add-on functionality like search, tagging, and licensing. It should also lower cost and operational complexity.
- It simplifies the overall design.
  Blockstore offers the capability to store, version, and reuse content. The more it knows about the internals of that content, the more coupled it will be, making it hard to adapt and make changes. For example, adding a major feature like internationalization of OLX should require no changes to the core of Blockstore. Files also help with the partial sharing use cases (more on that in the Content Reuse section).
- Assets can get surprisingly sophisticated.
  Splitting the world into sophisticated XBlock-like data and simple binary blob assets is intuitive, but many assets are more sophisticated than they appear. A Video can be represented by a single mp4 file, or it can be a set of HLS files with different quality video encodings, multiple audio tracks, and subtitles for many languages. An ebook may come in three different formats that are grouped, licensed, and updated in sync with each other. Even simple assets can become more complex when internationalization comes into play.
- It leads to more author-friendly grouping for reuse.
  OLX and static assets can live side by side in the same versioned Bundle, without requiring any external links. If there are precursor files like source LaTeX files that compile out to OLX, those can be stored in the same Bundle. They would be ignored by the XBlock runtime, but still be very valuable to store, version, and share.
- We need a serialization format for import and export anyway.
  Import and export are going to be a critical part of the supported workflow. Using files as that format lets Blockstore be agnostic to OLX conventions, which are handled at the XBlock runtime layer. The only file conventions it imposes are its own simple notions of Bundle metadata and Links.
Using Blockstore to Model Courseware
So what would the Content Bundles look like?
Content Bundle File Conventions
- Blockstore exposes Bundle-level metadata as a .blockstore directory.
- This folder is fully virtual (nothing actually exists there on S3, and it is optionally materialized on export).
- Metadata is exported as JSON files.
- All Links to other Content Bundles will be of the form .blockstore/links/{alias}
- The Link mapping is stored in the .blockstore/links.json file.
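The link convention above could be resolved roughly as follows. This is only a sketch: the schema of `links.json` is not specified in this proposal, so the `bundle_uuid` and `version` keys, the `ResolvedLink` type, and the `resolve_link` function are all assumptions made for illustration.

```python
import json
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

LINKS_PREFIX = (".blockstore", "links")


class ResolvedLink(NamedTuple):
    bundle_uuid: str
    version: int
    path: str  # path inside the linked Bundle


def resolve_link(path: str, links: dict) -> Optional[ResolvedLink]:
    """Return the link target for a .blockstore/links/{alias}/... path, else None."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[:2] != LINKS_PREFIX:
        return None  # an ordinary file in this Bundle, not a Link
    alias, rest = parts[2], parts[3:]
    target = links[alias]  # raises KeyError if the alias was never declared
    return ResolvedLink(target["bundle_uuid"], target["version"], "/".join(rest))


# Hypothetical links.json content; the real schema is not specified here.
links = json.loads("""
{"wand_tutorial": {"bundle_uuid": "00000000-0000-0000-0000-000000000001", "version": 3}}
""")

resolved = resolve_link(".blockstore/links/wand_tutorial/videos/wand_demo.mp4", links)
```

Pinning each alias to a specific version in the mapping is one way to get the "symlink between one Content Bundle Version and another" behavior described in the Link definition.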
Standalone Problem (leaf-level XBlock, appropriate for problem banks)
    quicksort_complexity/
        # Actual definition of the OLX for the problem.
        problem.xml

        # Convention: Files in static/ will be available to the browser during
        # execution. For security reasons, we wouldn't just serve them straight
        # from Blockstore, but the LMS will need to know what's okay to send to
        # the browser vs. what might have solutions encoded into it.
        static/
            diagram.png

        # Blockstore metadata
        .blockstore/
            info.json    # UUID, instance, title
            links.json   # Empty dictionary -- no links here
Video (centrally managed, probably in a separate Collection)
    the_perfect_egg/
        # VideoModule OLX
        video.xml

        static/
            hls/
                playlist.m3u8
                # Dozens of alternate languages and encodings would
                # be in the following directories:
                audio/
                subs/
                video/

        links/
            wand_tutorial_part_1/video.xml
            wand_tutorial_part_2_feathers/video.xml

        # Blockstore metadata
        .blockstore/
            info.json    # UUID, instance, title
            links.json   # Empty dictionary -- no links here
Static Sequence (either standalone pathway or part of a Course)
Course Run Example
The core part of this would be some sort of document that has navigation information and provides pointers to all the different sequences that make up a course. For example, you could have a Content Bundle that defines an XML file that looks something like:
...
- Sequences are separate entities; Chapters (and other hierarchy between Course and Sequences) are details of Course.
  In this particular mindset, "chapters" don't really exist as a separate entity, but only as a navigational convenience of Course. Everything is sequences and things that point to sequences. If a Course wants a flat list of sequences or a hierarchy an extra level deep, that's purely a Course Navigation concern. Unlike today, you wouldn't expect to be able to call a student_view on a chapter (that doesn't really work well in practice, but the framework currently allows for it).
- Course navigation isn't an XBlock, and might not even be OLX.
  For content compatibility reasons, it's really important that the individual Units be OLX. Maybe the Sequences as well, though I think there's a decent case to be made either way there (and even if it is OLX, it doesn't necessarily have to be backed by an XBlock runtime). But the things-that-point-to-sequences don't necessarily need to be OLX. The data model for courses/chapters is a combination of a simple container hierarchy that is straightforward to translate, and a mess of global policy attributes that need to be moved out. It should have some reasonably human-readable serialized format, but whether that's XML, JSON, YAML, or something else that better fits our "X + overrides" use cases is still an open question.
- This is not the LMS data representation.
  The way sequences are referenced here would be terribly inefficient if we had to make read calls to each sequence to get the titles to display. When we actually get to the representation in the LMS, we might want a uniform way to specify sequences that enables much more dynamic behavior by relying on queries of sequence relationships. But Blockstore is for authoring, and should surface whatever is simplest and most intuitive for that.
Content Reuse
Content Bundles are created with the intended granularity of reuse. If you intend to have a bunch of problems for a problem bank, a Bundle is an individual problem. If you have a sequence that can be reused in multiple contexts, then the Bundle is your entire sequence. To use either of these, you would make a Link to it and reference the appropriate linked OLX file from a file in your own Bundle (e.g. a CourseRun definition referencing the above Sequence).
...
Reuse that the original author designed for is done by referencing Bundles. Reuse that is permitted, but that the original author did not design for, is done by referencing specific files within Bundles.
Implementation Details
A non-exhaustive stab at some of the key implementation issues.
Storage
We have to be able to handle files that run from very small to extremely large. The metadata for Bundles will use the Django ORM, but the storage interface for Bundle files will be a pluggable backend that supports at least S3 and the local filesystem. Using django-storages would be nice, but I'm not entirely sure it supports all the features we'd want (e.g. "Content-Disposition" header support so we could reuse the same blob of data with different names).
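One possible shape for the pluggable backend is sketched below, using only the local filesystem. The `BundleFileBackend` interface and its method names are hypothetical, and storing blobs under a content hash is an assumption made here to illustrate reusing the same blob of data under different names; it is not something this proposal has settled on. An S3 backend would implement the same two methods.

```python
import hashlib
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BundleFileBackend(ABC):
    """Hypothetical storage interface; an S3 backend would implement the same methods."""

    @abstractmethod
    def save(self, data: bytes) -> str:
        """Store a blob and return a key that identifies it."""

    @abstractmethod
    def open(self, key: str) -> bytes:
        """Return the blob stored under key."""


class LocalFilesystemBackend(BundleFileBackend):
    """Stores each blob once, under the SHA-256 hex digest of its contents."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def save(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        path = self.root / key
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return key

    def open(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


backend = LocalFilesystemBackend(Path(tempfile.mkdtemp()))

# Two file names in a Bundle can point at the same stored blob.
manifest = {
    "static/diagram.png": backend.save(b"fake image bytes"),
    "static/diagram_copy.png": backend.save(b"fake image bytes"),
}
```

Because the stored key carries no file name, the name a client sees has to come from somewhere else at download time, which is where something like "Content-Disposition" support would matter.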
...
Q: What should be done in a situation where an asset was marked as public in an earlier version but private in a later version (or vice versa)?
Import/Export
Simple import and export of a single Bundle can be done with a tar.gz or zip file. But import and export that involves multiple Bundles at a time would benefit from a command line tool that could talk to our API – particularly if it wants to export a whole Course's worth of Bundles and preserve Links to other Bundles (sometimes even multiple versions of the same Bundles).
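For the single-Bundle case, a tar.gz export that materializes the `.blockstore/` metadata could look roughly like this. The `export_bundle` and `import_bundle` functions are hypothetical, the in-memory dictionaries stand in for real storage, and the `info.json` content is a placeholder.

```python
import io
import json
import tarfile
from typing import Dict


def export_bundle(name: str, files: Dict[str, bytes], info: dict, links: dict) -> bytes:
    """Pack one Bundle's files plus materialized .blockstore/ metadata into a tar.gz."""
    members = dict(files)
    members[".blockstore/info.json"] = json.dumps(info, indent=2).encode("utf-8")
    members[".blockstore/links.json"] = json.dumps(links, indent=2).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path, data in sorted(members.items()):
            entry = tarfile.TarInfo(name=f"{name}/{path}")
            entry.size = len(data)
            archive.addfile(entry, io.BytesIO(data))
    return buffer.getvalue()


def import_bundle(archive_bytes: bytes) -> Dict[str, bytes]:
    """Read every regular file back out of an exported archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


archive_bytes = export_bundle(
    "quicksort_complexity",
    files={"problem.xml": b"<problem/>", "static/diagram.png": b"fake image bytes"},
    info={"title": "Quicksort Complexity"},  # placeholder metadata
    links={},  # no links in this Bundle
)
restored = import_bundle(archive_bytes)
```

A multi-Bundle export would additionally need to follow the entries in `links.json`, which is the part that a command line tool talking to the API would handle.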
Scale and Schema
Current edx.org has approximately:
...
If we assume these limits, then naively creating rows on a per-BundleVersion basis will quickly explode the tables for Links and Files beyond what we want. One approach around this is to more smartly collapse redundant information across Bundle Versions, and another is to take the data out of the database entirely and into the file store.
BundleVersionFiles
Access patterns:
- Common:
  - Get URL for single file in a BundleVersion.
  - Get names/URLs for all files for a given BundleVersion.
- Less frequent:
  - What files changed in this BundleVersion?
  - What is the history of this file across all BundleVersions?
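As an illustration of the "what files changed" query, the sketch below compares two per-version file listings. Representing each BundleVersion as a mapping from file path to content hash is an assumption made for this example, and `diff_manifests` and `FileChanges` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set

Manifest = Dict[str, str]  # file path -> content hash (hypothetical representation)


class FileChanges(NamedTuple):
    added: Set[str]
    removed: Set[str]
    modified: Set[str]


def diff_manifests(previous: Manifest, current: Manifest) -> FileChanges:
    """Compare two BundleVersion manifests by path and content hash."""
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    modified = {
        path for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    }
    return FileChanges(set(added), set(removed), modified)


v1 = {"problem.xml": "hash-a", "static/diagram.png": "hash-b"}
v2 = {"problem.xml": "hash-c", "static/diagram.png": "hash-b", "static/hint.png": "hash-d"}

changes = diff_manifests(v1, v2)
```

Answering the file-history question the same way means walking every version's listing in order, which is fine for the less frequent queries but would be too slow for the common ones.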
...
A drawback here is that it's not easy to track things at a per-file level. On the bright side, it's really simple to implement and understand.
Links
Link relationships are more complicated, because we expect to be able to query them in various ways:
...