This is a rough, speculative proposal for a direction the Blockstore implementation can take, and builds on the Harvard LabXchange Blockstore Proposal. There are going to be a lot of statements about what Blockstore is and isn't to establish the outline of this approach, but no part of this proposal has been accepted in any way.

General Themes

The high level ideas that ground this proposal (in no particular order) are:

...

Blockstore is a lower-level system that enables the storage, versioning, tagging, and re-use of content. Much of the intelligence required to work with that content lives outside of Blockstore itself. So for instance, Blockstore has no concept of Units, Sequences, or XBlocks (side note: the name Blockstore makes a little less sense in this proposal), beyond the notion that there's a "type" called "olx/unit".

Layers

Name	Responsibilities
Core	This is the storage of the content itself, and the essential mechanisms for updating it. This layer doesn't understand anything about the actual files in the Content Bundle. It covers: content data models (Content Bundles, Versions, Links, Drafts; side note: it's possible that Drafts can move out of Core); Roles; Signals.
Persistence	Low level swappable piece that determines how we store the files for bundles. Must support at least S3 and the local file system. Extension point. It'd be nice to use django-storages, but I'm not sure if it supports all the features we'll want, like setting headers dynamically (e.g. Content-Disposition). Graceful handling of large assets matters.
Plugin	This is the more extensible layer that manages the constellation of metadata about a Bundle. It should be easy to add these over time, possibly in a separate repo. This layer actually does understand the contents, and might subscribe to events for particular content types. Examples: Tagging; Search (the discovery aspect is interesting because XBlocks will exist in various Content Bundles at different granularities depending on the lifecycle/requirements, but I should still be able to ask "What are all the capa problems in my Collection?" regardless of whether they're standalone or in a Sequence); Webhook notifications; Licensing; Dispatch; Import hooks (per-type) for things like validation.
Execution (external process)	The XBlock runtime that actually executes content, probably at a Unit level. This lives in a separate process and needs to know how to grab content from Blockstore for the purpose of preview and authoring.

Blockstore Core Layer Concepts

Term	Definition
Content Bundle	A group of files that are versioned together and can be accessed from other Content Bundles. Blockstore stores a UUID and basic metadata about a Content Bundle, like a title, slug, and "type". UUID never changes. Slug must be unique within the Collection. Should be immutable...? Blockstore does not parse or understand the actual contents of the files in Content Bundles, though it may know how to delegate certain actions like preview and editing based on type.
Bundle Version	An immutable snapshot of a Content Bundle. Once created, the contents of the files in a given Bundle Version do not change. Metadata about a Version can change though, like tagging. That will often be asynchronously updated after a new Bundle Version has been created. Creating a new Bundle Version emits a signal that other parts of the system will listen for, similar to how course_publish works today. Importing creates a new Version directly (if any changes were made). It does not interact with Drafts. Versions and even entire Content Bundles can be force deleted if necessary to deal with copyright violations or other inappropriate content. This will likely break any Content Bundles that have Links to the deleted Content Bundle. Force deletion may not be possible for content that is copied/referenced across Blockstore instances, but that entire use case is still very fuzzy right now.
Link	The method by which one Content Bundle references and uses content from another. Since a Content Bundle is a collection of files, this is conceptually a symlink between one Content Bundle Version and another, via a special named folder (e.g. `.blockstore/links/wand_tutorial/videos/wand_demo.mp4`). The version number is not encoded into the symlink path. This is so that updating to the next Version doesn't require changing the path. A Link has: an alias (the name of the symlink, essentially) and a target BundleVersion. For now, all Links are assumed to be local to the server instance, but later we could add an optional server namespace field. Since a Bundle Version is immutable, updating a Link results in a new Version. Cycles are not allowed (bad things happen, like infinite version bumps). A Content Bundle Version cannot have Links to multiple Versions of another Content Bundle. So A.v1 can make a Link to B.v1 or B.v2, but A.v1 cannot simultaneously access files from both B.v1 and B.v2. Exporting multiple Content Bundles and preserving versioned Link information will likely require a smart client that knows how to pull down Content Bundle Versions to a shared namespace and then generate the necessary symlinks on disk. More on that later.
Version Range	A lot of annotation around a Content Bundle (e.g. tagging around teaching standards) is going to be about content that may change over time as new Versions are created. Associating them with the Content Bundle as a whole might be inaccurate. Associating them with specific Versions might be wasteful, particularly when the changes are relatively minor. Treating a Version Range as a first class concept would help to simplify data modeling in other parts of the system. Version Ranges need a start version, but the end version can be null (i.e. open-ended).
Draft	A mutable space for changes to be made before they are committed to a Bundle Version. There are no "draft" vs. "published" branches. Drafts get committed to Bundle Versions, and Bundle Versions keep increasing. Because versions are shared and immutable, there may be multiple Versions of a given Bundle that are live in different courses at any given point. Import creates new Bundle Versions directly and does not interact with Drafts. Under the covers, Drafts are copy-on-write. Edits of Drafts are done at the individual file level, so it's possible for two people to be concurrently editing different units in the same sequence, as long as the units are separate files.
Collection	1:M grouping of Content Bundles. There is no Collection-level versioning. It's a pointer to a bunch of Content Bundles which have their own versions. Ownership is captured at the Collection level. Permissions are determined at the Collection level. Licensing is determined at the Collection level. Mapping today's concepts: A Course would have most of its content in one Collection. Multiple runs of the same course would be in the same Collection. So perhaps it's more accurate to say that the contents of a CatalogCourse are in a Collection? A Content Library would map to a Collection. The Use Cases here are a bit different. Some of it overlaps with Discovery, but a lot would deal with management: what's outdated content, give me the list of all things that have property X, etc. Import/export could happen at the Collection level, but in practice we'd want to very strongly lean towards allowing subsets of Collections to be imported or exported. Questions: Would this be problematic for changing licensing information over time? Does it need to be fixed to particular Content Bundle Versions? Is there some notion of related Collections? A course might use a bunch of problems from a problem bank. They might be different Collections, but it seems useful to be able to associate them? Maybe that's overdoing it? How do we delete Collections? Safe if empty? Safe if none of the Content Bundles are referenced externally?
Signal	We'll emit named Django signals for life cycle events around the Core layer, including: Content Bundle creation and deletion; Version creation and deletion; Link creation and deletion (happens on Version creation); Collection creation and update (e.g. title, ownership, roles).
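
The Link rules in the table above (no cycles, and at most one Version of any target Bundle per source Version) can be illustrated with a small in-memory check. This is only a sketch under stated assumptions: `LinkGraph`, `add_link`, and the `(bundle_id, version)` tuples are hypothetical names, and checking cycles at the Bundle level (rather than per Version) is one possible reading of the "infinite version bumps" concern, not something the proposal specifies.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a Bundle Version: (bundle_id, version_number).
BundleVersion = tuple[str, int]


@dataclass
class LinkGraph:
    # Maps each source BundleVersion to its Links: alias -> target BundleVersion.
    links: dict[BundleVersion, dict[str, BundleVersion]] = field(default_factory=dict)

    def _reaches(self, start: BundleVersion, bundle_id: str) -> bool:
        """Return True if following Links from `start` hits any Version of `bundle_id`."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node[0] == bundle_id:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.links.get(node, {}).values())
        return False

    def add_link(self, source: BundleVersion, alias: str, target: BundleVersion) -> None:
        existing = self.links.get(source, {})
        # Rule: a source Version may not link to two Versions of the same Bundle.
        for other_alias, other in existing.items():
            if other_alias != alias and other[0] == target[0] and other != target:
                raise ValueError(f"{source} already links to {other}")
        # Rule: no cycles (assumption: checked at the Bundle level).
        if self._reaches(target, source[0]):
            raise ValueError(f"Link {source} -> {target} would create a cycle")
        self.links.setdefault(source, {})[alias] = target
```

In a real implementation this graph would be built while creating a new Bundle Version, since a Version is immutable once it exists.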

Higher Level Concepts

These concepts exist for the systems that author content in Blockstore, but Blockstore itself is unaware of them.

Term	Definition
Learning Context	A grouping against which student state is stored. General state is stored via a (context, student, block) tuple. If a student sees the same problem block in multiple places within a context, their state (answer, saved work) carries over. If a student sees the same problem block in a different context, their state (answers, saved work) does not carry over. Courses and Pathways reference Learning Contexts but they are not Learning Contexts themselves. The common case will be that each Pathway will use a separate Learning Context, but there may be situations in which you'll want multiple Pathways to use the same LearningContext. Open Issues: The case of XBlock state seems straightforward, but does all state really follow this (e.g. completion, scoring, gating, etc.)?
Block	The smallest piece of content that can be authored, this maps to a leaf-node XBlock ("Component" in Studio) in edx-platform today.
Unit	An ordered list of Blocks that is the smallest chunk in which content can be consumed. A single page worth of content. Maps to a VerticalBlock today, but that name is horrible – we should either create a new Unit Block or make it an alias of Vertical (that's really what it should have been named in the first place). Making a new Block might allow us to drop support for some legacy cruft. While Blocks can be almost anything, Units are expected to have some common fields like title, possibly an icon, a URL that can be independently rendered, etc.
Sequence	Sequences are a linear group of Units. For existing courses, Sequences are largely static, though there are exceptions in randomized problems, A/B testing, and the adaptive use cases.
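
The (context, student, block) keying described for Learning Contexts can be sketched as follows. `StateStore` and its method names are hypothetical and exist only to show the lookup rule; real student state storage would live outside Blockstore.

```python
from typing import Any


class StateStore:
    """Toy student-state store keyed by (context, student, block)."""

    def __init__(self) -> None:
        self._state: dict[tuple[str, str, str], dict[str, Any]] = {}

    def save(self, context_id: str, student_id: str, block_id: str,
             state: dict[str, Any]) -> None:
        # Copy so later changes to the caller's dict don't leak into the store.
        self._state[(context_id, student_id, block_id)] = dict(state)

    def load(self, context_id: str, student_id: str, block_id: str) -> dict[str, Any]:
        # An unseen (context, student, block) combination starts with empty state.
        return dict(self._state.get((context_id, student_id, block_id), {}))


store = StateStore()
store.save("pathway-1", "student-42", "quicksort_problem", {"answer": "O(n log n)"})

# Same block, same context: the saved answer carries over.
same_context = store.load("pathway-1", "student-42", "quicksort_problem")
# Same block, different context: state does not carry over.
other_context = store.load("pathway-2", "student-42", "quicksort_problem")
```

Pathways that should share state would simply be given the same context identifier.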

Why Files?

This proposal leans on the file system metaphor more strongly than the initial proposal. Some advantages to this:

We can unify content storage and lifecycle updates.
Studio currently stores authored course content in three different ways. Data that manifests as content- and settings-scoped XBlock field data is stored in the ModuleStore, as sort-of versioned documents in MongoDB. Smaller binary assets like images and PDF files are stored without versioning via the ContentStore interface, which writes to GridFS. For performance reasons, self-hosted videos are typically put in S3. Treating these in a more uniform way will help when building add-on functionality like search, tagging, and licensing. It should also lower cost and operational complexity.
It simplifies the overall design.
Blockstore offers the capability to store, version, and reuse content. The more it knows about the internals of that content, the more coupled it will be, making it hard to adapt and make changes. For example, adding a major feature like internationalization of OLX should require no changes to the core of Blockstore. Files also help with the partial sharing use cases (more on that in the Content Reuse section).
Assets can get surprisingly sophisticated.
Splitting the world into sophisticated XBlock-like data and simple binary blob assets is intuitive, but many assets are more sophisticated than they appear. A Video can be represented by a single mp4 file, or it can be a set of HLS files with different quality video encodings, multiple audio tracks, and subtitles for many languages. An ebook may come in three different formats that are grouped, licensed, and updated in sync with each other. Even simple assets can become more complex when internationalization comes into play.
It leads to more author-friendly grouping for reuse.
OLX and static assets can live side by side in the same versioned Bundle, without requiring any external links. If there are precursor files like source LaTeX files that compile out to OLX, those can be stored in the same Bundle. They would be ignored by the XBlock runtime, but still be very valuable to store, version, and share.
We need a serialization format for import and export anyway.
Import and export are going to be a critical part of the supported workflow. This lets Blockstore be agnostic to OLX conventions that are handled at the XBlock runtime layer. The only file conventions it imposes are its own simple notions of Bundle metadata and Links.

Using Blockstore to Model Courseware

So what would the Content Bundles look like?

Content Bundle File Conventions

Blockstore exposes Bundle-level metadata as a .blockstore directory.
- This folder is a fully virtual folder (nothing actually exists there on S3, and it is optionally materialized on export).
Metadata is exported as JSON files.
All Links to other Content Bundles will be of the form .blockstore/links/{alias}
- Link mapping is stored in the .blockstore/links.json file.
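
As a rough illustration of these conventions, the sketch below resolves a path under `.blockstore/links/{alias}` to a target Bundle Version. The shape of `links.json` is not defined in this proposal, so the `bundle_uuid` and `version` keys, the placeholder UUID string, and the `resolve_link_path` function are all assumptions made for the example.

```python
import json
from pathlib import PurePosixPath

LINKS_PREFIX = (".blockstore", "links")


def resolve_link_path(path: str, links_json: str):
    """Map a Bundle-relative path to (bundle_uuid, version, path_in_target).

    Returns None when the path is an ordinary file in this Bundle rather than
    something under .blockstore/links/{alias}/.
    """
    parts = PurePosixPath(path).parts
    if parts[:2] != LINKS_PREFIX or len(parts) < 3:
        return None
    alias, rest = parts[2], parts[3:]
    # Hypothetical links.json shape: {alias: {"bundle_uuid": ..., "version": ...}}
    target = json.loads(links_json)[alias]  # raises KeyError for unknown aliases
    return target["bundle_uuid"], target["version"], "/".join(rest)


# Placeholder identifier, not a real UUID.
links_v1 = json.dumps({"wand_tutorial": {"bundle_uuid": "uuid-of-wand-tutorial", "version": 1}})
links_v2 = json.dumps({"wand_tutorial": {"bundle_uuid": "uuid-of-wand-tutorial", "version": 2}})

path = ".blockstore/links/wand_tutorial/videos/wand_demo.mp4"
before = resolve_link_path(path, links_v1)
after = resolve_link_path(path, links_v2)
```

Because the version lives only in the mapping, moving the Link to the next Version changes the resolved target while `path` itself stays the same.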

Standalone Problem (leaf-level XBlock, appropriate for problem banks)

quicksort_complexity/
                     # Actual definition of the OLX for the problem.
                     problem.xml
 
                     # Convention: Files in static/ will be available to the browser during
                     # execution. For security reasons, we wouldn't just serve them straight
                     # from Blockstore, but the LMS will need to know what's okay to send to
                     # the browser vs. what might have solutions encoded into it.
                     static/         
                            diagram.png
 
                     # Blockstore metadata
                     .blockstore/
                                 info.json   # UUID, instance, title
                                 links.json  # Empty dictionary -- no links here

Video (centrally managed, probably in a separate Collection)

the_perfect_egg/
               # VideoModule OLX
               video.xml
               static/
                      hls/
                          playlist.m3u8
                          
                          # Dozens of alternate languages and encodings would
                          # be in the following directories:
                          audio/
                          subs/
                          video/

               links/
                     wand_tutorial_part_1/video.xml
                     wand_tutorial_part_2_feathers/video.xml
 
               # Blockstore metadata
               .blockstore/
                           info.json   # UUID, instance, title
                           links.json  # Empty dictionary -- no links here

Static Sequence (either standalone pathway or part of a Course)

Course Run Example

The core part of this would be some sort of document that has navigation information and provides pointers to all the different sequences that make up a course. For example, you could have a Content Bundle that defined an XML file that looks something like:

...

Sequences are separate entities; Chapters (and other hierarchy between Course and Sequences) are details of Course.
In this particular mindset, "chapters" don't really exist as a separate entity, but only as a navigational convenience of Course. Everything is sequences and things that point to sequences. If a Course wants a flat list of sequences or a hierarchy an extra level deep – that's purely a Course Navigation concern. Unlike today, you wouldn't expect to be able to call a student_view on a chapter (that doesn't really work well in practice, but the framework currently allows for it).
Course navigation isn't an XBlock, and might not even be OLX.
For content compatibility reasons, it's really important that the individual Units be OLX. Maybe the Sequences as well, though I think there's a decent case to be made either way there (and even if it is OLX, it doesn't necessarily have to be backed by an XBlock runtime). But the things-that-point-to-sequences don't necessarily need to be OLX. The data model for courses/chapters is a combination of a simple container hierarchy that can be translated to in a straightforward way, and a mess of global policy attributes that need to be moved out. It should have some reasonably human-readable serialized format, but whether that's XML, JSON, YAML, or something else that better fits our "X + overrides" use cases is still an open question.
This is not the LMS data representation.
The way sequences are referenced here would be terribly inefficient if we had to make read calls to each sequence to get the titles to display. When we actually get to the representation in the LMS, we might want a uniform way to specify sequences that enables much more dynamic behavior by relying on queries of sequence relationships. But Blockstore is for authoring, and surfacing whatever is simplest and most intuitive for that.

Content Reuse

Content Bundles are created with the intended granularity of reuse. If you intend to have a bunch of problems for a problem bank, a Bundle is an individual problem. If you have a sequence that can be reused in multiple contexts, then the Bundle is your entire sequence. To use either of these, you would make a Link to it and reference the appropriate linked OLX file from a file in your own Bundle (e.g. a CourseRun definition referencing the above Sequence).

...

Reuse as designed for by the original author is done by referencing Bundles, and reuse in ways permitted but not designed for by the original author is done by referencing specific files within Bundles.

Implementation Details

A non-exhaustive stab at some of the key implementation issues.

Storage

We have to be able to handle files that run from very small to extremely large. The metadata for Bundles will use the Django ORM, but the storage interface for Bundle files will be a pluggable backend that supports at least S3 and the local filesystem. Using django-storages would be nice, but I'm not entirely sure it supports all the features we'd want (e.g. "Content-Disposition" header support so we could reuse the same blob of data with different names).
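
To make the "pluggable backend" idea concrete, here is a minimal sketch of what such an interface might look like, with a local filesystem implementation. Everything here is an assumption for illustration: the `BundleFileStorage` protocol, its method names, and the choice to key blobs by SHA-256 hash (which is one way to reuse the same blob of data under different names). An S3 implementation would satisfy the same protocol, and a production version would stream large assets rather than hold them in memory.

```python
import hashlib
from pathlib import Path
from typing import Protocol


class BundleFileStorage(Protocol):
    """Hypothetical interface each storage backend would implement."""

    def save(self, data: bytes) -> str: ...
    def read(self, key: str) -> bytes: ...
    def response_headers(self, key: str, filename: str) -> dict[str, str]: ...


class LocalFileStorage:
    """Local filesystem backend that stores each blob once, keyed by content hash."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        path = self.root / key
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return key

    def read(self, key: str) -> bytes:
        # Reads everything into memory; fine for a sketch, not for huge videos.
        return (self.root / key).read_bytes()

    def response_headers(self, key: str, filename: str) -> dict[str, str]:
        # The blob has no name of its own; the name is supplied per request.
        # Real code would need to escape or encode unusual filenames.
        return {"Content-Disposition": f'attachment; filename="{filename}"'}
```

With this shape, two Bundles that contain the same image under different file names would share one stored blob and differ only in the header sent when serving it.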

...

Q: What should be done in a situation where an asset was marked as public in an earlier version but private in a later version (or vice versa)?

Import/Export

Simple import and export of a single Bundle can be done with a tar.gz or zip file. But import and export that involves multiple Bundles at a time would benefit from a command line tool that could talk to our API – particularly if it wants to export a whole Course's worth of Bundles and preserve Links to other Bundles (sometimes even multiple versions of the same Bundles).
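
The single-Bundle case can be sketched with Python's standard `tarfile` module. The function names and the in-memory `{path: bytes}` representation of a Bundle are assumptions made for the example; the multi-Bundle, Link-preserving case described above would need the smarter client and is not covered here.

```python
import io
import tarfile


def export_bundle(files: dict[str, bytes]) -> bytes:
    """Pack a Bundle's files ({relative_path: content}) into a tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in sorted(files):  # sorted for a stable member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def import_bundle(archive: bytes) -> dict[str, bytes]:
    """Read a tar.gz archive back into {relative_path: content}.

    Contents are read into memory instead of being extracted to disk, which
    sidesteps unsafe member paths in untrusted archives.
    """
    files: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files


bundle = {
    "problem.xml": b"<problem/>",
    "static/diagram.png": b"\x89PNG...",
    ".blockstore/links.json": b"{}",
}
round_tripped = import_bundle(export_bundle(bundle))
```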

Scale and Schema

Current edx.org has approximately:

...

If we assume these limits, then naively creating rows on a per-BundleVersion basis will quickly explode the tables for Links and Files beyond what we want. One approach around this is to more smartly collapse redundant information across Bundle Versions, and another is to take the data out of the database entirely and into the file store.

BundleVersionFiles

Access patterns:

Common:
- Get URL for single file in a BundleVersion.
- Get names/URLs for all files for a given BundleVersion.
Less frequent:
- What files changed in this BundleVersion?
- What is the history of this file across all BundleVersions?
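
To make these access patterns concrete, the sketch below assumes each BundleVersion has a manifest that maps file paths to content hashes. This is a hypothetical illustration, not necessarily the storage approach this proposal settles on; `build_manifest`, `changed_files`, and `file_history` are made-up names.

```python
import hashlib

# Hypothetical per-BundleVersion manifest: {relative_path: sha256 hex digest}.
Manifest = dict[str, str]


def build_manifest(files: dict[str, bytes]) -> Manifest:
    """Snapshot a set of files as path -> content hash."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_files(old: Manifest, new: Manifest) -> dict[str, list[str]]:
    """Answer "what files changed in this BundleVersion?" by diffing two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


def file_history(manifests: list[Manifest], path: str) -> list[int]:
    """Return the 1-based version numbers at which `path` was added, changed, or removed."""
    history: list[int] = []
    previous = None
    for number, manifest in enumerate(manifests, start=1):
        current = manifest.get(path)
        if current != previous:
            history.append(number)
        previous = current
    return history


v1 = build_manifest({"problem.xml": b"<problem/>", "static/diagram.png": b"old"})
v2 = build_manifest({"problem.xml": b"<problem/>", "static/diagram.png": b"new",
                     "static/hint.png": b"hint"})
diff = changed_files(v1, v2)
```

Listing all files for a BundleVersion is a single manifest read in this model, while file history requires walking every manifest, which matches the split between common and less frequent access patterns.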

...

A drawback here is that it's not easy to track things at a per-file level. On the bright side, it's really simple to implement and understand.

Links

Link relationships are more complicated, because we expect to be able to query them in various ways:

...
