Harvard LabXchange Blockstore Proposal

This is the first of several Blockstore draft proposals, the most recent and current of which is at Blockstore Design.


Development of this proposal is being funded by Harvard LabXchange and the Amgen Foundation, with significant in-kind contributions from edX.

This is an early draft, with lots of unsolved problems, and everything subject (and likely) to change. Feedback is very welcome as we shape this to ensure it meets the current and future needs of the Open edX community.

Abstract

All lesson content in the Open edX platform is currently stored in the modulestore, which requires that all content is organized into “courses” that are each a directed acyclic graph (DAG) of XBlocks/XModules (or in “libraries” which are implemented in the same way as courses, but which have a shallower graph and support a limited set of content types).

This proposal outlines a design for a new service that stores content for the Open edX platform, called “Blockstore.” Blockstore is meant to be a lower-level service than the modulestore, and it is designed around the concept of storing small, atomic “units” of content, rather than large, fixed content structures such as courses. In other systems and academic contexts, such units are often called “learning objects,” and Blockstore is thus a type of Learning Object Repository (LOR). For Open edX, Blockstore is designed to facilitate a much greater level of content re-use than is currently possible, enable new adaptive learning features, and enable delivery of learning content in new ways (not just large traditional courses).

Motivation

The Open edX platform currently has very limited support for content reuse and for adaptive learning, both of which are frequently-requested features. The “content libraries” feature was intended to facilitate greater content re-use, but it supports a limited subset of content types and has minimal functionality, due in part to the challenges of implementing a “library” of content on top of a system (split mongo modulestore) that was only designed to store courses.

Requirements for Blockstore are:

  • Authorship and licensing (e.g. Creative Commons vs. all rights reserved, etc.) of content stored in Blockstore must be clear
  • Must support content immutability, or smart handling of attribution (if I import a component that was authored by HarvardX, but then modify it, I mustn’t [be able to] represent it as being pure “HarvardX” content anymore)
  • Must allow tagging of content, probably with taxonomies as a first-class entity that can be linked to content
  • Must support draft-publish workflows
  • Must support both public and private content (private content can only be seen/used by authorized users)
  • Allow a user to re-engage with the same content multiple times, and allow each engagement to have its own user-state. (Needed especially for spaced repetition.)

Design goals for Blockstore are:

  1. To allow easy re-use of content among courses and teams on the same Open edX instance
  2. To enable easier collaboration among specialized teams - for example, a video production team can focus on creating videos and making them available for reuse in multiple courses, in a video collection that they manage themselves.
  3. To provide a foundation for building out advanced adaptive learning features
  4. To be a simple and flexible system for storing content. This implies that Blockstore must:
    1. Have a narrow scope of responsibilities
    2. Offer a clearly defined API
    3. Have predictable behavior
    4. Make as few assumptions as possible about how its content will be used
  5. To re-use important architectural elements of the current Open edX platform - specifically, to support storing content as XBlocks that are defined using OLX.
  6. To make small, potentially independent, units of content discoverable and reusable (e.g. an interactive exercise can be re-used in multiple courses)
  7. To support any existing type of XBlock/component (except non-leaf components like “chapter” which are not required to be supported)

“Nice to have” goals for Blockstore are:

  1. To allow easy re-use of content among Open edX instances (federated content)
  2. To ensure that all OLX it stores is valid (can be parsed and used by an XBlock runtime without errors).
  3. To provide an alternative content file storage mechanism that can eventually supersede GridFS.
  4. To provide a foundation for creation of “custom courses” that can supersede the CCX feature.

Non-goals for Blockstore are:

  1. To be a 1:1 replacement for the modulestore. Courses are currently stored in the modulestore, and Blockstore is not meant to store data about courses or entire course trees.
  2. To implement a distributed version control system (DVCS). Git already does this well, and git can be (and already is) used for development and deployment of Open edX course content. Content authoring teams who want advanced DVCS features should embrace git, and store content in git repositories containing OLX. Consequently, Blockstore will not implement “forking” of content, merging, etc.
  3. To provide search functionality. It is expected that some external service will index Blockstore content and facilitate discovery of content via search, filtered browsing, etc.

Proposed Architecture

Core Concepts

The core concepts of Blockstore are Collections, Units, Files, Links, and ContentSets.

Collections

A “Collection” is the key organizing concept of the blockstore. Every other “thing” (piece of content, file, etc.) in the blockstore is owned by exactly one Collection. Collections can also contain references to “things” that are owned by other Collections, even Collections on different Open edX instances; however, the rule that every “thing” is owned by exactly one Collection is always respected.

Typical examples of Collections might be:

  • A collection containing all the content (XBlocks, PDFs, etc.) for a particular course
  • A collection of all videos produced by a certain university’s video production team
  • A collection of Grade 12 physics problems, tagged according to their alignment with the Next Generation Science Standards
  • A collection containing a few interactive animations showing how DNA replicates
  • A collection containing a user’s favourite pieces of content, which were published by other content authors on the platform

Analogy: Blockstore is not a filesystem. But if you thought of Blockstore as a filesystem, then “Collections” would be the root folders of the filesystem (e.g. /biology101, /physics8-problem-library, /harvard-videos, /bobs-content). Blockstore would also not allow subfolders beyond that, and would support symlinks but not hard links. In lieu of anything analogous to subfolders, detailed organization of content in Blockstore is achieved by the fact that Collections and everything they contain can be tagged and discovered via search.

A Collection serves the following purposes:

  • A Collection makes ownership and permissions clear. Everything else in the blockstore is owned by exactly one collection, so it is clear who owns any given thing.
  • Permissions and content licensing are defined at the collection level, so the permissions (public read-only, public read + create derivative works, access restricted to a particular group, private, etc.) and licensing (CC-BY-SA, All Rights Reserved, etc.) of any given thing in the Blockstore are always clear.
  • Collections provide a basic, high-level mechanism for organizing things. Imagine the Studio homepage, which shows the list of courses and libraries you have access to; likewise, some future UI could show a list of Collections that you have worked on or “favorited”.

A Collection consists of the following data:

  • ID - globally unique identifier (probably UUID)
  • Name
  • Description
  • Owner - who (person or organization) legally owns any original content in this Collection
  • Author - who is credited for authorship of original content this Collection contains, if different
  • Permissions - ACL that defines what users/groups have permission to view and/or edit its content.
  • License - what content license is applied to content in this Collection.
    • For ease of content reuse internationally, for any content that’s not “All Rights Reserved” or “Public Domain”, the use of a Creative Commons license should be encouraged.
    • Blockstore does not enforce licensing, but high-level software that uses the Blockstore may wish to - for example, prohibit “View OLX Source” for units that are not under an open license, and provide shortcuts for copy-on-write modification of units that are.
  • Tags (tags are not stored within Blockstore, nor is Blockstore aware of them, but Collections must be taggable via some external tagging service)
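
As a rough illustration of the fields listed above, a Collection could be modelled along these lines. This is only a sketch: the class, the field types, the simplified ACL (a mapping from user/group name to a set of actions), and the "public" pseudo-group are all assumptions made for the example, not a proposed schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical in-memory model of a Collection (not a real schema)."""
    name: str
    description: str
    owner: str    # who legally owns original content in this Collection
    license: str  # e.g. "CC-BY-SA" or "All Rights Reserved"
    author: str = ""  # only set if different from the owner
    # Simplified ACL (assumption): user or group name -> set of allowed actions.
    permissions: dict = field(default_factory=dict)
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def can(self, who: str, action: str) -> bool:
        """Return True if `who`, or the "public" pseudo-group, may do `action`."""
        allowed = self.permissions.get(who, set()) | self.permissions.get("public", set())
        return action in allowed


videos = Collection(
    name="Harvard videos",
    description="Videos produced by the video production team",
    owner="HarvardX",
    license="CC-BY-SA",
    permissions={"public": {"read"}, "video-team": {"read", "edit"}},
)
```

Tags are deliberately absent from the sketch, since they would live in an external tagging service keyed by the Collection ID.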

Collections can be imported and exported, much like course OLX tarballs are today.

Content from Collections with public read permissions can also be shared across Open edX instances directly, without any import/export process - see “Links and Borrowing” below.

TODO: Are Collections versioned and do they have draft/publish features, or are only the items within them versioned/draft-publish? (If we use the ContentSet concept described below, which is versioned, then Collections probably don’t need to be versioned as well.)

To Solve: Given that tags are not stored within Blockstore: “If someone adds a tag to a Unit in a Collection, has that Collection changed? Because one view is that a Collection is these basic building blocks, and that other things reference data in that Collection. Or we could see the Collection as a kind of holding space where different services can push data about this content (even if they have their own optimized stores for processing).” (One option here is to not store external app data in Blockstore, but to allow external apps like the tagging app to notify Blockstore of changes like tags being added to a Unit, and Blockstore will make the appropriate notifications to other apps and lifecycle updates.)

Units

A “Unit” is a very lightweight container that holds some OLX that defines an atomic unit of learning intended to be consumed as a singular learning experience (consisting of one or more XBlocks). A Unit is typically the smallest “chunk” of content that could be useful independently.

A Unit is roughly equivalent to a “vertical” in the platform today, and its root element may in fact be a <vertical>, though we may also create a new “unit” XBlock type to have less overhead and legacy code involved, and more semantically useful OLX. (Especially given that verticals are not displayed vertically in the native mobile apps.)

Every unit is represented by the following information:

  • UUID
    • Rationale for this being globally unique and not just unique within a Collection is in the “Linkable” section below
    • n.b. This Unit ID would not be a “usage key” (as defined by XBlock runtimes), since this is a dumb content store with no context, i.e. no associated course at this point. A usage key would somehow combine a Course or other context with this Unit’s ID and Collection ID. Or, for Units that are displayed to users outside of the context of any course or other sequence, a usage key would be created by combining the “global context” identifier with the Collection ID and Unit ID.
  • Collection ID (which Collection owns the Unit)
  • Tags (tags are not stored within Blockstore, nor is blockstore aware of them, but Units must be taggable via some external tagging service)
  • Version number, Version history, Published/draft status (see “Linkable” below)
  • OLX Content: OLX (which needs to be defined via a schema and which here is also restricted to some safe[r] subset of XML) defining one or more XBlocks that comprise this unit.

    Example:

    <unit>
       <html>Please answer this question</html>
       <problem id="p1">
           <multiplechoiceresponse>...</multiplechoiceresponse>
       </problem>
    </unit>

OLX content:

All XBlock Scope.content, Scope.settings, and Scope.children fields for the blocks are defined via this OLX blob, and no distinction is made between settings and content fields. Overriding Scope.settings fields for particular use cases is not done at the Blockstore level; it is the responsibility of whatever higher-level software is composing units together into a “course” or other context.

Blockstore does not “understand” this OLX, but it does use an external OLX library to validate that it is valid OLX (and perhaps to inject unique IDs for each anonymous XBlock node in the XML that represents a block that has user state? See next paragraph.)

To Solve: How do we identify XBlocks within each Unit’s OLX? We want some sort of stable IDs for the child blocks in the XML, so that if the Unit has been used by real users, the user_state field data can survive edits to the XML such as insertion of a new <problem>...</problem> at the start of the unit.

  • Possible option: If OLX is imported/created in the blockstore without an ID for each XBlock, one could be auto-generated at that time, with the default attempting to be semantically meaningful (e.g. <problem>...</problem> becomes <problem id="problem-1">...</problem>). Note that XBlock itself doesn’t support defining IDs in the XML, so this would require some additional runtime-level support.
  • Alternately: When editing a Unit in Studio, if any IDs are changed, a warning would be shown to the user that the change could affect saved student state.
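
The first option above could be sketched as follows, using Python's standard xml.etree.ElementTree. The function name, the "<tag>-<n>" naming scheme, and the decision to treat every direct child of the Unit root as a block are assumptions for illustration; a real implementation would need an OLX library to know which elements are actually XBlocks.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def assign_block_ids(olx: str) -> str:
    """Give each direct child of the Unit root an `id` if it lacks one."""
    root = ET.fromstring(olx)
    # IDs already present anywhere in the Unit must never be reused.
    used = {el.get("id") for el in root.iter() if el.get("id")}
    counters = Counter()
    for child in root:
        if child.get("id"):
            continue  # existing IDs are left untouched, so user state survives
        # Try "<tag>-1", "<tag>-2", ... until an unused ID is found.
        while True:
            counters[child.tag] += 1
            candidate = f"{child.tag}-{counters[child.tag]}"
            if candidate not in used:
                break
        child.set("id", candidate)
        used.add(candidate)
    return ET.tostring(root, encoding="unicode")


unit_olx = '<unit><html>Please answer</html><problem id="problem-1"/><problem/></unit>'
with_ids = assign_block_ids(unit_olx)
```

Because generated IDs are written back into the stored OLX, inserting a new <problem> at the start of the Unit later only assigns an ID to the new block; the existing blocks keep theirs.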

If content authors want to define hyperlinks to other content or to reference a file in the OLX, they must first create a Link to the other Unit/Resource, which will create a slug. They can then use that slug to refer to the Unit or file in question. For details on how this works, see “Links and Borrowing” below.

Unit versioning:

Units are versioned with a sequential version number, and each version is immutable (so it will be possible to easily refer to changes, e.g. “version 12 of the Unit fixed the problem where the multiple choice answer was incorrect.”). Each unit also defines a “published” version, to support a draft-publish workflow.

Deleting a unit creates a new version that is marked as deleted. When a unit is updated or deleted and that change is published, the changes are propagated optionally: the author of any higher-level structure (e.g. a course) containing the block should see an indicator and can choose to accept the update(s) or not. If a unit is deleted for legal reasons or due to content policy violations, it can be “force deleted”, in which case all uses of it are deleted without giving the authors a choice (though they would be notified so they can find replacement content).
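
A minimal sketch of the versioning rules described above, assuming a simple incrementing version number. All class and method names here are hypothetical, and "force delete" is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class UnitVersion:
    """One immutable version of a Unit."""
    number: int
    olx: str
    deleted: bool = False


@dataclass
class VersionedUnit:
    versions: List[UnitVersion] = field(default_factory=list)
    published: Optional[int] = None  # published version number, if any

    def _append(self, olx: str, deleted: bool = False) -> UnitVersion:
        version = UnitVersion(len(self.versions) + 1, olx, deleted)
        self.versions.append(version)
        return version

    def save_draft(self, olx: str) -> UnitVersion:
        """Editing never mutates a version; it appends a new one."""
        return self._append(olx)

    def delete(self) -> UnitVersion:
        """Deleting keeps history and appends a version marked as deleted."""
        return self._append("", deleted=True)

    def publish(self, number: Optional[int] = None) -> None:
        """Publish the given version, or the latest one by default."""
        if not self.versions:
            raise ValueError("nothing to publish")
        self.published = self.versions[-1].number if number is None else number

    def has_update_for(self, pinned: int) -> bool:
        """True if a consumer pinned to `pinned` could opt in to a newer published version."""
        return self.published is not None and self.published > pinned
```

A course that uses version 1 of a Unit would call something like has_update_for(1) to decide whether to show the "update available" indicator; accepting the update is then the course author's choice.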

Files (Resources)

A File is a blob of binary content used by one or more Units. A typical file would be an image used in an HTML Unit, or a PDF with course handout materials.

The actual blobs with File content are stored on some external system, such as Amazon S3, or (to allow efficient importing of existing content) GridFS.

Files can be created/uploaded directly into a Collection, but cannot be used until they are “linked” to one or more Units - see “Links and Borrowing”.

Links and Borrowing

A Link is a reference to a Unit, File, or ContentSet (described below; together these are referred to as “Linkable” objects, also described in a later section). A Link is always owned by a Unit or a ContentSet (deleting the Unit or ContentSet that owns the Link will delete the Link).

Links are used by Blockstore to manage content lifecycles (facilitate garbage collection, especially of Files), implement versioning, and implement scoped human-readable identifiers (“slugs”).

Example: An author is creating a Unit containing some HTML/text, and the author wants to include an image. The author uploads the image into the same Collection, which stores the image as a File. Now, in order to be able to reference that image from the Unit, the author creates a Link from the Unit to the newly uploaded File. The author specifies a slug for the image like “author-photo.jpg”, and can then include HTML like
<img src="./author-photo.jpg"/>
in the OLX. When the content is rendered, the runtime will know how to resolve that URL to the actual URL of the image File, which would likely be an S3 URL.

Note: This is tricky in practice, because we usually want to access the raw value (‘./author-photo.jpg’) when the XBlock is displaying its Studio UI, or otherwise allowing authors to enter a value, but we want to convert it to a real HTTPS URL in all other cases. Currently, there's no good API for this and the situation is a mess, with XBlocks having to use awful code like this: https://github.com/edx-solutions/xblock-drag-and-drop-v2/blob/26a8c5c5edb65beae24707353432654b50e7a230/drag_and_drop_v2/drag_and_drop_v2.py#L826-L843 . We may need to define a new XBlock Field type for "Resource URL", which can either be any HTTPS URL or a resource ID, and the XBlock will always explicitly request either the raw value for editing or the full URL, for use in rendering/previewing the content.
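
To make the "raw value vs. full URL" distinction concrete, here is a small sketch of what a runtime helper might do. The function, its parameters, and the plain dict of slugs to File URLs are assumptions for illustration; this is not an existing XBlock API.

```python
def resolve_resource(raw_value: str, links: dict, for_editing: bool = False) -> str:
    """Return the raw value for editing, or a full URL for rendering.

    `links` is a hypothetical mapping of Link slugs to absolute File URLs.
    """
    if for_editing:
        return raw_value  # Studio shows exactly what the author typed
    if raw_value.startswith("https://"):
        return raw_value  # already a full URL, nothing to resolve
    # Treat "./author-photo.jpg" and "author-photo.jpg" as the same slug.
    slug = raw_value[2:] if raw_value.startswith("./") else raw_value
    if slug not in links:
        raise LookupError(f"No Link with slug {slug!r}")
    return links[slug]


links = {"author-photo.jpg": "https://files.example.com/abc123.jpg"}
rendered = resolve_resource("./author-photo.jpg", links)
editable = resolve_resource("./author-photo.jpg", links, for_editing=True)
```

The caller always states explicitly which form it wants, which is the property the proposed "Resource URL" field type is meant to guarantee.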

Example: An author is building a course over many months and has a Collection containing a large number of PDFs and images used in the course content. Thanks to Blockstore’s Links, the content management UI that the author is using can display whether or not each PDF or image is in use, and by what Units. The author can easily find and delete old files that are not needed in the current version of the course, and can replace files with newer versions, being confident in knowing how and where those files are used.

Links are represented by the following information:

  1. ID (unique within the collection)
  2. Owner (Unit or ContentSet; cannot be null)
  3. Target (Linkable, i.e. a Unit, a File, or a ContentSet)
  4. Slug (optional, a human-readable, URL-friendly identifier used to refer to the target)
  5. Version (integer indicating which version of the target is in use; all Linkables are versioned)
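
The Link fields above, and the "which files are still in use?" example, could be sketched like this. The dataclass and the helper function are hypothetical, and IDs are shown as plain strings for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Link:
    id: int       # unique within the Collection
    owner: str    # ID of the owning Unit or ContentSet (never null)
    target: str   # ID of the target Linkable (Unit, File, or ContentSet)
    version: int  # which version of the target is in use
    slug: Optional[str] = None  # optional human-readable identifier


def unused_files(file_ids: Iterable[str], links: Iterable[Link]) -> Set[str]:
    """Return the IDs of Files that no Link points to."""
    targeted = {link.target for link in links}
    return {file_id for file_id in file_ids if file_id not in targeted}


links = [Link(id=1, owner="unit-a", target="file-1", version=1, slug="author-photo.jpg")]
orphans = unused_files(["file-1", "file-2"], links)
```

A content management UI could use a query like this to list Files that are safe to delete, which is also the information Blockstore would need for garbage collection.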

Borrowing:

Links always refer to data within the Collection, but authors can seemingly create Links to any Linkable (e.g. Unit or File) from other Collections - in that case, Blockstore will “borrow” the Linkable from the Collection that owns it, creating a local copy of that Linkable (e.g. Unit) within the Collection. This borrowing even allows authors to use content from other Open edX instances, if it’s public - by creating a Link to that content, blockstore will fetch the content from the remote instance’s Blockstore and cache the “borrowed” copy of the Linkable within the Collection.
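
One way to picture borrowing is as a fetch-once cache inside the Collection. In this sketch the class, its attributes, and the `fetch` callable (standing in for a lookup in another Collection or a remote Blockstore) are all assumptions.

```python
from typing import Callable, Dict


class BorrowingCollection:
    """Hypothetical Collection that caches Linkables owned elsewhere."""

    def __init__(self, collection_id: str, fetch: Callable[[str], dict]):
        self.collection_id = collection_id
        self._fetch = fetch  # looks up a Linkable by UUID in its owning Collection
        self.owned: Dict[str, dict] = {}
        self.borrowed: Dict[str, dict] = {}

    def resolve_target(self, linkable_id: str) -> dict:
        """Return local data for a Link target, borrowing it on first use."""
        if linkable_id in self.owned:
            return self.owned[linkable_id]
        if linkable_id not in self.borrowed:
            # Store a copy, so the borrowed data is a local cache.
            self.borrowed[linkable_id] = dict(self._fetch(linkable_id))
        return self.borrowed[linkable_id]
```

Note that the borrowed copy still records its original owning Collection, so the rule that every “thing” has exactly one owner holds; globally unique UUIDs keep borrowed and local objects from colliding.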

Alternative design: If the main purpose of "borrowing" is to enable cross-instance content use, it might be better to scrap the concept of "borrowing" and just implement Collection replication, where any external Collection can be mirrored (in full) on the local Open edX instance. Then Links can be cross-collection locally, including to mirrored/replicated external Collections, but there's no need for "borrowing" units into a Collection.

To solve: Are Links mutable or immutable? If a Unit has a Link to some file, and a new version of the Unit has no Link or a different Link, how is that handled? Are Links owned by not just a Unit/etc. but a particular version of a Unit?

To solve: How are Links and Files represented in a Collection’s OLX? Previous versions of OLX have not had the concept of explicit Links.

Idea: A link can be strong or weak. A strong link means that if whatever owns the link is imported into a collection, then the object of the link gets imported too. A weak link means that if the thing the link points to happens to be in the same collection, the link is available, and if not, it’s not. Weak links would be useful for Units that contain text like “For more details on Combustion, see the Combustion Chapter” - someone who imports that unit doesn’t necessarily want to import the whole Combustion chapter as well. (Except Blockstore doesn’t support any structures higher than Units, so one can’t link to a chapter. Maybe this isn’t useful.)

Linkable

Units, Files, and ContentSets have much in common - they all have a name, are versioned, are owned by a Collection, and can be the target of a Link. Thus all three share a common interface, which is that they are Linkable objects.

A Linkable is a Unit, a File, or a ContentSet.

Every Linkable has these fields:

  • UUID (globally unique, so that borrowed Units/Files/ContentSets never conflict with each other or with existing Collection-local objects)
  • Collection ID (which Collection owns the Unit/File/ContentSet)
  • Tags (tags are not stored within Blockstore, nor is blockstore aware of them, but all Linkables must be taggable via some external tagging service)
  • Version number (TBD: simple incrementing integer, or major+minor version integers - it has been argued that it's important for authors to distinguish between major breaking changes that users want to opt into, and minor typo fixes which won't affect grading etc. and which can be safely pushed out to all users of the content without needing them to choose if they want the update.)
  • Version history (Linkables are immutable, so changes create a new version with an incremented version number, but the past versions are preserved.)
  • Published/draft status

Further Concepts

The following concepts are not yet well developed.

ContentSet

A ContentSet is a collection of Links. ContentSets represent all the content that may be used for some particular application, such as a course, or a short playlist of interactive exercises. The main purpose of a ContentSet is to indicate that the objects it links to are in use for some use case. An alternative name for ContentSet could be UseCase.

Each Link in a ContentSet can refer to any Linkable object, i.e. a Unit, a File, or another ContentSet (note: infinitely recursive structures could exist, where ContentSet A contains a link to ContentSet B, and vice versa, but that’s not considered a problem, because Blockstore doesn’t “follow” links indefinitely; it just ensures that every unique Linkable that is referenced is either owned by the Collection or “borrowed” into the Collection.)

ContentSets are useful for managing resource lifecycles, modelling structures like courses or playlists, and for efficient notifications. Applications built on top of Blockstore can request notifications from Blockstore whenever any item in a ContentSet changes. (Such notifications will follow all Links - e.g. one could be triggered even for changes to a Unit linked to a ContentSet linked to a ContentSet linked to a ContentSet linked to the one ContentSet that’s explicitly being watched.)
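
A sketch of how such notifications might be computed, assuming the Links are available as a mapping from each owner to the IDs it links to. The function and its arguments are hypothetical. The visited set is what keeps cyclic ContentSets (A links to B, B links to A) from causing an infinite loop.

```python
from collections import deque
from typing import Dict, Set


def watchers_to_notify(changed: str, links: Dict[str, Set[str]], watched: Set[str]) -> Set[str]:
    """Return the watched ContentSets that reach `changed` via any chain of Links.

    `links` maps an owner ID to the IDs of the Linkables it links to.
    """
    # Invert the graph so we can walk from the changed item up to its owners.
    owners: Dict[str, Set[str]] = {}
    for owner, targets in links.items():
        for target in targets:
            owners.setdefault(target, set()).add(owner)

    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for owner in owners.get(current, ()):
            if owner not in seen:  # already-visited owners are skipped, so cycles are harmless
                seen.add(owner)
                queue.append(owner)
    return seen & watched


links = {
    "course": {"chapter"},
    "chapter": {"unit-1", "course"},  # deliberately cyclic
    "other": {"unit-2"},
}
to_notify = watchers_to_notify("unit-1", links, watched={"course", "other"})
```

Here a change to "unit-1" notifies the watcher of "course" but not the watcher of "other", even though "course" and "chapter" link to each other.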

ContentSets are extremely flexible and have many applications. Blockstore itself doesn’t try to understand what ContentSets mean - their use is left entirely to other applications. But some examples are:

  • A physics teacher searches all publicly available Open edX content and finds several videos about the second law of thermodynamics, as well as an interactive animation that explains entropy. She creates a “Playlist” consisting of three video Units and an HTML Unit that contains the animation.
    • The “playlist” application internally creates a Collection containing one ContentSet to represent this playlist. The ContentSet contains four Links (one to each Unit). Separately, the playlist application records the order of the links.
    • If any of the Units used in the playlist get changed or deleted, the Playlist app will receive a notification and can choose to update the Playlist by updating the Links in the ContentSet or not.
  • A typical Open edX course tree could be modelled using ContentSets plus metadata: A root ContentSet would contain Links to each “chapter” ContentSet, which would each contain Links to each “subsection” ContentSet, which would each contain Links to each Unit.
    • This is not saying courses should be modelled that way, but is pointing out the flexibility. In particular, courses modelled as ContentSets could be cyclic graphs as far as Blockstore cares, potentially even worse than today’s course DAGs in terms of edge cases.
  • A fully adaptive course may not define a hierarchy of content, but rather consist of a ContentSet containing a vast pool of content, and a machine learning-based adaptive algorithm that constructs a unique “pathway” through the content for each learner. (Learner 1 sees Intro > What are Objects > Subclassing > Test, while Learner 2 sees Intro > Advanced Topics > Advanced Inheritance in Python > Test > Advanced Test).

TODO: Is this sort of model useful, or do we really just need a way to indicate that certain content is in use for some application, without any attempt to model structure?

TODO: Should ContentSets support (optional) ordering of the links they contain? It seems like an extremely common use case, and would make a lot of things (like modelling playlists and course trees with ContentSets) cleaner. If ordering etc. is always kept externally, then we may not need the ContentSet concept at all.

Alternative design: To simplify the system, we could disallow ContentSets from containing Links to other ContentSets. Applications which use Blockstore and want to be able to model hierarchies or “playlists of playlists” would have to model more of their data externally and either subscribe separately to every ContentSet they use, or create a “flat” ContentSet representing all the items in the hierarchy/playlist, and subscribe to changes on that.

Alternative design: A ContentSet is not that dissimilar from a Collection. We could simplify the design by eliminating ContentSets entirely and allowing Collections to own Links (to Units, Files, and possibly other Collections). Conceptually this would be similar to what the ContentSet concept offers, except that it would result in more Collections, which potentially means more redundant data (each Collection must separately have its defined owner/author/license/permissions/tags, whereas with ContentSets, that data can be set only once for a Collection that contains hundreds of lightweight ContentSets) and which may be less efficient. The gain in simplicity may be worth this tradeoff, however.

Publisher

TODO: Content in Collections is legally owned by some entity (person or organization). We need some way to model that owning entity in the system, and to do so in a way that is compatible with reusing content from remote Open edX instances.

TODO: Do we need this model to distinguish between authors, owners, and publishers (and what brand name/logo is displayed)? Or can we consolidate those into one “organization” concept?

TODO: Do we need this model to reflect organizational hierarchies? (e.g. reflect that Harvard Kennedy School and Harvard Law School are both schools of Harvard, and even allow sub-organizations within those schools to own, manage, and brand content separately?)


Old concepts (TODO: clean up and add in results of further discussions):

A publisher is a name (person, organization, or brand) to whom content is attributed. It is an evolution of the “organization” concept that currently exists on the platform, but extended to be meaningful across Open edX instances (federation / content import-export) and to recognize that some content authors may be individuals.


A publisher would be a lightweight concept represented by the following data:

  • Name: human-readable name of the publisher (e.g. “HarvardX”)
  • Logo: logo of the publisher, if applicable
  • ID string: equivalent to the “organization” currently used in Open edX, with no spaces (e.g. “HarvardX”)
  • Home page URL
  • IRI: unique, valid identifier (e.g. “https://harvardx.harvard.edu/olx”). This would be the primary key for each publisher and is the only piece of information used in the blockstore itself.

In the future, the IRI above could be the address of some standardized JSON/XML data that any Open edX instance can use to download metadata about the publisher, such as:

  • Name, logo, ID string, home page URL
  • Public Key (future extension): a public key that can be used to verify content signatures, to check if content really was authored by this publisher. Not considered a priority until someone has a use case though.


All content in the blockstore is owned by a publisher, canonically identified by its IRI (although internally a simple integer foreign key would be used). For example, a content author at HarvardX creates content on edx.org and exports it. The exported data will identify the publisher as “https://harvardx.harvard.edu/olx”. When importing it to a blockstore on any other instance, if the instance doesn’t have any HarvardX content yet, then it could load some XML/JSON publisher details from that URL to fill in its local database with the details (logo, name, public key) of the publisher.


Todo: it would be possible to have different IRIs but identical ID strings, and much of the platform currently assumes the ID string (“HarvardX”) is a unique key too. Would that be a problem in practice?

Taxonomies

TODO

A set of possible tags that can be applied to content. Should these be owned by Blockstore or external? We want to avoid having competing tagging systems, and when importing content from a remote instance, or importing a Collection from a tarball, it’s important to import the taxonomies as well.

API

TODO: Detail what sort of requests one can ask of Blockstore, and what it will return.

  • CRUD for Collections, Units, Files, ContentSets, and Links.
  • Subscribe to changes on a ContentSet or Collection.
  • Notify Blockstore that external data (e.g. tags) about a Unit, File, ContentSet, Collection, Link, etc. has changed, so Blockstore can pass on the notification to subscribers.
  • Register hooks to transform OLX on export or receive notifications on OLX import.

Implementation

TODO: Blockstore will be implemented as an independent Django app, with a RESTful API.

Applications and Development Path

TODO: expand

  1. Implement Blockstore
  2. Implement import of content from existing courses
  3. Implement “Randomized Content Block v2” that can draw from Blockstore, not Libraries
  4. Allow editing Blockstore content from Studio
  5. Allow using Blockstore content in modulestore courses
  6. Replace CCX feature with Blockstore-based custom courses, and allow them to be used in the LMS, with no modulestore involved.

Rationale

TODO: The rationale adds to the specification by describing the events or requirements that led to the proposal, what influenced the design, and why particular design decisions were made. The rationale could provide evidence of consensus within the community and discuss important objections or concerns raised during discussion. It could identify any related work, for example, how the feature is supported in other systems.

Rejected alternatives

Collections as saved queries

Instead of “Collections”, one could imagine an instance-wide pool of content, with some sort of tagging/searching/filtering based “collections,” where each “collection” is a saved query that provides a filtered view into the shared pool of content. That approach was rejected because it creates a lot more complexity for users (steeper learning curve, more complex mental model is required), and makes permissions/ownership more complex. Of course, some way of finding/grouping content using saved search queries could be added on at any time, and is not exclusive with the proposed design.


Units contain XBlocks but are not XBlocks/OLX

A previous version of this proposal suggested that Units were a list of XBlocks, and not XBlocks themselves. It was decided that defining Units via OLX, where the unit is the root element, was more useful and more consistent with existing platform patterns.

Backward Compatibility

TODO

This will mostly be built and deployed in parallel with existing systems, so it should be fairly backward compatible.