Blockstore Implementation Proposal

This is an implementation proposal for Blockstore. The focus will be on defining a separation of concerns for Blockstore as an independently deployable application, and how its concerns should manifest. Where some functionality is delegated to an external service, a brief explanation of how the external component provides that service in the context of Blockstore is given.

Note: There are obvious kinks in this approach, but its primary purpose for now is to help us understand how we could approach a purely database-based Blockstore with assets still kept in a third-party object storage service.

Blockstore Concerns

Blockstore will be fully dedicated to storing, referencing, managing, and allowing the discovery of Collections, Units, Assets, Links, and Versions. Each of these models has its own section below explaining its purpose and role in Blockstore, as well as its implementation plan.

There is no dedicated section describing what Blockstore should not do, because that is simply everything outside of the above and what is to follow.

Blockstore Models

For reference, we group Units and Assets into "Linkables", and those plus Collections and Links into "Versionables".

Collection

A Collection is an abstraction over a group of linkables to provide common metadata to that group.

Note that the fields in a Collection are mostly metadata. Many of them could arguably be foreign keys to separate models that give more detail for each field, but in this first iteration of an implementation draft we simplify things and use only CharFields; the actual implementation can add extra models as necessary.

Collections are versioned and thus derive from Version.

Name	Type	Description
`uuid`	`UUIDField`	Universally unique identifier for the `Collection`.
`name`	`CharField`	A descriptive title for the `Collection`.
`description`	`CharField`	A summary of what this `Collection` contains.
`owner`	`CharField`	Who legally owns any original content in this `Collection`.
`author`	`CharField`	Who is credited for authorship of original content in this `Collection`.
`license`	`CharField`	The license bound to all content in this `Collection`.

A notably missing field here is permissions, which was shown in the original proposal. It is left out because it wasn't obvious how this field could be properly populated, given that Blockstore will have no concept of auth (see the Security section below). It could technically be a mapping of usernames and groups to permissions, e.g. an ACL, but those values would mean nothing to Blockstore and would be better handled by a higher-level component that works with the result of composition done by the Compositor service (see Compositor Service below).

Unit

The fundamental unit in Blockstore is literally "Unit" as originally proposed.¹

Units are versioned and thus derive from Version.

Name	Type	Description
`uuid`	`UUIDField`	Universally unique identifier for the `Unit`.
`collection`	`ForeignKey(Collection)`	The `Collection` that this `Unit` belongs to.
`content`	`BinaryField`	OLX content for this `Unit`.
`links`	`ManyToManyField(ContentType, through="Link")`²	All of the `Assets` that are linked to this `Unit` through `Link`.

(1): Although this may clash with existing edX Studio vocabulary, we use it because it is the best fit for what it refers to in practice. To prevent confusion and be more complete, one can say "Blockstore Unit"; here and internally to Blockstore, however, we stick to "Unit".

(2): I have no clue if this even works, but you get the idea.

Asset

Resources that aren't OLX, that are needed inside the content defined by OLX, and that have a valid, reachable URL can be added to Blockstore as Assets.

Assets are versioned and thus derive from Version.

Name	Type	Description
`uuid`	`UUIDField`	Universally unique identifier for the `Asset`.
`collection`	`ForeignKey(Collection)`	The `Collection` that this `Asset` belongs to.
`file`	`FileField`	The file that backs this `Asset`.

Link

A single unit may require multiple assets in it, and a single asset may be used by multiple units. A Link is an intermediate model which facilitates the logic and metadata needed for the many-to-many relationship between units and assets.

Note that since Links are versioned, new Links would have to be created every single time a Unit is updated, for every single target that is linked to that Unit. This could be quite stressful on the database and take up a lot of rows.

Name	Type	Description
`uuid`	`UUIDField`	Universally unique identifier for the `Link`.
`source`	`ForeignKey(Unit)`	The `Unit` portion of this link.
`target`	`GenericForeignKey`	The target portion of the link, which may be an `Asset` or a `Unit`.

Version

A Version is an abstract model to be used by all "versionable" models for common fields and interfaces to versioning information.

Name	Type	Description
`timestamp`	`DateTimeField(auto_now_add=True)`	The time this version of the content was created.
`version_id`	`PositiveIntegerField`	A number identifying the version of the content in a history. When `NULL`, this version is a draft.
`garbage_collect`	`BooleanField`	Whether this content's old versions should be garbage collected after some time.
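
To make the draft semantics of `version_id` concrete, here is a minimal, framework-free sketch in plain Python (deliberately not the Django model itself). `VersionInfo` and `publish` are hypothetical names, and the rule that published ids count up from 1 is an assumption for illustration; the proposal only states that a `NULL` `version_id` marks a draft.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class VersionInfo:
    """Stand-in for the fields of the abstract Version model (hypothetical)."""

    # Mirrors DateTimeField(auto_now_add=True): set once, at creation.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # None plays the role of NULL: this version is a draft.
    version_id: Optional[int] = None
    garbage_collect: bool = False

    @property
    def is_draft(self) -> bool:
        return self.version_id is None


def publish(draft: VersionInfo, history: List[VersionInfo]) -> VersionInfo:
    """Return a published copy of `draft` carrying the next free version_id.

    Assumption: published ids count up from 1 within a single history.
    """
    if not draft.is_draft:
        raise ValueError("only drafts can be published")
    published_ids = [v.version_id for v in history if v.version_id is not None]
    next_id = max(published_ids, default=0) + 1
    # The record is immutable, so publishing yields a new object.
    return replace(draft, version_id=next_id)
```

Because the dataclass is frozen, publishing never mutates an existing record, which matches the immutability the proposal expects of versioned content.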

Blockstore REST API

Here we present a relatively simple CRUD REST API for the models discussed above.

TBD

Blockstore Garbage Collection

Blockstore, being responsible for immutable, versioned data, can have its storage space eaten up rather quickly if no garbage collection mechanism is in place. Consider also, for example, that in a really large deployment, multiple terabytes of video assets may be stored in an object storage solution, and many of those videos may have been orphaned at some point. Unless an automated process collects and removes these orphaned assets, much of the object storage cost will be pure waste.

Old Versions

Blockstore doesn't store a Unit's OLX as a diff against the previous version, the way delta-based version control storage would. Since Units are immutable, and each new version of a Unit contains all overlapping content from the previous version rather than the diff, the total size of all of a Unit's versions can quickly grow out of hand.

However, since every Version carries a timestamp, it's trivial to define a Celery task that iterates through all versioned content and deletes anything older than a configured number of days.

So, Blockstore can simply define a Celery task to do exactly that as its garbage collection mechanism for old versions. We'd have to take care of edge cases, such as when the latest version of some content itself meets the age criterion and must nevertheless be kept, but that is a simple piece of logic.
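
The selection logic such a task would need can be sketched in plain Python, independent of Celery and the Django ORM. `VersionRecord` and `select_expired` are hypothetical names; the sketch assumes that "latest" means the newest timestamp per piece of content and that only content flagged with `garbage_collect` is eligible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for one stored version of a Unit or Asset."""

    content_uuid: str
    timestamp: datetime
    garbage_collect: bool = True


def select_expired(
    records: Sequence[VersionRecord], max_age_days: int, now: datetime
) -> List[VersionRecord]:
    """Return the versions that are safe to delete.

    A version is deleted only if it is flagged for garbage collection,
    is older than the cutoff, and is not the latest version of its content.
    """
    cutoff = now - timedelta(days=max_age_days)

    # Find the newest version of each piece of content.
    latest: Dict[str, VersionRecord] = {}
    for record in records:
        current = latest.get(record.content_uuid)
        if current is None or record.timestamp > current.timestamp:
            latest[record.content_uuid] = record

    return [
        r
        for r in records
        if r.garbage_collect
        and r.timestamp < cutoff
        and r is not latest[r.content_uuid]  # edge case: always keep the latest
    ]
```

In a real task, the returned records would be deleted in batches; keeping the selection as a pure function makes the edge case easy to test on its own.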

Orphaned Assets

Many Assets may become irrelevant as higher-quality or edited Assets replace their older counterparts. Those replaced Assets are effectively orphaned, meaning they aren't used in any Unit anymore, but they are still physically available through the object storage service and are thus taking up space.

Detecting orphaned Assets and garbage collecting them can be done immediately as they become orphaned, or on a scheduled interval. We only consider the former option here, and the latter is essentially equivalent to what was described above about garbage collecting old versions of content.

Before considering how to implement this, it's important to think about what action needs to happen for an Asset to become orphaned.

An Asset will become orphaned when the size of its set of Units is 1 and the remaining Unit removes its link to that Asset. So the only way to immediately know if an Asset has been orphaned is to look for the removal of the last Link between an Asset and any Unit at all.

Thus, to implement the first option of being able to remove orphaned Assets immediately:

Listen in on the delete signal for Links.
If the Asset in the Link has only a single member in its set of Units, delete the Asset if Asset.garbage_collect == True, since the removal of this Link is the last before the Asset is orphaned by the above definition.
Continue with the removal of the Link.
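
The steps above can be sketched with a small in-memory registry. This is an illustration of the logic only: `LinkRegistry` and its methods are hypothetical, and in a Django implementation the same check would more likely live in a receiver for the `pre_delete` signal on `Link`.

```python
from typing import Dict, List, Set


class LinkRegistry:
    """Hypothetical in-memory stand-in for the Link table."""

    def __init__(self) -> None:
        self._units_by_asset: Dict[str, Set[str]] = {}
        self._garbage_collect: Dict[str, bool] = {}
        # Records which Assets were deleted, in order (stand-in for real deletion).
        self.deleted_assets: List[str] = []

    def add_asset(self, asset_id: str, garbage_collect: bool = True) -> None:
        self._units_by_asset[asset_id] = set()
        self._garbage_collect[asset_id] = garbage_collect

    def link(self, unit_id: str, asset_id: str) -> None:
        self._units_by_asset[asset_id].add(unit_id)

    def unlink(self, unit_id: str, asset_id: str) -> bool:
        """Remove one Link; return True if the Asset was garbage collected."""
        units = self._units_by_asset[asset_id]
        if unit_id not in units:
            raise KeyError(f"{unit_id} is not linked to {asset_id}")

        # The Asset is about to be orphaned if this is its last remaining Link.
        is_last_link = len(units) == 1
        collect = is_last_link and self._garbage_collect[asset_id]
        if collect:
            self.deleted_assets.append(asset_id)

        # Continue with the removal of the Link itself.
        units.discard(unit_id)
        if collect:
            del self._units_by_asset[asset_id]
            del self._garbage_collect[asset_id]
        return collect
```

An Asset whose `garbage_collect` flag is false stays registered with an empty set of Units, so a scheduled job could still pick it up later if the flag changes.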

Blockstore Operations

Blockstore is an independently deployable (Django) application (IDA), or "microservice". It is operationally quite simple, and we give some recommendations below on how to operate it.

Database

The main concern for an operator of Blockstore is the handling of its dedicated database, which is where Blockstore stores instantiations of its models described above.

The database itself should be relational to allow us to utilize the Django ORM.

Django ships with backends for several relational databases, including the very popular MySQL and PostgreSQL. Blockstore won't rely on any feature specific to either, so operators can use either one with the correct configuration; the recommendation, however, is to use MySQL. Consider that:

Most Open edX operators currently use 3 different database systems (MongoDB, MySQL, Elasticsearch).
At scale, each of these must be clustered, replicated, backed up, etc.
Operators must be decently acquainted with each system to actually operate it.

Given the above, it is operationally inconvenient to introduce another database system to the mix. If there were a push for using a new and different system for all new services from here on, that would be a different story, but we aren't aware of such a movement.
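
As an illustration of the recommended setup, a Django settings fragment for a dedicated MySQL database might look like the following. The environment variable names and default values are assumptions made for this sketch; only the `DATABASES` structure and the engine path come from Django itself.

```python
import os

# Hypothetical settings fragment: the BLOCKSTORE_DB_* variable names and the
# defaults are illustrative assumptions, not part of the proposal.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": os.environ.get("BLOCKSTORE_DB_NAME", "blockstore"),
        "USER": os.environ.get("BLOCKSTORE_DB_USER", "blockstore"),
        "PASSWORD": os.environ.get("BLOCKSTORE_DB_PASSWORD", ""),
        "HOST": os.environ.get("BLOCKSTORE_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("BLOCKSTORE_DB_PORT", "3306"),
    }
}
```

Since Blockstore would not depend on engine-specific features, an operator who prefers PostgreSQL would only need to change `ENGINE` to `django.db.backends.postgresql` and adjust the connection values.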

Object Storage

Using a database to store binary objects isn't scalable, so we need to operate (or utilize the third party services of) some dedicated object storage solution that optimizes for it.

Since an Asset uses FileField, it is blind to the underlying storage being used, as long as the object storage exposes a URL for the resource. Therefore we can use any object storage that has an implementation for the File and Storage Django interfaces.

AWS S3 is an excellent candidate, and the django-storages application implements those interfaces.
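
A settings fragment along these lines would point every `FileField`, including the one on `Asset`, at S3. It is a sketch that assumes Django 4.2 or later (for the `STORAGES` setting) with django-storages installed; the bucket name and the environment variable are made up for illustration.

```python
import os

# Hypothetical settings fragment, assuming Django 4.2+ and django-storages.
STORAGES = {
    # Default file storage: used by FileField, and therefore by Asset.file.
    "default": {"BACKEND": "storages.backends.s3boto3.S3Boto3Storage"},
    # Overriding STORAGES replaces the whole dict, so keep static files explicit.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage"
    },
}

# django-storages reads the target bucket from this setting.
# The variable name and default bucket are illustrative assumptions.
AWS_STORAGE_BUCKET_NAME = os.environ.get("BLOCKSTORE_ASSET_BUCKET", "blockstore-assets")
```

Swapping in a different object storage service would then be a matter of changing the `default` backend, with no change to the `Asset` model.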

Security

Clients will be primarily interacting with the Compositor service to make good use of Blockstore, so Blockstore would be mostly on the "internal" side of Open edX, meaning there isn't a real need for it to be publicly accessible. Blockstore can simply listen on the loopback interface in a sandbox setup, or if given a dedicated (virtual) machine, interact only through a private network.

Blockstore's object storage solution needs to grant Blockstore read and write access, since many functionalities, including import and export, involve uploading and downloading Assets.

In this proposal, no mention has been made of authentication and authorization in the context of Blockstore. If Blockstore is not publicly accessible, this is in fact not a problem. Other services that do implement authentication and authorization at their REST API, like potentially Compositor, can serve as a sort of "proxy" to Blockstore; if clients successfully authenticate/authorize against said service, said service interacts with Blockstore on behalf of the client through its own interface's abstractions. In a sense, Blockstore is a part of a "domain layer", Compositor a part of an "application service layer", and clients simply don't interact directly with the domain layer, but rather only through the use case abstractions provided by the application service layer.

Blockstore & Friends

Blockstore alone is a perfectly fine bag of content, but reaching into that bag and giving meaning to the handful is a practical requirement that can (and should) be met by other services.

Tagging Service

TBD

Compositor Service

Blockstore only consists of the fundamental pieces of learning content, and does not worry about composing more meaningful constructs out of them, such as edX Courses or LabXchange Pathways. Indeed, each specific construct comes with its own ways of giving meaning and function to certain metadata, and it would surely be outside Blockstore's scope to understand these constructs in depth in order to fulfill potentially vastly different requirements. That is where a separate service dedicated to that purpose comes in, namely the Compositor service.

To provide some more motivation, consider that although Collections are versioned, and courses/pathways can correspond directly to all of the contents inside of a particular collection, that wouldn't be enough to define what a version consists of for higher-level constructs like a course or a pathway. That's primarily because arbitrary structures, whether they're courses or pathways, can have arbitrary definitions for what a version means for them at their level. Thus something that understands these arbitrary, higher-level constructs more in-depth should be the one responsible for also implementing versioning for them in a way that makes sense for those constructs, since Blockstore should not and cannot be responsible for understanding them.

The Compositor service, then, can use Blockstore's REST API to gather the UUIDs of Units in a Collection, and create a construct of arbitrary form out of those. When it needs to materialize the construct, it can contact Blockstore to get the actual content associated with the UUIDs, create the construct using its built-in knowledge of it, and return it to the client.
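
The two-phase flow described above can be sketched as follows. Since the REST API is still TBD, everything here is hypothetical: `BlockstoreClient` is an assumed interface, `InMemoryBlockstore` is a stand-in for real HTTP calls, and a "pathway" is reduced to an ordered list of Unit UUIDs.

```python
from typing import Dict, List, Protocol


class BlockstoreClient(Protocol):
    """Assumed shape of a Blockstore API client (hypothetical)."""

    def list_unit_uuids(self, collection_uuid: str) -> List[str]: ...

    def get_unit_content(self, unit_uuid: str) -> bytes: ...


class InMemoryBlockstore:
    """Stand-in for Blockstore: maps collection -> {unit uuid: OLX bytes}."""

    def __init__(self, collections: Dict[str, Dict[str, bytes]]) -> None:
        self._collections = collections

    def list_unit_uuids(self, collection_uuid: str) -> List[str]:
        return list(self._collections[collection_uuid])

    def get_unit_content(self, unit_uuid: str) -> bytes:
        for units in self._collections.values():
            if unit_uuid in units:
                return units[unit_uuid]
        raise KeyError(unit_uuid)


def compose_pathway(
    client: BlockstoreClient, collection_uuid: str, wanted: List[str]
) -> List[str]:
    """Phase 1: build a construct from UUIDs only, without fetching content."""
    available = set(client.list_unit_uuids(collection_uuid))
    missing = [u for u in wanted if u not in available]
    if missing:
        raise LookupError(f"units not in collection: {missing}")
    return list(wanted)


def materialize(client: BlockstoreClient, pathway: List[str]) -> List[bytes]:
    """Phase 2: fetch the actual content for each UUID, in pathway order."""
    return [client.get_unit_content(u) for u in pathway]
```

Keeping the construct as a list of UUIDs until it must be materialized means the Compositor can store and version its own constructs without copying any Unit content out of Blockstore.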

Further discussion of the Compositor service is out of scope for this proposal, but the question of how the Compositor service has knowledge of these constructs (e.g. edX Courses or LabXchange Pathways) will likely be of most importance when considering a proposal for the service.

Architecture and Engineering