This is a rough, speculative proposal for a direction the Blockstore implementation can take, and builds on the Harvard LabXchange Blockstore Proposal. There are going to be a lot of statements about what Blockstore is and isn't to establish the outline of this approach, but no part of this proposal has been accepted in any way.

General Themes

The high level ideas that ground this proposal (in no particular order) are:

  1. Content Bundles are the only thing that Blockstore stores. Content Bundles are a local grouping of files.
    Blockstore doesn't understand much about the things inside of it. I'm using the term "Content Bundle" here instead of Unit because in practice many Content Bundles will likely be Sequences, and it's confusing to use the word "Unit" given its Studio connotations. But in any event, there is no special data structure within Blockstore itself for Sequences vs. Units vs. anything else. OLX content, smaller assets like images, and larger assets like videos are all treated in a uniform way.
  2. Blockstore uses a file-based abstraction, and lives at a lower level than XBlock. Content reuse relies on file conventions.
    So for instance, an entire static learning sequence might be stored together in one Content Bundle, but it would be broken up so that there is one file for the Sequence outline, and separate files for each Unit. Those Units could be included from a Sequence in a different Content Bundle. Blockstore supports a kind of symlink functionality to enable using files from a different Bundle. The stored content is post-processed into something read-optimized before learners interact with it.
  3. Blockstore represents author intent and grouping. It favors author-friendliness even if it makes certain bookkeeping harder.
    A Content Bundle in Blockstore is something an author wants to edit, version, import, and export as a single thing. That means a Bundle can be a single problem or an entire sequence. Things stored in Blockstore are not read-optimized, and are not the data structure that students interact with in the end. The definitions of a mostly static learning sequence and a learning sequence with an adaptive component might look completely different when stored in Blockstore, even if the Learner eventually experiences them in a similar way. The imported and exported form of a Content Bundle should be as author-friendly as possible – assets are grouped together with where they're used, and as few Blockstore concepts as possible should leak into how the content is written.
  4. Versioned content is the core of Blockstore, and plugin extensibility is focused around annotating that content.
    Things that create, transform, update, or execute content live outside Blockstore. Plugins know when content has been changed, but they don't modify content. Plugins maintain their own data and APIs. Plugin data changes can happen outside of the lifecycle of the content itself. This means that an export of the same version of a Content Bundle will always yield the same authored content, but may yield different plugin metadata (example: new tags that were added). Also, versions are meaningful, and not every edit of every file spawns a new version. A version is like a "commit" in that sense.

Blockstore is a lower-level system that enables the storage, versioning, tagging, and re-use of content. Much of the intelligence required to work with that content lives outside of Blockstore itself. So for instance, Blockstore has no concept of Units, Sequences, or XBlocks (side note: the name Blockstore makes a little less sense in this proposal), beyond the notion that there's a "type" called "olx/unit".

Layers

Name | Responsibilities
Core

This is the storage of the content itself, and essential mechanisms for updating it. This layer doesn't understand anything about the actual files in the Content Bundle.

  • Content data models: Content Bundles, Versions, Links, Drafts (side note: it's possible that Drafts can move out of Core)
  • Roles
  • Signals
Persistence

Low level swappable piece that determines how we store the files for bundles.

  • Must support at least S3 and the local file system.
  • Extension point.
  • It'd be nice to use django-storages, but I'm not sure if it supports all the features we'll want, like setting headers dynamically (e.g. Content-Disposition).
  • Graceful handling of large assets matters.
Plugin

This is the more extensible layer that manages the constellation of metadata about a Bundle. It should be easy to add these over time, possibly in a separate repo. This layer actually does understand the contents, and might subscribe to events for particular content types.

  • Tagging
  • Search
    • The discovery aspect is interesting because XBlocks will exist in various Content Bundles at different granularities depending on the lifecycle/requirements. But I should still be able to say "What are all the capa problems in my Collection" regardless of whether they're standalone or in a Sequence.
  • Webhook notifications
  • Licensing
  • Dispatch
  • Import hooks (per-type) for things like validation.
Execution (external process)

The XBlock runtime that actually executes content, probably at a Unit level. This lives in a separate process and needs to know how to grab content from Blockstore for the purpose of preview and authoring.

Blockstore Core Layer Concepts

Term | Definition
Content Bundle

A group of files that are versioned together and can be accessed from other Content Bundles.

  • Blockstore stores a UUID and basic metadata about a Content Bundle, like a title, slug, and "type".
    • UUID never changes.
    • Slug must be unique within the Collection. Should be immutable...?
  • Blockstore does not parse or understand the actual contents of the files in Content Bundles, though it may know how to delegate certain actions like preview and editing based on type.
Bundle Version

An immutable snapshot of a Content Bundle.

  • Once created, the contents of the files in a given Bundle Version do not change.
  • Metadata about a Version can change though, like tagging. That will often be asynchronously updated after a new Bundle Version has been created.
  • Creating a new Bundle Version emits a signal that other parts of the system will listen for, similar to how course_publish works today.
  • Importing creates a new Version directly (if any changes were made). It does not interact with Drafts.
  • Versions and even entire Content Bundles can be force deleted if necessary to deal with copyright violations or other inappropriate content.
    • This will likely break any Content Bundles that have Links to the deleted Content Bundle.
    • Force deletion may not be possible for content that is copied/referenced across Blockstore instances, but that entire use case is still very fuzzy right now.
Link

The method by which one Content Bundle references and uses content from another. Since a Content Bundle is a collection of files, this is conceptually a symlink between one Content Bundle Version and another, via a special named folder (e.g. `.blockstore/links/wand_tutorial/videos/wand_demo.mp4`).

  • The version number is not encoded into the symlink path. This is so that updating to the next Version doesn't require changing the path.
  • A Link has:
    • An alias – the name of the symlink, essentially.
    • A target BundleVersion.
      • For now, all Links are assumed to be local to the server instance, but later we could add an optional server namespace field.
  • Since a Bundle Version is immutable, updating a Link results in a new Version.
  • Cycles are not allowed (bad things happen, like infinite version bumps).
  • A Content Bundle Version cannot have Links to multiple Versions of another Content Bundle. So A.v1 can make a Link to B.v1 or B.v2, but A.v1 cannot simultaneously access files from both B.v1 and B.v2.
  • Exporting multiple Content Bundles and preserving versioned Link information will likely require a smart client that knows how to pull down Content Bundle Versions to a shared namespace and then generate the necessary symlinks on disk. More on that later.
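As a rough sketch of the Link rules above, the following Python models a Link as an alias plus a target BundleVersion, resolves a `.blockstore/links/{alias}/...` path, and rejects Links to two Versions of the same Bundle. The names `Link`, `resolve_link_path`, and `check_single_version_per_bundle` are hypothetical and are not part of any accepted Blockstore API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

LINKS_PREFIX = ".blockstore/links/"


@dataclass(frozen=True)
class Link:
    alias: str           # the name of the symlink
    target_bundle: str   # identifier of the target Content Bundle (hypothetical field)
    target_version: int  # version number of the target BundleVersion


def check_single_version_per_bundle(links: Iterable[Link]) -> None:
    """Raise if the Links point at two different Versions of the same Bundle."""
    seen: Dict[str, int] = {}
    for link in links:
        previous = seen.setdefault(link.target_bundle, link.target_version)
        if previous != link.target_version:
            raise ValueError(
                f"{link.target_bundle} is linked at versions "
                f"{previous} and {link.target_version}"
            )


def resolve_link_path(path: str, links: Dict[str, Link]) -> Tuple[Link, str]:
    """Split a link path into (Link, path inside the target BundleVersion)."""
    if not path.startswith(LINKS_PREFIX):
        raise ValueError(f"not a link path: {path}")
    alias, _, inner = path[len(LINKS_PREFIX):].partition("/")
    return links[alias], inner
```

Note that the path passed to `resolve_link_path` carries no version number; the version comes only from the Link record, which is what lets a Link be updated without changing paths in the content.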
Version Range

A lot of annotation around a Content Bundle (e.g. tagging around teaching standards) is going to be about content that may change over time as new Versions are created. Associating them with the Content Bundle as a whole might be inaccurate. Associating them with specific Versions might be wasteful, particularly when the changes are relatively minor. Treating a Version Range as a first class concept would help to simplify data modeling in other parts of the system.

  • Version Ranges need a start version, but the end version can be null (i.e. open-ended).
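A minimal sketch of a Version Range with an open-ended upper bound might look like the following. The class name, the integer version numbers, and the choice of an inclusive end are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionRange:
    start: int                 # first Version the annotation applies to (inclusive)
    end: Optional[int] = None  # last Version (inclusive, by assumption); None = open-ended

    def __post_init__(self) -> None:
        if self.end is not None and self.end < self.start:
            raise ValueError("end version must not precede start version")

    def contains(self, version: int) -> bool:
        """Return True if the given Version falls inside this range."""
        if version < self.start:
            return False
        return self.end is None or version <= self.end
```

A tag attached to `VersionRange(start=3)` would then apply to Version 3 and every later Version until someone closes the range.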
Draft

A mutable space for changes to be made before they are committed to a Bundle Version.

  • There are no "draft" vs. "published" branches. Drafts get committed to Bundle Versions, and Bundle Versions keep increasing. Because versions are shared and immutable, there may be multiple Versions of a given Bundle that are live in different courses at any given point.
  • Import creates new Bundle Versions directly and does not interact with Drafts.
  • Under the covers, Drafts are copy-on-write.
  • Edits of Drafts are done at the individual file level, so it's possible for two people to be concurrently editing different units in the same sequence, as long as the units are separate files.
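One possible reading of "copy-on-write" with file-level edits is an overlay that records only the changed paths on top of an immutable BundleVersion file map. This is a sketch under that assumption; the `Draft` class and its methods are hypothetical.

```python
from typing import Dict, Optional


class Draft:
    """Copy-on-write overlay on top of an immutable BundleVersion file map."""

    def __init__(self, base_files: Dict[str, str]) -> None:
        self._base = base_files                       # path -> content hash, never mutated
        self._changes: Dict[str, Optional[str]] = {}  # path -> new hash, or None if deleted

    def write(self, path: str, content_hash: str) -> None:
        """Record a new or edited file without touching the base Version."""
        self._changes[path] = content_hash

    def delete(self, path: str) -> None:
        """Record a deletion without touching the base Version."""
        self._changes[path] = None

    def files(self) -> Dict[str, str]:
        """Return the file map that committing this Draft would produce."""
        merged = dict(self._base)
        for path, content_hash in self._changes.items():
            if content_hash is None:
                merged.pop(path, None)
            else:
                merged[path] = content_hash
        return merged

    def has_changes(self) -> bool:
        return self.files() != self._base
```

Because each edit is keyed by file path, two authors editing different Unit files touch different entries in the overlay.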
Collection

1:M grouping of Content Bundles.

  • There is no Collection-level versioning. It's a pointer to a bunch of Content Bundles which have their own versions.
  • Ownership is captured at the Collection level.
  • Permissions are determined at the Collection level.
  • Licensing is determined at the Collection level.
  • Mapping today's concepts:
    • A Course would have most of its content in one Collection.
      • Multiple runs of the same course would be in the same Collection. So perhaps it's more accurate to say that the contents of a CatalogCourse are in a Collection?
    • A Content Library would map to a Collection.
  • The Use Cases here are a bit different. Some of it overlaps with Discovery, but a lot would deal with management – what's outdated content, give me the list of all things that have property X, etc.
  • Import/export could happen at the Collection level, but in practice we'd want to very strongly lean towards allowing subsets of Collections to be imported or exported.

Questions:

  • Would this be problematic for changing licensing information over time? Does it need to be fixed to particular Content Bundle Versions?
  • Is there some notion of related Collections? A course might use a bunch of problems from a problem bank – they might be different Collections, but it seems useful to be able to associate them? Maybe that's overdoing it?
  • How do we delete Collections? Safe if empty? Safe if none of the Content Bundles are referenced externally?
Signal

We'll emit named Django signals for life cycle events around the Core layer, including:

  • Content Bundle creation and deletion.
  • Version creation and deletion.
    • Link creation and deletion (happens on Version creation)
  • Collection creation, update (e.g. title, ownership, roles) 

Higher Level Concepts

These concepts exist for the systems that author content in Blockstore, but Blockstore itself is unaware of them.

Term | Definition
Learning Context

A grouping against which student state is stored.

  • General state is stored via a (context, student, block) tuple.
  • If a student sees the same problem block in multiple places within a context, their state (answer, saved work) carries over.
  • If a student sees the same problem block in a different context, their state (answers, saved work) does not carry over.
  • Courses and Pathways reference Learning Contexts but they are not Learning Contexts themselves. The common case will be that each Pathway will use a separate Learning Context, but there may be situations in which you'll want multiple Pathways to use the same Learning Context.
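The (context, student, block) keying described above can be illustrated with a small in-memory store. This is only a sketch; `StateStore` and its methods are hypothetical names, and real student state storage would be backed by a database.

```python
from typing import Any, Dict, Tuple

StateKey = Tuple[str, str, str]  # (context_id, student_id, block_id)


class StateStore:
    """Toy store showing that state is scoped to a Learning Context."""

    def __init__(self) -> None:
        self._state: Dict[StateKey, Dict[str, Any]] = {}

    def save(self, context_id: str, student_id: str, block_id: str,
             state: Dict[str, Any]) -> None:
        self._state[(context_id, student_id, block_id)] = dict(state)

    def load(self, context_id: str, student_id: str, block_id: str) -> Dict[str, Any]:
        # Missing state yields an empty dict rather than an error.
        return dict(self._state.get((context_id, student_id, block_id), {}))
```

The same block seen twice within one context reads the same key, so answers carry over; the same block in another context reads a different key and starts empty.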

Open Issues:

  • The case of XBlock state seems straightforward, but does all state really follow this (e.g. completion, scoring, gating, etc.)?
Block

The smallest piece of content that can be authored. This maps to a leaf-node XBlock ("Component" in Studio) in edx-platform today.
Unit

An ordered list of Blocks that is the smallest chunk in which content can be consumed.

  • A single page worth of content.
  • Maps to a VerticalBlock today, but that name is horrible – we should either create a new Unit Block or make it an alias of Vertical (that's really what it should have been named in the first place). Making a new Block might allow us to drop support for some legacy cruft.
  • While Blocks can be almost anything, Units are expected to have some common fields like title, possibly an icon, a URL that can be independently rendered, etc.
Sequence

Sequences are a linear group of Units.

  • For existing courses, Sequences are largely static, though there are exceptions in randomized problems, A/B testing, and the adaptive use cases.

Why Files?

This proposal leans on the file system metaphor more strongly than the initial proposal. Some advantages to this:

  1. We can unify content storage and lifecycle updates.
    Studio currently stores authored course content in three different ways. Data that manifests as content and settings scoped XBlock field data are stored in the ModuleStore, as sort-of versioned documents in MongoDB. Smaller binary assets like images and PDF files are stored without versioning via the ContentStore interface, which writes to GridFS. For performance reasons, self-hosted videos are typically put in S3. Treating these in a more uniform way will help when building add-on functionality like search, tagging, and licensing. It should also lower cost and operational complexity.
  2. It simplifies the overall design.
    Blockstore offers the capability to store, version, and reuse content. The more it knows about the internals of that content, the more coupled it will be, making it hard to adapt and make changes. For example, adding a major feature like internationalization of OLX should require no changes to the core of Blockstore. Files also help with the partial sharing use cases (more on that in the Content Reuse section).
  3. Assets can get surprisingly sophisticated.
    Splitting the world into sophisticated XBlock-like data and simple binary blob assets is intuitive, but many assets are more sophisticated than they appear. A Video can be represented by a single mp4 file, or it can be a set of HLS files with different quality video encodings, multiple audio tracks, and subtitles for many languages. An ebook may come in three different formats, that are grouped, licensed, and updated in sync with each other. Even simple assets can become more complex when internationalization comes into play.
  4. It leads to more author-friendly grouping for reuse.
    OLX and static assets can live side by side in the same versioned Bundle, without requiring any external links. If there are precursor files like source LaTeX files that compile out to OLX, those can be stored in the same Bundle. They would be ignored by the XBlock runtime, but still be very valuable to store, version, and share.
  5. We need a serialization format for import and export anyway.
    Import and export are going to be a critical part of the supported workflow. This lets Blockstore be agnostic to OLX conventions that are handled at the XBlock runtime layer. The only file conventions it imposes are its own simple notions of Bundle metadata and Links.

Using Blockstore to Model Courseware

So what would the Content Bundles look like?

Content Bundle File Conventions

  • Blockstore exposes Bundle-level metadata as a .blockstore directory.
    • This folder is fully virtual (nothing actually exists there on S3, and it is optionally materialized on export).
  • Metadata is exported as JSON files.
  • All Links to other Content Bundles will be of the form .blockstore/links/{alias}
    • Link mapping is stored in the .blockstore/links.json file.

Standalone Problem (leaf-level XBlock, appropriate for problem banks)

quicksort_complexity/
                     # Actual definition of the OLX for the problem.
                     problem.xml
 
                     # Convention: Files in static/ will be available to the browser during
                     # execution. For security reasons, we wouldn't just serve them straight
                     # from Blockstore, but the LMS will need to know what's okay to send to
                     # the browser vs. what might have solutions encoded into it.
                     static/         
                            diagram.png
 
                     # Blockstore metadata
                     .blockstore/
                                 info.json   # UUID, instance, title
                                 links.json  # Empty dictionary -- no links here

Video (centrally managed, probably in a separate Collection)

the_perfect_egg/
               # VideoModule OLX
               video.xml
               static/
                      hls/
                          playlist.m3u8
                          
                          # Dozens of alternate languages and encodings would
                          # be in the following directories:
                          audio/
                          subs/
                          video/

               links/
                     wand_tutorial_part_1/video.xml
                     wand_tutorial_part_2_feathers/video.xml
 
               # Blockstore metadata
               .blockstore/
                           info.json   # UUID, instance, title
                           links.json  # Empty dictionary -- no links here


Static Sequence (either standalone pathway or part of a Course)

Course Run Example

The core part of this would be some sort of document that has navigation information and provides pointers to all the different sequences that make up a course. For example, you could have a Content Bundle that defined an XML file that looks something like:

<course-navigation>
    <chapter name="Magic Basics">
        <!-- Policy related information like deadlines should live in a separate file, to make reuse easier. -->
        <!-- It's possible that we could treat the Course as its own Context, but it might be nice to be able
             to manually specify what context each sequence should be considered a part of. -->
        <sequence src="links/magic_source/sequence.xml"/>
        <exam src="links/magic_ethics_exam/exam.xml"/>
    </chapter>
    <chapter name="Magic History">
        <!-- etc. -->
    </chapter>
</course-navigation>

Some opinions that this approach has:

  • Sequences are separate entities, Chapters (and other hierarchy between Course and Sequences) are details of Course.
    In this particular mindset, "chapters" don't really exist as a separate entity, but only as a navigational convenience of Course. Everything is sequences and things that point to sequences. If a Course wants a flat list of sequences or a hierarchy an extra level deep – that's purely a Course Navigation concern. Unlike today, you wouldn't expect to be able to call a student_view on a chapter (that doesn't really work well in practice, but the framework currently allows for it).
  • Course navigation isn't an XBlock, and might not even be OLX.
    For content compatibility reasons, it's really important that the individual Units be OLX. Maybe the Sequences as well, though I think there's a decent case to be made either way there (and even if it is OLX, it doesn't necessarily have to be backed by an XBlock runtime). But the things-that-point-to-sequences don't necessarily need to be OLX. The data model for courses/chapters is a combination of a simple container hierarchy that can be translated to in a straightforward way, and a mess of global policy attributes that need to be moved out. It should have some reasonably human-readable serialized format, but whether that's XML, JSON, YAML, or something else that better fits our "X + overrides" use cases is an open question.
  • This is not the LMS data representation.
    The way sequences are referenced here would be terribly inefficient if we had to make read calls to each sequence to get the titles to display. When we actually get to the representation in the LMS, we might want a uniform way to specify sequences that enables much more dynamic behavior by relying on queries of sequence relationships. But Blockstore is for authoring, and surfacing whatever is simplest and most intuitive for that.

Content Reuse

Content Bundles are created with the intended granularity of reuse. If you intend to have a bunch of problems for a problem bank, a Bundle is an individual problem. If you have a sequence that can be reused in multiple contexts, then the Bundle is your entire sequence. To use either of these, you would make a Link to it and reference the appropriate linked OLX file from a file in your own Bundle (e.g. a CourseRun definition referencing the above Sequence).

Reusing Content Bundles in their entirety is intended to be the common case, but it's possible that someone will want to reuse a small part of an existing Bundle. In that case, you can still use Links and reference a file directly, such as an individual Unit or image that's part of a larger sequence.

Reuse that the original author designed for is done by referencing Bundles; reuse that is permitted but was not designed for by the original author is done by referencing specific files within Bundles.

Implementation Details

A non-exhaustive stab at some of the key implementation issues.

Storage

We have to be able to handle files that run from very small to extremely large. The metadata for Bundles will use the Django ORM, but the storage interface for Bundle files will be a pluggable backend that supports at least S3 and the local filesystem. Using django-storages would be nice, but I'm not entirely sure it supports all the features we'd want (e.g. "Content-Disposition" header support so we could reuse the same blob of data with different names).

Git and Mercurial have come up as possible backing stores as well. The main reason I avoid them is unknown operational complexity. At the granularity of separate repositories, even a straightforward port of edx.org content could yield something on the order of a million repositories, with over 50 TB of video data. We already manage that in S3, but we have no experience running git repositories at this scale. AWS's EFS (their hosted NFS solution) performs horribly for git workloads, so we'd have to manage the storage ourselves. Our usage of video means that git-lfs is likely a requirement. None of this is insurmountable, but it raises the uncertainty and costs, and we likely won't take advantage of enough of these systems' features to justify it.

Storage in an S3 store could look like:

  • /{bundle_uuid}/data/{file named after hash}
    All source data files are stored by hash, to allow for cheap renames across versions. URLs can be sent with custom content-disposition headers to enable browsers to download them with sensible filenames. Using the UUID as the start of the name makes us less likely to run into per-partition performance throttling from S3.

  • /{bundle_uuid}/versions/{version}/mapping.json
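The key layout above could be computed along these lines. This is a sketch only: the proposal does not name a hash algorithm, so SHA-256 is an assumption, and `data_key` and `mapping_key` are hypothetical helper names.

```python
import hashlib


def data_key(bundle_uuid: str, content: bytes) -> str:
    """Storage key for a file's bytes, named after the content hash.

    SHA-256 is an assumption; the proposal only says "file named after hash".
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"/{bundle_uuid}/data/{digest}"


def mapping_key(bundle_uuid: str, version: int) -> str:
    """Storage key for the per-Version mapping of file names to hashes."""
    return f"/{bundle_uuid}/versions/{version}/mapping.json"
```

Since the data key depends only on the bytes, renaming a file between Versions changes the entry in `mapping.json` but does not require writing the data again.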

Q: What should be done in a situation where an asset was marked as public in an earlier version but private in a later version (or vice versa)?

Import/Export

Simple import and export of a single Bundle can be done with a tar.gz or zip file. But import and export that involves multiple Bundles at a time would benefit from a command line tool that could talk to our API – particularly if it wants to export a whole Course's worth of Bundles and preserve Links to other Bundles (sometimes even multiple versions of the same Bundles).

Scale and Schema

Current edx.org has approximately:

  • ~7K courses
  • ~5K content libraries
  • ~40 TB of content data
    • The vast majority of that is video at various encodings (including the raws)
    • ~2 TB of non-video static assets
    • ~400 GB of versioned XBlock content data

Our design goal would be to support a 100X increase in this, ideally without requiring partitioning, since that's not compatible with foreign key constraints. We've had operational experience with tables over 1B rows, but we probably don't want to push our design beyond that if possible.

Model | Rows for edx.org | Rows at 100X
Collection | ~10K | ~1M
Bundle | ~1M (~20 per course, more for content libraries, plus each video) | ~100M
BundleVersion | ~10M | ~1B

Limits we're going to set for Links and Files:

  • Max 100 Files per BundleVersion
  • Max 1,000 total dependencies for a BundleVersion (including dependencies of dependencies)

If we assume these limits, then naively creating rows on a per-BundleVersion basis will quickly explode the tables for Links and Files beyond what we want. One approach around this is to more smartly collapse redundant information across Bundle Versions, and another is to take the data out of the database entirely and into the file store.
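To make the "explode" concern concrete, here is the worst-case arithmetic using the row estimate from the table above and the two limits. These are upper bounds assuming every BundleVersion hits both limits, which real content would not.

```python
# Upper-bound row counts if Files and Links were stored per BundleVersion.
BUNDLE_VERSIONS_AT_100X = 1_000_000_000   # ~1B BundleVersion rows at 100X
MAX_FILES_PER_VERSION = 100
MAX_DEPENDENCIES_PER_VERSION = 1_000

worst_case_file_rows = BUNDLE_VERSIONS_AT_100X * MAX_FILES_PER_VERSION
worst_case_link_rows = BUNDLE_VERSIONS_AT_100X * MAX_DEPENDENCIES_PER_VERSION

print(f"files: {worst_case_file_rows:,} rows")
print(f"links: {worst_case_link_rows:,} rows")
```

That is up to 100 billion File rows and 1 trillion Link rows, both far beyond the roughly 1B-row tables we have operational experience with.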

BundleVersionFiles

Access patterns:

  • Common:
    • Get URL for single file in a BundleVersion.
    • Get names/URLs for all files for a given BundleVersion.
  • Less frequent:
    • What files changed in this BundleVersion?
    • What is the history of this file across all BundleVersions?

The files for a given BundleVersion will be tracked using a summary JSON file per BundleVersion, stored by the file store interface (i.e. in S3).
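With a summary file per BundleVersion, the access patterns above reduce to dictionary operations once the JSON is loaded. The sketch below assumes the `files` section is grouped into `private` and `public` maps of path to hash, as in the example summary file later in this proposal; the function names are hypothetical.

```python
from typing import Dict, Optional, Set

FileGroups = Dict[str, Dict[str, str]]  # e.g. {"private": {path: hash}, "public": {...}}


def flatten(files: FileGroups) -> Dict[str, str]:
    """Merge the visibility groups into a single path -> hash map."""
    flat: Dict[str, str] = {}
    for group in files.values():
        flat.update(group)
    return flat


def file_hash(files: FileGroups, path: str) -> Optional[str]:
    """Common case: look up the stored hash for one file in a BundleVersion."""
    return flatten(files).get(path)


def changed_paths(old: FileGroups, new: FileGroups) -> Set[str]:
    """Less frequent case: paths added, removed, or modified between Versions."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        path
        for path in old_flat.keys() | new_flat.keys()
        if old_flat.get(path) != new_flat.get(path)
    }
```

Answering "what is the history of this file across all BundleVersions?" would mean loading one summary per Version, which is the per-file tracking cost this approach accepts.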

When I tried to model this in SQL with ranges of versions, I had difficulty making something straightforward that could guarantee that the same file name did not exist multiple times for the same BundleVersion (at least not in MySQL, given it lacks range types and support for CONSTRAINT...CHECK). We could flip the model to use (Bundle ID, filename) as the constraint, and have joined tables against that to account for versions, but that makes the very common "what files exist in this Bundle Version?" relatively expensive when there are many files.

A drawback here is that it's not easy to track things at a per-file level. On the bright side, it's really simple to implement and understand.

Link relationships are more complicated, because we expect to be able to query them in various ways:

  • Common:
    • What are all the Links that a given BundleVersion is using?
    • What Linked Bundles have been updated (have newer Bundle Versions)?
    • Can I add this Link without forming a cycle (needed anytime we add a Link)?
  • Reporting/Notifications:
    • What Bundles are using my Bundle?
    • What Versions of my Bundle are being used?
    • How many Bundles use my Bundle?

Avoiding Link Cycles

There are a few problems with cycles:

  • Infinite recursion when following links. This can be worked around by keeping a followed list and being mindful of the possibility, but as an unusual edge case, people would probably not account for it.
  • Infinite version bumping. Say there is a cycle between A and B. Then when B is updated, A will have the option of updating its Link to the new version of B. But doing that will bump the version of A, and B will be prompted to update its Link to the new version of A.
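The "followed list" workaround for the first problem can be sketched as a traversal that remembers what it has already visited. The `links` argument here is a hypothetical in-memory map from a Bundle to the Bundles it links to, used only for illustration.

```python
from typing import Dict, List, Set


def transitive_links(start: str, links: Dict[str, List[str]]) -> Set[str]:
    """Collect every Bundle reachable from `start`, terminating even on cycles."""
    followed: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for target in links.get(current, []):
            if target not in followed:
                followed.add(target)
                stack.append(target)
    return followed


def has_cycle_through(start: str, links: Dict[str, List[str]]) -> bool:
    """A Bundle sits on a cycle if it can reach itself by following Links."""
    return start in transitive_links(start, links)
```

Every caller that walks Links would need this kind of bookkeeping, which is part of the argument for forbidding cycles outright.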

Data Model

So with those constraints, the proposed design:

  • per-BundleVersion file describing the entire set of Link dependencies (including dependency-of-dependencies)
    • We'd probably want to cap it at some number of total dependencies, say 1,000 (some courses have over 500 videos).
    • When one BundleVersion adds another, this file is the only thing that needs to be inspected, since we've captured all transitive dependencies.
  • A table for Bundle Link relationships that has just enough information encoded in it to track basic usage and send notifications.
    • (Borrowing Bundle ID, Latest Is Using, Lending Bundle ID)
      • Does not encode the transitive dependencies.
      • "Latest Is Using" means that the latest version of borrowing bundle is still using the lending bundle.
        • So if I want to see what Bundles are using mine to see what needs notification, I query this table for Lending Bundle = My Bundle and LatestIsUsing == True.
  • Notifications and queryable usage at the Bundle level are in the relational database.
  • Cycle prevention and full dependency expansion happen in a file at the BundleVersion level (stored alongside other BundleVersion data).
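The Bundle Link table and its notification query could look roughly like this. The sketch uses SQLite from the Python standard library purely so it runs standalone; the table and column names are assumptions derived from the tuple above, and the sample IDs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bundle_link (
        borrowing_bundle_id TEXT NOT NULL,
        lending_bundle_id   TEXT NOT NULL,
        latest_is_using     INTEGER NOT NULL,  -- 1 if the latest borrowing Version still links
        PRIMARY KEY (borrowing_bundle_id, lending_bundle_id)
    )
    """
)
conn.executemany(
    "INSERT INTO bundle_link VALUES (?, ?, ?)",
    [
        ("course-a", "video-1", 1),
        ("course-b", "video-1", 0),  # an older Version used it; the latest does not
        ("course-c", "video-1", 1),
        ("course-a", "problem-9", 1),
    ],
)


def bundles_to_notify(connection: sqlite3.Connection, lending_bundle_id: str) -> list:
    """Bundles whose latest Version still uses the given lending Bundle."""
    rows = connection.execute(
        "SELECT borrowing_bundle_id FROM bundle_link "
        "WHERE lending_bundle_id = ? AND latest_is_using = 1 "
        "ORDER BY borrowing_bundle_id",
        (lending_bundle_id,),
    )
    return [row[0] for row in rows]
```

One row per (borrowing, lending) pair keeps this table proportional to the number of Bundle relationships rather than the number of BundleVersions.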

Example of what a summary file for a BundleVersion might look like:

{
  // Metafile format version
  "_meta": {
    "version": 1
  },
  // Information about the BundleVersion
  "info": {
    "id": "HqzGTIKeTyWcFn0hWpXKBA",
    "version": "1",
    "title": ""
  },
  // Map paths to source files
  "files": {
    "private": {
      "course.xml": "d41d8cd98f00b204e9800998ecf8427e"
    },
    "public": {
      "syllabus.pdf": "94287380b7700b204e9800998ecf8421"
    }
  },
  // Links to the Sequences we're using (after @ is version). In
  // the Bundle, these end up as sub-dirs of links/
  "links": {
    "week_001": "47skF26fRayxQ73j48oFmA@1",
    "week_002": "rSTYnCDSSAi_uaxuDKfWYw@1",
    "week_003": "TWoW-EYESzK5bJlEiRx2yQ@1",
    "week_004": "kY6RjNJuSWuia9wUa3D1zg@1",
    "week_005": "MFHKfOP0Qeuuj-Y96Koi8g@1",
    "week_006": "Jd9_H8UVSbeDOTlWsFYgAA@1",
    "week_007": "mz_g-lJXQCCTLlm7ous-LA@1",
    "week_008": "1U0Wymd4TUSS4lGUgTPWCw@1",
    "week_009": "i7oIlRWRTGqY07f81Xbdpw@1",
    "week_010": "urcqDvH6QXqAhD_j-XwbwQ@1"
  },
  // Each Link has associated dependencies for all the things it
  // depends on (including dependencies of dependencies). Our own
  // dependencies are the union of all our Link dependencies.
  // * It's ok if different Links require different Versions of
  //   the same Bundle.
  // * No dependency can be added if any version of this Bundle
  //   is listed as one of its dependencies.
  //
  // The goal is to make dependency calculation and cycle
  // detection very fast when trying to add or update a Link.
  "dependencies": {
    "47skF26fRayxQ73j48oFmA@1": [
      "ruSUc2xjQESeyq0fcu0QRw@2",
      "rcHOkLaoSH-VmTspp1xahA@7",
      "xcdnuGWcT_WoN0ueyULCkA@10"
    ]
    // + a lot more
  }
}
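Given a `dependencies` section shaped like the one above, the union and cycle rules from the comments can be sketched as follows. The function names are hypothetical, and the `id@version` reference format is taken from the example.

```python
from typing import Dict, List, Set


def bundle_id(ref: str) -> str:
    """Strip the version from an 'id@version' reference."""
    return ref.rsplit("@", 1)[0]


def all_dependencies(dependencies: Dict[str, List[str]]) -> Set[str]:
    """Union of every Link target and that target's transitive dependencies."""
    result: Set[str] = set()
    for link_target, transitive in dependencies.items():
        result.add(link_target)
        result.update(transitive)
    return result


def can_add_link(own_bundle_id: str, target_ref: str, target_deps: List[str]) -> bool:
    """Allow a Link only if no Version of this Bundle is the target or one of its dependencies."""
    if bundle_id(target_ref) == own_bundle_id:
        return False
    return all(bundle_id(dep) != own_bundle_id for dep in target_deps)
```

Because the target's transitive dependencies are already expanded in its own summary file, the check is a single pass over one list and needs no recursive lookups.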