Blockstore Proposal Comparison

The proposals compared here are the Original Proposal, the Database Proposal, and the File Proposal.

The Blockstore proposals are fairly long, so this doc summarizes some of the principal differences.


Content Primitives

  • Identified by UUID.
  • Versioned numerically (1, 2, 3, etc.)
  • Tagging metadata is stored outside of the core Blockstore.
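
As a rough illustration of the primitives listed above, here is a minimal Python sketch. The class name `ContentItem`, its fields, and the in-memory `external_tags` dict are hypothetical and do not come from any of the proposals; they only show a UUID identity, numeric versions starting at 1, and tags kept outside the core model.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ContentItem:
    """Hypothetical content primitive: UUID identity plus numeric versions."""
    item_id: uuid.UUID = field(default_factory=uuid.uuid4)
    versions: List[bytes] = field(default_factory=list)  # index 0 holds version 1

    def add_version(self, data: bytes) -> int:
        """Store a new snapshot and return its version number (1, 2, 3, ...)."""
        self.versions.append(data)
        return len(self.versions)

    def get_version(self, version_num: int) -> bytes:
        """Return the snapshot for a given version number."""
        if not 1 <= version_num <= len(self.versions):
            raise KeyError(f"no version {version_num}")
        return self.versions[version_num - 1]


# Tagging metadata lives outside the core model, keyed by the item's UUID.
external_tags: Dict[uuid.UUID, Set[str]] = {}

item = ContentItem()
v1 = item.add_version(b"<problem/>")
v2 = item.add_version(b"<problem display_name='Updated'/>")
external_tags[item.item_id] = {"difficulty:easy"}
```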

Differences: Granularity and Versioning

Original Proposal and Database Proposal

  • Files/Assets are tracked individually.
  • Units are tracked individually.

Original Proposal

  • In addition to per-Unit and per-File tracking, ContentSets (a group of Links) are also versioned.

File Proposal

  • ContentBundles are versioned as a whole, not individual assets inside them.
  • Depending on the intended usage, a ContentBundle could be a single video, a Unit, or an entire Sequence.

Differences: OLX vs. Assets

Original and Database Proposals

  • Separate models for Unit (i.e. a Studio Unit) and Files/Assets (e.g. images, PDFs, video files)
  • Unit OLX content is stored in the database.
  • Files/Assets live in an object store like S3, and are pointed to by rows in the database.
  • All metadata about Units and Assets is stored in the database.
  • Assets used by a Unit are tied together in the database using Links.
  • Advantages
    • Access to OLX has better latency guarantees, particularly for multi-gets.
    • Transactions make it easier to guarantee atomic operations involving many Units/Links/etc.
    • Able to track usage at a fine granularity (e.g. what are all the places this exact version of this image is used?) without requiring external indexing like Elasticsearch.
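
The fine-grained usage tracking mentioned in the last advantage could look roughly like the sketch below. The `link` table and its column names are assumptions made for illustration, not a schema taken from either proposal, and SQLite stands in for whatever database would actually be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Link table: one row per (Unit version -> Asset version) reference.
conn.execute(
    """
    CREATE TABLE link (
        unit_id TEXT NOT NULL,
        unit_version INTEGER NOT NULL,
        asset_id TEXT NOT NULL,
        asset_version INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?, ?)",
    [
        ("unit-a", 1, "diagram.png", 1),
        ("unit-a", 2, "diagram.png", 2),
        ("unit-b", 1, "diagram.png", 2),
        ("unit-b", 1, "notes.pdf", 1),
    ],
)


def find_usages(conn, asset_id, asset_version):
    """Return every (unit_id, unit_version) that uses this exact asset version."""
    rows = conn.execute(
        "SELECT unit_id, unit_version FROM link "
        "WHERE asset_id = ? AND asset_version = ? "
        "ORDER BY unit_id, unit_version",
        (asset_id, asset_version),
    )
    return rows.fetchall()


usages = find_usages(conn, "diagram.png", 2)
```

Because the Links are ordinary rows, the question "where is this exact version used?" is a single indexed query rather than a search-engine lookup.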

File Proposal

  • All content is stored in ContentBundles, each of which is like a small directory of files.
  • The OLX for a Unit would go into an XML file in a ContentBundle.
  • All Bundle content is stored in an S3-like object store.
  • Metadata about what content constitutes a particular version is in the object store, not the database.
  • Assets used by the Unit would go into the same ContentBundle.
  • Advantages
    • Units are more self-contained.
    • Easier to adapt for use cases outside of Open edX, since ContentBundles don't assume an OLX/Assets divide.
    • Easier to associate bundles of related Assets, like a Video's various encodings, subtitles, thumbnails, etc.
    • Cheaper storage.
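
To make the File Proposal's layout more concrete, the sketch below models an object store as a plain dict and writes each ContentBundle version as a manifest that maps file paths to stored blobs. The key layout (`bundles/<id>/versions/<n>.json`), the JSON manifest, and the hash-based blob keys are all assumptions for illustration; the proposal only says that version metadata lives in the object store rather than the database.

```python
import hashlib
import json
from typing import Dict

# Stand-in for an S3-like object store: key -> bytes.
object_store: Dict[str, bytes] = {}


def put_blob(data: bytes) -> str:
    """Store file contents under a hash-derived key (an assumed convention)."""
    key = "blobs/" + hashlib.sha1(data).hexdigest()
    object_store[key] = data
    return key


def write_bundle_version(bundle_id: str, version: int, files: Dict[str, bytes]) -> str:
    """Write a manifest describing which blobs make up this bundle version."""
    manifest = {path: put_blob(data) for path, data in files.items()}
    key = f"bundles/{bundle_id}/versions/{version}.json"
    object_store[key] = json.dumps(manifest, sort_keys=True).encode()
    return key


def read_bundle_file(bundle_id: str, version: int, path: str) -> bytes:
    """Resolve a file path inside a specific bundle version."""
    manifest_key = f"bundles/{bundle_id}/versions/{version}.json"
    manifest = json.loads(object_store[manifest_key])
    return object_store[manifest[path]]


# The Unit's OLX and its assets live side by side in the same bundle.
write_bundle_version(
    "unit-1", 1, {"unit.xml": b"<vertical/>", "static/fig.png": b"PNG1"}
)
write_bundle_version(
    "unit-1", 2,
    {"unit.xml": b"<vertical display_name='v2'/>", "static/fig.png": b"PNG1"},
)
```

In this sketch the bundle is versioned as a whole, and an unchanged file such as `static/fig.png` is stored only once across versions.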

Differences: Modeling Sequences and Courses

Original Proposal

  • ContentSets are collections of Links that point to Units, Files, or other ContentSets.
  • Statically defined Sequences and Courses are composed using ContentSets.
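
A minimal sketch of how nested ContentSets might compose a Course in the Original Proposal is shown below. The `Unit` and `ContentSet` classes and the `flatten_units` helper are hypothetical names, Links are simplified to direct object references, and Files are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Unit:
    name: str


@dataclass
class ContentSet:
    """A named group of Links; here each Link is just a direct reference."""
    name: str
    links: List[Union["Unit", "ContentSet"]] = field(default_factory=list)


def flatten_units(node: Union[Unit, ContentSet]) -> List[str]:
    """Walk nested ContentSets depth-first and list Unit names in order."""
    if isinstance(node, Unit):
        return [node.name]
    names: List[str] = []
    for target in node.links:
        names.extend(flatten_units(target))
    return names


# A Course is a ContentSet of Sequences, which are ContentSets of Units.
seq1 = ContentSet("sequence-1", [Unit("intro"), Unit("quiz")])
seq2 = ContentSet("sequence-2", [Unit("lab")])
course = ContentSet("course", [seq1, seq2])
unit_names = flatten_units(course)
```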

Database Proposal

  • Sequences are out of scope – Blockstore's job is to provide fast access to the Units for a separate Compositor service.

File Proposal

  • A statically defined Sequence is modeled as a single ContentBundle, and versioned as a whole.
  • A Course would be a ContentBundle with a root OLX file defining the chapters and a set of Links to Sequences.

Links

  • Links are versioned in all proposals.
  • Conceptually like symlinks.

Differences: Scope of Usage

Original Proposal

  • Links are used to tie together Units and Files.
  • ContentSets tie together Units with each other, as well as with Files and other ContentSets.
  • Units, Files, and ContentSets are all considered "Linkables", and share a common interface that includes version history, tags, and draft status.
  • Links are stored in the database.

Database Proposal

  • Links are used to tie together Units and Files only.
  • Links are stored in the database.

File Proposal

  • Links are used a lot less, because Units and Sequences typically contain their own assets within the same ContentBundle.
  • A shallow, versionless representation of Links exists in the database for notification purposes, but full Link information is stored in the object store.
    • This is for scaling and performance reasons when dealing with large numbers of links and extended dependencies.
    • This makes it much harder to find out which things are using a specific Version of a given piece of content unless we index separately with something like Elasticsearch.
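
The trade-off in the File Proposal's Link storage can be sketched as follows. Both data structures are hypothetical stand-ins: `shallow_links` plays the role of the versionless database rows, and `full_links` plays the role of per-version Link data kept in the object store.

```python
# Database side: shallow, versionless Links (source bundle -> target bundle).
shallow_links = {
    ("unit-a", "video-lib"),
    ("unit-b", "video-lib"),
    ("unit-b", "pdf-pack"),
}

# Object store side: full Link data, stored with each bundle version.
# Maps (bundle, version) -> {target bundle: target version}.
full_links = {
    ("unit-a", 1): {"video-lib": 1},
    ("unit-a", 2): {"video-lib": 2},
    ("unit-b", 1): {"video-lib": 1, "pdf-pack": 4},
}


def bundles_to_notify(target):
    """Cheap database lookup: who should hear about a new version of target?"""
    return sorted(src for src, dst in shallow_links if dst == target)


def users_of_version(target, version):
    """Expensive: answering a version-specific question means scanning every
    stored Link file, which is why a separate index is suggested."""
    return sorted(
        source
        for source, targets in full_links.items()
        if targets.get(target) == version
    )
```

Notification only needs the small versionless table, while a version-specific usage question has to touch every bundle version in this sketch.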

Differences: Garbage Collection

Original and Database Proposals

  • Use Links in the database to garbage collect content that is outdated and is no longer being referenced.
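
Neither proposal spells out a garbage collection algorithm, so the following is only one possible reading: a mark-and-sweep pass that starts from the versions still in use (the `roots`, an assumed concept) and follows Links to find everything reachable. All names here are hypothetical.

```python
def find_garbage(all_versions, links, roots):
    """Return content versions that no live root reaches through Links."""
    reachable = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        # Follow Links from this version to the versions it references.
        stack.extend(links.get(node, ()))
    return set(all_versions) - reachable


# Each content version is an (id, version number) pair.
all_versions = {
    ("unit-a", 1), ("unit-a", 2),
    ("fig.png", 1), ("fig.png", 2),
}
links = {
    ("unit-a", 1): [("fig.png", 1)],
    ("unit-a", 2): [("fig.png", 2)],
}
roots = [("unit-a", 2)]  # assumed: only the latest Unit version is still live
garbage = find_garbage(all_versions, links, roots)
```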

File Proposal

  • Don't garbage collect.
    • Versioned OLX content is relatively small compared to the size of other assets stored in the object store.
    • It's not clear how we'd know what was being used in a multi-site distributed sharing arrangement.

Search & Tagging

None of the proposals really addresses this, but all of them assume an external system (either a plugin or a separate service) that uses Elasticsearch as a backend.

Collections

  • All proposals center ownership, permissions, and licensing around the concept of a Collection.
  • Each piece of content belongs to exactly one Collection.
  • Examples:
    • single course (possibly multiple runs)
    • problem bank
    • library of videos created by a video team

Differences: Collection Versioning

Original and Database Proposals

  • Collections point to versioned content, and are also versioned as a whole.

File Proposal

  • Collections point to versioned content, but the Collection itself is not versioned.

Neither of these stances is fundamental to the designs.

Questions for Discussion

  1. What are the use cases for Collection-level versioning?
    1. Licensing version ranges for the Collection.


Meeting Notes

Braden, on File-based:

  • Strengths and Weaknesses: Flexibility
    • more future-proof
    • Braden's concerns
      • Validation
        • DO: OLX validation has to happen in any approach. There is more flexibility in the file-based store though, so it might be harder.
        • BM: Includes and references as part of the XML
          • What depends on what specifically
            • Pulling out a problem from a Unit Bundle, what assets does that Bundle need?
      • Queryability
        • Does not capture fine grained dependencies (individual assets, units?)
          • Use case:
            • 20 PDFs that I link to.
            • If there are multiple Units that link to the same PDF, link bundle to each unit that needs it.
            • No way to know which PDFs are used where.
            • NA: Are we going to expose the concept of a Bundle to users?
            • BM: Solution: When you link to another Bundle, can specify list of file paths that you're using.
              • NA: Will need to layer this onto search functionality.
        • Keeping metadata in sync with S3 source of truth
          • Version information, extract names
          • BM: Would be useful to know what's in use and what's not.
            • Could track via S3, CDN logging
          • Use Case: Blockstore V2. Need to take on the horrible task of transforming all the data.
      • Doesn't offer many features
        • Course Structure (getting complete graph of course)
        • Versioning of Collections
          • Licensing changes use case?
          • Release Notes
        • Relies more heavily on search capability (since no DB query)
        • Can't do anything smart related to contents
          • BM:
            • Not sure if Blockstore is the right level for it, but XBlock data migrations.
            • OLX Validation in general. (plugin)
            • Having to do boilerplate to validate data integrity.
              • Does a file's contents match its name?
              • Does a bundle's contents match its declared type?
              • Does a reference to another bundle actually exist?
              • Does a version referenced actually exist?
  • Nimisha's questions:
    • Theoretically decoupled from the fact that it's OLX, or that it's a filesystem...?
    • Blockstore used for read/write/edit, but read-optimized is a separate store?
      • Yes, read-optimized LMS store is separate
      • PP: Read-optimized store supporting adaptive learning?
        • BM: Have something higher level on top of Blockstore that stores Course Outlines
    • Search for re-use?
      • Elasticsearch-backed, as a separate module in Blockstore or external to it.
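
The boilerplate integrity checks raised above under "Can't do anything smart related to contents" could be sketched roughly as follows. The bundle dict layout, the rule that a "unit" bundle must contain `unit.xml`, and the `known_versions` lookup are all assumptions made for illustration; nothing here was specified in the meeting.

```python
import xml.etree.ElementTree as ET


def validate_bundle(bundle, known_versions):
    """Return a list of integrity problems found in a hypothetical bundle dict."""
    errors = []

    # Does a file's contents match its name? (here: .xml must be well-formed)
    for path, data in bundle["files"].items():
        if path.endswith(".xml"):
            try:
                ET.fromstring(data)
            except ET.ParseError:
                errors.append(f"{path}: not well-formed XML")

    # Does a bundle's contents match its declared type? (assumed rule)
    if bundle["type"] == "unit" and "unit.xml" not in bundle["files"]:
        errors.append("unit bundle is missing unit.xml")

    # Do the referenced bundle and the referenced version actually exist?
    for target, version in bundle["links"].items():
        if target not in known_versions:
            errors.append(f"link to unknown bundle {target}")
        elif version not in known_versions[target]:
            errors.append(f"link to missing version {target}@{version}")

    return errors


known_versions = {"video-lib": {1, 2}}
good = {
    "type": "unit",
    "files": {"unit.xml": b"<vertical/>"},
    "links": {"video-lib": 2},
}
bad = {
    "type": "unit",
    "files": {"notes.xml": b"<vertical>"},
    "links": {"video-lib": 9, "ghost": 1},
}
good_errors = validate_bundle(good, known_versions)
bad_errors = validate_bundle(bad, known_versions)
```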


Decision: Move forward with the file-based proposal, but come back to and examine:

  • precursor files
  • extracting reusable problems from Units
  • static file references from within XBlocks, how those translate into LMS URLs
  • Search! Need to get to the bottom of this, use cases, requirements.
  • Garbage collection – is there a good way to do this?
  • Links simplification

Dave to summarize meeting decisions, consolidate wiki entries.

  • OEP 20
    • enumerate main decisions
    • fuller details can be in Confluence
  • Potentially separate OEPs for:
    • Blockstore
    • Layer on top of Blockstore / Mapping edx-platform content
    • Tagging
    • Search