Blockstore Proposal Comparison
The proposals:
- Harvard LabXchange Blockstore Proposal (Original)
- Blockstore Design (File) - We are moving forward with this proposal.
- Blockstore Implementation Proposal (Database)
The Blockstore proposals are fairly long, so this doc tries to summarize some of the principal differences.
Content Primitives
- Identified by UUID.
- Versioned numerically (1, 2, 3, etc.)
- Tagging metadata is stored outside of the core Blockstore.
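As a rough illustration of these shared primitives, the sketch below models a piece of content that is identified by a UUID and carries a numerically increasing version history. The names `ContentItem`, `item_id`, `add_version`, and `get_version` are hypothetical and do not come from any of the proposals.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class ContentItem:
    """Hypothetical content primitive: a UUID plus numbered versions."""

    item_id: UUID = field(default_factory=uuid4)
    versions: list = field(default_factory=list)  # versions[0] holds version 1

    def add_version(self, data: bytes) -> int:
        """Store a new snapshot and return its version number (1, 2, 3, ...)."""
        self.versions.append(data)
        return len(self.versions)

    def get_version(self, number: int) -> bytes:
        """Return the snapshot for a 1-based version number."""
        if not 1 <= number <= len(self.versions):
            raise KeyError(f"no version {number}")
        return self.versions[number - 1]
```

Tags are deliberately absent from the class, mirroring the point that tagging metadata lives outside the core Blockstore.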
Differences: Granularity and Versioning
Original Proposal and Database Proposal
- Files/Assets are tracked individually.
- Units are tracked individually.
Original Proposal
- In addition to per-Unit and per-File tracking, ContentSets (a group of Links) are also versioned.
File Proposal
- ContentBundles are versioned as a whole, not individual assets inside them.
- Depending on the intended usage, a ContentBundle could be a single video, a Unit, or an entire Sequence.
Differences: OLX vs. Assets
Original and Database Proposals
- Separate models for Unit (i.e. a Studio Unit) and Files/Assets (e.g. images, PDFs, video files)
- Unit OLX content is stored in the database.
- Files/Assets live in an object store like S3, and are pointed to by rows in the database.
- All metadata about Units and Assets is stored in the database.
- Assets used by a Unit are tied together in the database using Links.
- Advantages
- Access to OLX has better latency guarantees, particularly for multi-gets.
- Transactions make it easier to guarantee atomic operations involving many Units/Links/etc.
- Able to track usage at a fine granularity (e.g. what are all the places this exact version of this image is used?) without requiring external indexing like Elasticsearch.
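To make the fine-grained usage tracking concrete, here is a minimal sketch of the kind of query a database-backed Link table would allow. The `link` table and its columns are assumptions made for illustration, not a schema taken from either proposal, and SQLite stands in for whatever database would actually be used.

```python
import sqlite3

# In-memory database with a hypothetical Link table: each row says that a
# specific version of a Unit uses a specific version of an asset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link ("
    "unit_uuid TEXT, unit_version INTEGER, "
    "asset_uuid TEXT, asset_version INTEGER)"
)
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?, ?)",
    [
        ("unit-a", 1, "image-x", 1),
        ("unit-a", 2, "image-x", 2),
        ("unit-b", 1, "image-x", 2),
    ],
)


def units_using(asset_uuid, asset_version):
    """Return every (unit_uuid, unit_version) that uses this exact asset version."""
    rows = conn.execute(
        "SELECT unit_uuid, unit_version FROM link "
        "WHERE asset_uuid = ? AND asset_version = ? "
        "ORDER BY unit_uuid, unit_version",
        (asset_uuid, asset_version),
    )
    return rows.fetchall()


print(units_using("image-x", 2))
```

With the sample rows above, the query for version 2 of `image-x` returns `unit-a` at version 2 and `unit-b` at version 1, with no external index involved.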
File Proposal
- All content is stored in ContentBundles, each of which is like a small directory of files.
- The OLX for a Unit would go into an XML file in a ContentBundle.
- All Bundle content is stored in an S3-like object store.
- Metadata about what content constitutes a particular version is in the object store, not the database.
- Assets used by the Unit would go into the same ContentBundle.
- Advantages
- Units are more self-contained.
- Easier to adapt for use cases outside of Open edX, since ContentBundles don't assume an OLX/Assets divide.
- Easier to associate bundles of related Assets, like a Video's various encodings, subtitles, thumbnails, etc.
- Cheaper storage.
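One way to picture version metadata living in the object store is a per-version manifest that maps file paths to content hashes. The sketch below is only an assumption about how that could look: the key layout (`data/...`, `bundles/.../versions/N.json`), the use of SHA-256, and the function names are all hypothetical, and a plain dict stands in for an S3-like store.

```python
import hashlib
import json

object_store = {}  # stand-in for an S3-like object store: key -> bytes


def put_blob(data: bytes) -> str:
    """Store file contents under their hash so identical files are stored once."""
    digest = hashlib.sha256(data).hexdigest()
    object_store[f"data/{digest}"] = data
    return digest


def write_snapshot(bundle_uuid: str, version: int, files: dict) -> str:
    """Write a manifest (path -> content hash) for one whole-bundle version."""
    manifest = {path: put_blob(data) for path, data in files.items()}
    key = f"bundles/{bundle_uuid}/versions/{version}.json"
    object_store[key] = json.dumps(manifest, sort_keys=True).encode()
    return key


def read_file(bundle_uuid: str, version: int, path: str) -> bytes:
    """Look up a file as it existed in a given bundle version."""
    key = f"bundles/{bundle_uuid}/versions/{version}.json"
    manifest = json.loads(object_store[key])
    return object_store[f"data/{manifest[path]}"]


# Version 2 changes the OLX but keeps the same image.
write_snapshot("bundle-1", 1, {"unit.xml": b"<vertical/>", "img.png": b"PNG"})
write_snapshot("bundle-1", 2, {"unit.xml": b"<vertical a='1'/>", "img.png": b"PNG"})
```

In this sketch the bundle is versioned as a whole, while the unchanged image is stored only once because its hash is the same in both manifests.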
Differences: Modeling Sequences and Courses
Original Proposal
- ContentSets are collections of Links that point to Units, Files, or other ContentSets.
- Statically defined Sequences and Courses are composed using ContentSets.
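The nesting of ContentSets can be sketched as a small tree walk. The classes below are hypothetical simplifications: Links are reduced to direct Python references, Files are left out, and versioning is ignored, so this only illustrates how a Course could be composed from Sequences that are themselves ContentSets.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str


@dataclass
class ContentSet:
    name: str
    links: list = field(default_factory=list)  # Units or nested ContentSets


def flatten_units(node):
    """Return the names of all Units reachable from a node, in link order."""
    if isinstance(node, Unit):
        return [node.name]
    names = []
    for target in node.links:
        names.extend(flatten_units(target))
    return names


course = ContentSet(
    "course",
    [
        ContentSet("sequence-1", [Unit("unit-1"), Unit("unit-2")]),
        ContentSet("sequence-2", [Unit("unit-3")]),
    ],
)
print(flatten_units(course))
```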
Database Proposal
- Sequences are out of scope – Blockstore's job is to provide fast access to the Units for a separate Compositor service.
File Proposal
- A statically defined Sequence is modeled as a single ContentBundle, and versioned as a whole.
- A Course would be a ContentBundle with a root OLX file defining the chapters and a set of Links to Sequences.
Links
- Links are versioned in all proposals.
- Conceptually like symlinks.
Differences: Scope of Usage
Original Proposal
- Links are used to tie together Units and Files.
- ContentSets tie together Units with each other, as well as with Files and other ContentSets.
- Units, Files, and ContentSets are all considered "Linkables", and share a common interface that includes version history, tags, and draft status.
- Links are stored in the database.
Database Proposal
- Links are used to tie together Units and Files only.
- Links are stored in the database.
File Proposal
- Links are used a lot less, because Units and Sequences typically contain their own assets within the same ContentBundle.
- A shallow, versionless representation of Links exists in the database for notification purposes, but full Link information is stored in the object store.
- This is for scaling and performance reasons when dealing with large numbers of links and extended dependencies.
- This makes it much harder to find out which things are using a specific Version of a given piece of content unless we index separately with something like Elasticsearch.
Differences: Garbage Collection
Original and Database Proposals
- Use Links in the database to garbage collect content that is outdated and is no longer being referenced.
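A minimal sketch of what Link-based garbage collection could look like, assuming that "outdated" means "not the latest version" and that Links can be read as (content ID, version) references. Both assumptions and the function name `collectable` are illustrative only; the proposals do not spell out the algorithm.

```python
def collectable(stored, latest, links):
    """Return (content_id, version) pairs that are outdated and unreferenced.

    stored: iterable of (content_id, version) pairs that currently exist
    latest: dict mapping content_id to its latest version number
    links:  iterable of (content_id, version) pairs referenced by some Link
    """
    referenced = set(links)
    return sorted(
        (content_id, version)
        for content_id, version in stored
        if version != latest[content_id]
        and (content_id, version) not in referenced
    )


stored = [("a", 1), ("a", 2), ("a", 3), ("b", 1)]
latest = {"a": 3, "b": 1}
links = [("a", 1)]  # some Unit still points at version 1 of "a"
print(collectable(stored, latest, links))
```

Here only version 2 of `a` is eligible: version 3 is current, version 1 is still referenced, and `b` has only its latest version.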
File Proposal
- Don't garbage collect.
- Versioned OLX content is relatively small compared to the size of other assets stored in the object store.
- It's not clear how we'd know what was being used in a multi-site distributed sharing arrangement.
Search & Tagging
None of the proposals really addresses this, but all of them assume that there will be an external system (either a plugin or separate service) that uses Elasticsearch as a backend.
Collections
- All proposals center ownership, permissions, and licensing around the concept of a Collection.
- Each piece of content belongs to exactly one Collection.
- Examples:
- single course (possibly multiple runs)
- problem bank
- library of videos created by a video team
Differences: Collection Versioning
Original and Database Proposals
- Collections point to versioned content, and are also versioned as a whole.
File Proposal
- Collections point to versioned content, but the Collection itself is not versioned.
Neither of these stances is fundamental to the designs.
Questions for Discussion
- What are the use cases for Collection-level versioning?
- Licensing version ranges for the Collection.
Meeting Notes
Braden, on File-based:
- Strengths and Weaknesses: Flexibility
- more future-proof
- Braden concerns
- Validation
- DO: OLX validation has to happen in any approach. There is more flexibility in the file-based store though, so it might be harder.
- BM: Includes and references as part of the XML
- What depends on what specifically
- Pulling out a problem from a Unit Bundle, what assets does that Bundle need?
- Queryability
- Does not capture fine-grained dependencies (individual assets, units?)
- Use case:
- 20 PDFs that I link to.
- If there are multiple Units that link to the same PDF, link bundle to each unit that needs it.
- No way to know which PDFs are used where.
- NA: Are we going to expose the concept of a Bundle to users?
- BM: Solution: When you link to another Bundle, can specify list of file paths that you're using.
- NA: Will need to layer this onto search functionality.
- Use case:
- Keeping metadata in sync with S3 source of truth
- Version information, extract names
- BM: Would be useful to know what's in use and what's not.
- Could track via S3, CDN logging
- Use Case: Blockstore V2. Would need to take on the horrible task of transforming all the data.
- Doesn't offer many features
- Course Structure (getting complete graph of course)
- Versioning of Collections
- Licensing changes use case?
- Release Notes
- Relies more heavily on search capability (since no DB query)
- Can't do anything smart related to contents
- BM:
- Not sure if Blockstore is the right level for it, but XBlock data migrations.
- OLX Validation in general. (plugin)
- Having to do boilerplate to validate data integrity.
- Does a file's contents match its name?
- Does a bundle's contents match its declared type?
- Does a reference to another bundle actually exist?
- Does a version referenced actually exist?
- BM:
- Validation
- Nimisha questions:
- Theoretically decoupled from the fact that it's OLX, or that it's a filesystem...?
- Blockstore used for read/write/edit, but read-optimized is a separate store?
- Yes, read-optimized LMS store is separate
- PP: Read-optimized store supporting adaptive learning?
- BM: Have something higher level on top of Blockstore that stores Course Outlines
- Search for re-use?
- ES backed, separate module in Blockstore or external to it.
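Two of the data-integrity questions raised in the notes above (does a referenced bundle exist, and does the referenced version exist) can be sketched as a simple check. The data shapes and the function name `check_links` are hypothetical and only illustrate the kind of boilerplate validation that was discussed.

```python
def check_links(bundles, links):
    """Report references to missing bundles or missing versions.

    bundles: dict mapping bundle_uuid to the set of version numbers that exist
    links:   list of (bundle_uuid, version) references to validate
    """
    errors = []
    for bundle_uuid, version in links:
        if bundle_uuid not in bundles:
            errors.append(f"{bundle_uuid}: bundle does not exist")
        elif version not in bundles[bundle_uuid]:
            errors.append(f"{bundle_uuid}@{version}: version does not exist")
    return errors


bundles = {"video-1": {1, 2}, "pdfs": {1}}
links = [("video-1", 2), ("pdfs", 3), ("missing", 1)]
for error in check_links(bundles, links):
    print(error)
```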
Decision: Moving forward with the file-based proposal, but come back to examine:
- precursor files
- extracting reusable problems from Units
- static file references from within XBlocks, how those translate into LMS URLs
- Search! Need to get to the bottom of this, use cases, requirements.
- Garbage collection – is there a good way to do this?
- Links simplification
Dave to summarize meeting decisions, consolidate wiki entries.
- OEP 20
- enumerate main decisions
- fuller details can be in Confluence
- Potentially separate OEPs for:
- Blockstore
- Layer on top of Blockstore / Mapping edx-platform content
- Tagging
- Search