Objective

Document a detailed final design for storing and retrieving course assets. The final design will inform the phased-in steps along the way, so it's important that we agree on it. The initial design is located here: GridFS Replacement

Asset Write (App Upload) Flow

Diagram

Description

  1. Course author uploads a course asset via Studio. The application code calls the "Asset Manager" (AssetMgr) - this name replaces the Contentstore Re-direct (CSR) name we've been using until now.
    1. Params:
      1. CourseKey
      2. actual asset bytes
      3. uploaded filepath (filename + path to file)
      4. locked/unlocked attribute
      5. possibly other attributes
  2. The AssetMgr looks up the StorageID used for the course in the modulestore.
    1. The CourseKey is used to look up the StorageID.
    2. It's stored with all the assets in the relevant modulestore.
  3. The AssetMgr calls the BlobStore API call to store an asset.
    1. store_blob - API method that can be used platform-wide to store blobs
    2. Params:
      1. StorageID used for course
      2. actual blob bytes
      3. file pathname
      4. additional blob attributes
  4. Using the StorageID, the proper storage impl's store_blob() is called.
    1. At this point, the particular storage impl takes over.
    2. It must:
      1. Save the file bytes or identify an existing copy of the file.
      2. Send back a BlobLocator that can be used to find that file in the relevant storage.
        1. The StorageID should be introspectable from the BlobLocator.
    3. It can (optionally):
      1. Send back a version number for the saved file.
      2. Save introspection data of some sort along with the file (if the storage supports it).
      3. Send back whatever fields are needed/desired to be saved along with the asset metadata.
  5. The asset file locator (BlobLocator) is sent back to the BlobStore API.
    1. BlobLocators will provide common properties:
      1. storage: StorageID
      2. version: string which can be something relevant or None or empty
      3. filepath: string identifying path to file, including filename
      4. (maybe) hash: string representing file hash
      5. (maybe) hash_type: string representing hash algorithm used to generate hash
      6. other_props: dict of properties to save with metadata and send back whenever accessing asset
  6. The BlobLocator is returned from the BlobStore API back to the AssetMgr.
    1. If the file save was unsuccessful, end here and return error result to application.
    2. NOTE: The BlobLocator can be used to retrieve the asset at this point.
      1. The BlobStore API works independently of the rest of this flow.
      2. This point is important! It'll enable other parts of the application to store other course-related blobs, such as ORA data.
  7. The AssetMetadata is sent to the Mixed Modulestore.
    1. An AssetMetadata object contains a BlobLocator - the one sent back from the BlobStore API.
    2. But it also contains the CourseKey and other attributes sent from the application.
  8. The Mixed Modulestore knows which modulestore is used for the course.
    1. It forwards the AssetMetadata on to the appropriate modulestore.
    2. The modulestore saves the asset metadata in the appropriate location.
      1. This is the subject of my pending PR: https://github.com/edx/edx-platform/pull/4854
    3. If asset metadata is already stored for the particular asset, it's updated appropriately.
  9. The result of saving the asset metadata is returned to the Mixed Modulestore.
    1. It's possible that XML-backed courses will always return an error here?
  10. Mixed modulestore returns the result back to the AssetMgr.
    1. If the asset was stored successfully but the metadata save failed, perhaps delete the stored asset.
  11. The total asset storage result is sent back to the calling application.
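
The write flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: `InMemoryStorage`, the `delete_blob` cleanup call, the `save_metadata` callable standing in for the Mixed Modulestore, and the use of SHA-256 for the optional hash are all hypothetical. Only the `BlobLocator` property names and the `store_blob` parameters come from the description.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BlobLocator:
    """Common locator properties listed in step 5."""
    storage: str                     # StorageID
    filepath: str                    # path to file, including filename
    version: Optional[str] = None    # something relevant, or None/empty
    hash: Optional[str] = None       # "(maybe)" in the design
    hash_type: Optional[str] = None  # "(maybe)" in the design
    other_props: dict = field(default_factory=dict)


class InMemoryStorage:
    """Hypothetical storage impl; a real one would wrap GridFS, S3, etc."""

    def __init__(self, storage_id):
        self.storage_id = storage_id
        self.blobs = {}

    def store_blob(self, blob_bytes, filepath, **attrs):
        self.blobs[filepath] = blob_bytes
        return BlobLocator(
            storage=self.storage_id,
            filepath=filepath,
            hash=hashlib.sha256(blob_bytes).hexdigest(),
            hash_type="sha256",
            other_props=dict(attrs),
        )

    def delete_blob(self, locator):  # hypothetical; not part of the design
        self.blobs.pop(locator.filepath, None)


class BlobStore:
    """Routes each call to the storage impl named by the StorageID (steps 3-6)."""

    def __init__(self, impls):
        self.impls = impls  # StorageID -> storage impl

    def store_blob(self, storage_id, blob_bytes, filepath, **attrs):
        return self.impls[storage_id].store_blob(blob_bytes, filepath, **attrs)

    def delete_blob(self, locator):
        self.impls[locator.storage].delete_blob(locator)


class AssetMgr:
    def __init__(self, blobstore, course_storage, save_metadata):
        self.blobstore = blobstore
        self.course_storage = course_storage  # CourseKey -> StorageID (step 2)
        self.save_metadata = save_metadata    # stand-in for the Mixed Modulestore

    def upload(self, course_key, blob_bytes, filepath, locked=False):
        storage_id = self.course_storage[course_key]
        locator = self.blobstore.store_blob(storage_id, blob_bytes, filepath)
        metadata = {"course_key": course_key, "blob_key": locator, "locked": locked}
        try:
            self.save_metadata(metadata)      # steps 7-10
        except Exception:
            # Step 10.1 says "perhaps delete the stored asset"; this sketch does.
            self.blobstore.delete_blob(locator)
            raise
        return locator
```

Because the locator is returned before any metadata is saved, a caller that only needs blob storage (step 6.2) could use `BlobStore.store_blob` directly and skip the `AssetMgr` entirely.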

Asset Read Flow

Diagram

Description

  1. Student requests a course asset. Application calls the AssetMgr with an AssetKey.
  2. AssetKey is sent to the Mixed Modulestore.
    1. It knows the modulestore that's used for the asset's course.
  3. Search for the AssetMetadata in the appropriate modulestore.
    1. Also the subject (for old Mongo) of my pending PR: https://github.com/edx/edx-platform/pull/4854
  4. Return the AssetMetadata to the Mixed Modulestore.
  5. Return the AssetMetadata to the AssetMgr.
    1. Check the asset's attributes in the metadata.
    2. If an asset is locked, the AssetMgr checks the user access permissions to see if access is allowed.
      1. If access is not allowed, return a 404.
    3. Extract the blob_key (BlobLocator) from the asset metadata.
    4. If no asset metadata was found for the requested asset:
      1. Form a BlobLocator containing a SON object pointing to the GridFS-stored asset.
        1. This step yields backwards-compatibility with the contentstore-stored assets.
  6. AssetMgr calls the BlobStore API get_asset(blob_key, ttl_seconds) call.
    1. The blob_key allows introspection of the relevant storage impl.
    2. The ttl_seconds param is relevant when the asset is locked.
      1. That's how the locked attribute is communicated forward.
  7. BlobStore API calls the relevant storage impl.
    1. It knows which storage impl should have the asset via the StorageID in the BlobLocator.
  8. The storage impl does one of two things:
    1. Generates a URL which can be used to access the course asset.
      1. This can be an internal c4x URL or an external AWS URL (or a shortened "pretty" URL).
      2. Blobs stored externally (like S3) or accessible via the webserver (like filesystem) have URLs returned - not the actual blobs.
    2. Returns the actual blob bytes.
      1. Blobs stored internally with no external access allowed have bytes returned - like GridFS.
  9. BlobStore API returns URL or bytes to the AssetMgr.
  10. AssetMgr returns result to application.
    1. If URL, AssetMgr redirects the student to the returned URL.
    2. If bytes, then done.
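
The read flow above, including the lock check and the GridFS fallback, might look roughly like the sketch below. The `ReadResult` type, the `StubBlobStore` class, the `find_metadata` and `can_access` callables, the example URL, and the 300-second TTL are assumptions made for illustration. The description does not say how locked legacy assets without metadata are handled, so the sketch does not check them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadResult:
    status: int
    url: Optional[str] = None
    body: Optional[bytes] = None


class StubBlobStore:
    """Hypothetical BlobStore: "s3" blobs yield URLs, everything else bytes."""

    def __init__(self, blobs):
        self.blobs = blobs  # filepath -> bytes

    def get_asset(self, blob_key, ttl_seconds=None):
        if blob_key["storage"] == "s3":
            url = "https://assets.example.com/" + blob_key["filepath"]
            return url if ttl_seconds is None else f"{url}?ttl={ttl_seconds}"
        return self.blobs[blob_key["filepath"]]


def read_asset(asset_key, user, find_metadata, blobstore, can_access, locked_ttl=300):
    metadata = find_metadata(asset_key)              # steps 2-5
    if metadata is None:
        # Step 5.4: fall back to a locator for the GridFS-stored asset.
        blob_key = {"storage": "gridfs", "filepath": asset_key}
        ttl = None
    else:
        if metadata["locked"] and not can_access(user, asset_key):
            return ReadResult(status=404)            # step 5.2.1
        blob_key = metadata["blob_key"]              # step 5.3
        ttl = locked_ttl if metadata["locked"] else None
    result = blobstore.get_asset(blob_key, ttl)      # steps 6-9
    if isinstance(result, bytes):
        return ReadResult(status=200, body=result)   # step 10.2
    return ReadResult(status=302, url=result)        # step 10.1: redirect
```

Passing a TTL only for locked assets mirrors step 6.2, where `ttl_seconds` is the way the locked attribute is communicated to the storage impl.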

Other Course Asset Metadata Storage

The modulestore API used to store course asset metadata has been written to allow other systems to store asset metadata of other types, such as video. Those systems would not use the AssetMgr but would instead access the modulestore directly via the Mixed Modulestore, as shown below. However, systems that store asset metadata this way are responsible for their own storage & tracking, and for handling access authentication if locked assets are desired.

BlobStore API

store_blob

Method which stores the bytes of a blob and returns a BlobLocator that allows the blob to be retrieved later.

Parameters:

Returns:

append_to_blob

Method which appends more bytes to an already-existing blob.

Parameters:

Returns:

get_blob

Method which retrieves a blob's bytes.

Parameters:

Returns:

get_blob_url

Method which retrieves a blob's URL which can be used to directly retrieve the blob.

Parameters:

Returns:
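
Since the parameter and return lists above are still open, the following is only one possible shape for a pluggable storage impl interface covering the four methods. Every signature here is an assumption: the `Locator` tuple is a trimmed-down stand-in for the BlobLocator, and `DictStorage` is a hypothetical in-memory impl that returns no URL because it has no external access.

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional


class Locator(NamedTuple):
    """Trimmed-down stand-in for a BlobLocator (assumed fields)."""
    storage: str
    filepath: str


class BlobStorage(ABC):
    """Assumed interface that each storage impl would provide."""

    @abstractmethod
    def store_blob(self, blob_bytes: bytes, filepath: str, **attrs) -> Locator:
        """Store the bytes and return a locator for later retrieval."""

    @abstractmethod
    def append_to_blob(self, locator: Locator, more_bytes: bytes) -> Locator:
        """Append bytes to an already-existing blob."""

    @abstractmethod
    def get_blob(self, locator: Locator) -> bytes:
        """Return the blob's bytes."""

    @abstractmethod
    def get_blob_url(self, locator: Locator,
                     ttl_seconds: Optional[int] = None) -> Optional[str]:
        """Return a URL for direct retrieval, or None if unsupported."""


class DictStorage(BlobStorage):
    """Hypothetical impl that keeps blobs in a dict."""
    storage_id = "dict"

    def __init__(self):
        self._blobs = {}

    def store_blob(self, blob_bytes, filepath, **attrs):
        self._blobs[filepath] = bytes(blob_bytes)
        return Locator(self.storage_id, filepath)

    def append_to_blob(self, locator, more_bytes):
        self._blobs[locator.filepath] += more_bytes
        return locator

    def get_blob(self, locator):
        return self._blobs[locator.filepath]

    def get_blob_url(self, locator, ttl_seconds=None):
        return None  # no external access, so callers must use get_blob()
```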

Questions To Resolve:

Work Progression

Transition Plan

Initially all assets are in GridFS - no asset metadata exists in the modulestore.

Phase One:

Phase Two:

Phase Three:

The largest leap. BlobStore API is complete and GridFS storage implementation is functional. Asset reads/writes work as follows:

Asset Reads

Asset reads will read the assets from GridFS exactly as they are stored today. No conversion (active or lazy) or import/export needs to take place.

  1. AssetMgr attempts to find the asset in the asset metadata.
  2. When not found, the AssetMgr forms a BlobLocator itself.
    1. Has a StorageID of "gridfs"
    2. Storage-impl-only details have a SON object (or the info needed to construct a SON object).
  3. Constructed BlobLocator is sent over to BlobStore API in get_blob().
  4. BlobStore API routes the BlobLocator and get_blob() call to the GridFS pluggable impl.
  5. The GridFS impl uses the SON object to access the blob bytes.
  6. Returns the bytes.

Asset Writes

Asset writes will function just as they eventually will (see diagram above). However, new assets stored in GridFS will pass back a BlobLocator with a MongoDB document ObjectId instead of a SON object. The GridFS storage impl needs to be able to query blobs using both.
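
To illustrate the dual-lookup requirement, here is a minimal sketch that uses a dict-backed stand-in instead of a real GridFS connection. `FakeGridFS`, the `impl_details` layout, and the key fields in the usage example are hypothetical. The only point being shown is that one `get` path can serve both SON-keyed legacy blobs and ObjectId-keyed new blobs.

```python
from collections import OrderedDict


class FakeGridFS:
    """Dict-backed stand-in for a GridFS bucket keyed by _id."""

    def __init__(self):
        self._files = {}

    @staticmethod
    def _freeze(file_id):
        # Ordered key/value pairs stand in for a SON _id; order is kept
        # because two SON keys with different field order are different keys.
        if isinstance(file_id, dict):
            return tuple(file_id.items())
        return file_id

    def put(self, data, _id):
        self._files[self._freeze(_id)] = data

    def get(self, file_id):
        return self._files[self._freeze(file_id)]


def gridfs_get_blob(fs, locator):
    """Fetch bytes for either style of GridFS locator."""
    details = locator["impl_details"]
    if "object_id" in details:       # new assets written via the BlobStore
        return fs.get(details["object_id"])
    if "son" in details:             # legacy contentstore assets
        return fs.get(OrderedDict(details["son"]))
    raise KeyError("locator has neither an ObjectId nor a SON key")
```

A possible usage, with made-up key values:

```python
fs = FakeGridFS()
fs.put(b"legacy", _id=OrderedDict([("tag", "c4x"), ("name", "pic1.jpg")]))
fs.put(b"new", _id="54f0e5aa1d41c8")
old = {"storage": "gridfs", "impl_details": {"son": [("tag", "c4x"), ("name", "pic1.jpg")]}}
new = {"storage": "gridfs", "impl_details": {"object_id": "54f0e5aa1d41c8"}}
gridfs_get_blob(fs, old), gridfs_get_blob(fs, new)
```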

Studio Asset Listing

Asset metadata will now be stored in two different locations - contentstore and modulestore. When a list of all course assets is needed, as in the Files & Uploads view in Studio, asset metadata will need to be pulled from both places, combined, sorted, and paged through.
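
A minimal sketch of the combine/sort/page step follows. It assumes each metadata record is a dict with an `asset_id` and a sortable field such as `edited_on`, and that a modulestore record should win when the same asset appears in both places. Neither assumption is settled by this design.

```python
def list_course_assets(modulestore_assets, contentstore_assets,
                       sort_key="edited_on", descending=True,
                       start=0, page_size=50):
    """Return (one page of combined asset metadata, total asset count)."""
    combined = {}
    for asset in contentstore_assets:
        combined[asset["asset_id"]] = asset
    for asset in modulestore_assets:
        # Assumption: the modulestore copy replaces a contentstore duplicate.
        combined[asset["asset_id"]] = asset
    ordered = sorted(combined.values(),
                     key=lambda asset: asset[sort_key],
                     reverse=descending)
    return ordered[start:start + page_size], len(ordered)
```

Sorting after the merge means both sources are read in full for every page, which may matter for courses with many assets.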

Course Import/Export

Course import will cause all asset metadata to be stored in the modulestore only, with the actual bytes stored in the BlobStore (GridFS storage impl) - no assets will remain in the contentstore after import. 

Course export will need to pull assets from both the modulestore/BlobStore and the contentstore for storage in the course tarball.

Course Reruns

As in course imports, course reruns will cause all asset metadata to be stored in the modulestore only, with the actual bytes stored in the BlobStore (GridFS storage impl) - no assets will remain in the contentstore after a rerun.

Phase Four:

S3 storage implementation is functional. Global and per-course storage impl preference is implemented.

Asset Reads

Asset reads function as above, but the assets might now also be stored in S3.

Asset Writes

Asset writes can be configured to go to S3 (per-course or globally).
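
One way the per-course and global preference could be resolved is sketched below. The function name, the shape of the preference mapping, and the course key strings in the usage example are assumptions for illustration only.

```python
def resolve_storage_id(course_key, per_course, global_default, available):
    """Pick the StorageID for a course: per-course setting first, then global."""
    storage_id = per_course.get(course_key, global_default)
    if storage_id not in available:
        raise ValueError(f"unknown storage impl: {storage_id!r}")
    return storage_id
```

For example, with `per_course = {"course-v1:edX+Demo+2014": "s3"}` and a global default of `"gridfs"`, that course would write to S3 while every other course would keep writing to GridFS.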

Studio Asset Listing

The asset listing will function as in Phase 3 above - looking for assets in both the modulestore/BlobStore and the contentstore.

Course Import/Export

Course import will cause all asset metadata to be stored in the modulestore only, with the actual bytes in the BlobStore GridFS or S3 storage impls - whichever one is configured as the course's default storage impl. No assets will remain in the contentstore after import.

Course export will function as in Phase 3 - pulling from both the modulestore/BlobStore and the contentstore.

Course Reruns

As in course imports, course reruns will cause all asset metadata to be stored in the modulestore only, with the actual bytes in the BlobStore GridFS or S3 storage impls - whichever one is configured as the course's default storage impl. No assets will remain in the contentstore after a rerun.

Asset Metadata in Course XML

One implication of the course asset metadata moving from the contentstore to the modulestore is that the format used to store the asset metadata in an exported course will change. Currently, all the asset metadata in the contentstore is exported to an assets.json file and stored in the top-level "policies" directory of an exported course. The modulestore-stored asset metadata will no longer export to that location. Instead, it'll be exported to XML directly. An exported course will potentially have asset metadata in both XML and in the assets.json file - course import will need to read from both locations. The assets themselves will remain in the top-level "static" directory of an exported course.

Asset XML Format

The XML format of the asset metadata will be formalized by an XML schema definition (XSD), which follows:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="assets" type="assetListType" />

<xs:simpleType name="stringType">
  <xs:restriction base="xs:string"/>
</xs:simpleType>

<xs:simpleType name="userIdType">
  <xs:restriction base="xs:nonNegativeInteger"/>
</xs:simpleType>

<xs:simpleType name="datetimeType">
  <xs:restriction base="xs:dateTime"/>
</xs:simpleType>

<xs:simpleType name="boolType">
  <xs:restriction base="xs:boolean"/>
</xs:simpleType>

<xs:complexType name="assetListType">
  <xs:sequence>
    <xs:element name="asset" type="assetType" minOccurs="0" maxOccurs="unbounded" />
  </xs:sequence>
</xs:complexType>

<xs:complexType name="assetType">
  <xs:all>
    <xs:element name="asset_id" type="stringType"/>
    <xs:element name="contenttype" type="stringType"/>
    <xs:element name="basename" type="stringType"/>
    <xs:element name="internal_name" type="stringType"/>
    <xs:element name="locked" type="boolType"/>
    <xs:element name="thumbnail" type="stringType" minOccurs="0"/>
    <xs:element name="created_on" type="datetimeType" />
    <xs:element name="created_by" type="userIdType" />
    <xs:element name="created_by_email" type="stringType" minOccurs="0"/>
    <xs:element name="edited_on" type="datetimeType" />
    <xs:element name="edited_by" type="userIdType" />
    <xs:element name="edited_by_email" type="stringType" minOccurs="0"/>
    <xs:element name="prev_version" type="stringType"/>
    <xs:element name="curr_version" type="stringType"/>
    <xs:element name="fields" type="stringType" minOccurs="0"/>
  </xs:all>
</xs:complexType>

</xs:schema>

The XSD above formalizes the XML structure shown below:

<assets>
    <asset>
        <asset_id>AssetKey("pic1.jpg")</asset_id>
        <contenttype>None</contenttype>
        <curr_version>14</curr_version>
        <basename>pix/archive</basename>
        <edited_on>2014-11-17T21:02:38</edited_on>
        <created_on>2014-11-17T21:02:38</created_on>
        <created_by>14</created_by>
        <prev_version>13</prev_version>
        <edited_by>14</edited_by>
        <internal_name>EKMND332DDBK</internal_name>
        <fields>{"size": 8686848, "copyrighted": 0}</fields>
        <locked>false</locked>
        <thumbnail>None</thumbnail>
    </asset>
...
</assets>
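
As a rough sketch of how course import might read this structure, the snippet below parses the XML with Python's standard library and converts the typed elements named in the XSD. The `parse_assets` function and the choice to decode `fields` as JSON are assumptions based on the sample above. It does not validate against the XSD, and `datetime.fromisoformat` only accepts timezone suffixes such as "Z" on newer Python versions.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE = """<assets>
    <asset>
        <asset_id>AssetKey("pic1.jpg")</asset_id>
        <contenttype>None</contenttype>
        <curr_version>14</curr_version>
        <basename>pix/archive</basename>
        <edited_on>2014-11-17T21:02:38</edited_on>
        <created_on>2014-11-17T21:02:38</created_on>
        <created_by>14</created_by>
        <prev_version>13</prev_version>
        <edited_by>14</edited_by>
        <internal_name>EKMND332DDBK</internal_name>
        <fields>{"size": 8686848, "copyrighted": 0}</fields>
        <locked>false</locked>
        <thumbnail>None</thumbnail>
    </asset>
</assets>"""


def parse_assets(xml_text):
    """Return one dict per <asset>, with XSD-typed elements converted."""
    assets = []
    for node in ET.fromstring(xml_text).findall("asset"):
        asset = {child.tag: child.text for child in node}
        # xs:boolean allows "true"/"false" and "1"/"0".
        asset["locked"] = asset["locked"] in ("true", "1")
        for name in ("created_by", "edited_by"):
            asset[name] = int(asset[name])
        for name in ("created_on", "edited_on"):
            asset[name] = datetime.fromisoformat(asset[name])
        if asset.get("fields"):
            # Assumption: "fields" holds a JSON object, as in the sample.
            asset["fields"] = json.loads(asset["fields"])
        assets.append(asset)
    return assets
```

Note that values such as `contenttype` and `thumbnail` come back as the literal string "None" with this sample, so an importer would need to decide how to treat that placeholder.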

 

Further Notes