GridFS Replacement

Background

Open edX allows course authors to upload assets that are then served to students during the course. These assets can be nearly anything - text files, PDFs, audio/video clips, tarballs, etc. The edx-platform currently takes these uploads and stores them in the "contentstore". The contentstore is a MongoDB-backed GridFS instance. GridFS stores file data in chunks within a MongoDB instance, can store files greater than 16 MB in size, and can serve portions of large files instead of the whole file. An asset can be uploaded as "locked" or "unlocked". A locked asset is only available to students taking a particular course - the edx-platform checks the user's role before serving the file. Unlocked assets are served to any user upon request. When a student in a course requests an asset, the entire asset is served from GridFS.

However, since these assets are served by edx.org webservers, if many students using low bandwidth connections attempt asset downloads, a large number of available connections are used up for a long period of time by the asset downloads. This situation leaves only a limited number of remaining connections for students taking courses - and has caused the edx.org website to become unresponsive in the past. The workaround for this situation has been to move frequently-requested, mostly large assets out of GridFS and instead serve them from external servers via URLs. This workaround causes locked assets to become unlocked and creates a special, non-standard situation for certain assets. Clearly, a better solution is needed.

Plan

The plan is to change edx-platform to serve all static course assets from external storage using external webservers. On course asset upload, edx-platform will transfer the uploaded asset to external storage and store a URL to the asset instead of the asset itself. On course asset download, the URL to the asset will be served to the client, which will then download the asset from the external storage directly. This change will allow edx.org webservers and their application to dedicate their connections and processing to serving course content and exercise-taking logic instead of serving assets.

Other parts of the edx-platform already use Amazon S3 via the Python S3 boto module. We plan to use S3 for the external storage and static asset serving as well. Using S3 has the added advantage of possibly using the Amazon CloudFront CDN to cache the assets close to the users requesting them, thereby improving their course-taking experience.

Requirements

Below are the requirements of the new asset-serving design:

Below are the "optional" requirements of the design and implementation - these would be nice to have:

Relevant Existing Code

There's a singleton contentstore() that's used whenever an asset needs to be stored to / retrieved from GridFS. The contentstore code which connects to MongoDB and saves/finds/deletes files is here:

edx-platform/common/lib/xmodule/xmodule/contentstore/mongo.py

The contentstore is lazily created by code here:

edx-platform/common/lib/xmodule/xmodule/contentstore/django.py

The Mongo connection details are read from an appropriate environment PY file and are stored in the CONTENTSTORE variable. Those environment files are different for Studio and LMS:

The code which saves assets to the contentstore is in the _upload_asset() method in this file:

edx-platform/cms/djangoapps/contentstore/views/assets.py

The code which checks a user's role and serves them an asset is in the StaticContentServer().process_request() method in this file:

edx-platform/common/djangoapps/contentserver/middleware.py

An existing example of S3 upload in the edX codebase is here:

edx-platform/common/lib/xmodule/xmodule/open_ended_grading_classes/openendedchild.py

Questions and Considerations

1) The course author will upload a file to the edx-platform. Then the edx-platform will upload the file to Amazon S3. In the case of a large file, it's unclear both:

In the case of a failure to upload to S3, would we want to re-try the upload later?

2) We'll likely want a method of external storage in a test environment that is not S3. For example, a developer using a devstack instance shouldn't be required to set up an S3 account to get Studio running.

3) Mid-course updates

4) Locked asset security

5) CloudFront

6) Different modulestores

7) Legal copyright concerns

Course Versioning and Asset Metadata

Split supports course versioning, but Old Mongo and XML do not. A particular Split course version is stored in a "structure" document in the Split MongoDB collection. If all the asset metadata is stored in the structure document as well, then a new course version could refer to updated versions of the course assets, or to new assets altogether. Since supporting this asset versioning in Split is both possible and not difficult, it seems desirable to do so.

Split versioning suggests that the Split modulestore should be responsible for storing asset metadata. And if Split does its own asset metadata storage, then Old Mongo should provide its own asset metadata storage as well. So this design will make each modulestore responsible for storing its own asset metadata.

Each modulestore would need to implement methods equivalent to the current contentstore for the querying of course asset metadata. Those methods include:

Split

Inside each course "structure" snapshot, fields would be added to record the mapping of external filename to internal name, whether the asset is locked, and version information for each asset. The new data would be similar to this snippet:


{_id: snapshot_version_id,
 ...
 storage: storage_location,
 assets: [
    {filename: external_filename, filepath: import_filepath, internal_name: internal_name, locked: bool, edit_info: {update_version, previous_version, edited_on, edited_by}},
    ...
  ],
 thumbnails: [
    {filename: external_filename, internal_name: internal_name}
  ],
 ...
}


This additional data will allow multiple course runs to share assets and will provide asset history. It also provides a quick way to get all of the asset ids for a course without asking the contentstore.
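Given a structure document shaped like the snippet above, that quick lookup could be as simple as the following sketch (field names follow the snippet; the helper functions themselves are hypothetical):

```python
# Hypothetical helpers over a Split "structure" document shaped like the
# snippet above. Field names follow the sketch; nothing here is shipping code.

def asset_ids(structure):
    """Return every asset filename for a course version without asking
    the contentstore."""
    return [asset["filename"] for asset in structure.get("assets", [])]

def find_asset(structure, filename):
    """Look up one asset's metadata (internal name, lock status, ...)."""
    for asset in structure.get("assets", []):
        if asset["filename"] == filename:
            return asset
    return None

structure = {
    "_id": "snapshot-v2",
    "assets": [
        {"filename": "syllabus.pdf", "internal_name": "syllabus.pdf/d41d8c",
         "locked": True},
        {"filename": "intro.mp4", "internal_name": "intro.mp4/9e107d",
         "locked": False},
    ],
}

print(asset_ids(structure))                          # ['syllabus.pdf', 'intro.mp4']
print(find_asset(structure, "intro.mp4")["locked"])  # False
```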

Old Mongo

The asset metadata would be stored in a separate collection in the modulestore MongoDB instance. Each course's assets would be contained in a separate document in the collection, keyed back to the course by the course_id. The asset documents would look similar to this snippet:

{course_id: <_id from course collection>,
 storage: storage_location,
 assets: [
    {filename: external_filename, filepath: import_filepath, internal_name: internal_name, locked: bool, edit_info: {update_version, previous_version, edited_on, edited_by}},
    ...
  ],
 thumbnails: [
    {filename: external_filename, internal_name: internal_name}
  ],
 ...
}


XML

XML courses are fully contained in a disk-based tree. The course assets also live in this tree and are served from disk. Adding assets to an XML course is a manual process in which the course tree gets updated with the new assets and module references to those assets. So XML course assets aren't in GridFS at all and are served by nginx via /static links. The work required to move all the XML course assets off disk and into S3 in order to serve them externally is outside the scope of this proposed work.

High-Level Design

Below is the sequence of events that'll happen during asset upload and download.

Upload Asset Steps

(These steps are for client->edX->S3 upload.)

  1. The course author POSTs an asset via the Files & Uploads page in Studio.
  2. A thumbnail is generated (if appropriate).
  3. An MD5 hash is generated for the asset.
  4. An internal unique asset name (obj_name) is generated for the asset using the MD5 hash.
    1. This obj_name will be used to store and retrieve the asset internally.
    2. Something like <filename>/<hash>.
  5. The asset (& thumbnail, if present) is stored in the appropriate storage for the environment.
    1. For prod, the storage would be S3.
    2. For other environments, the storage might be filesystem, GridFS, or some other simple store.
  6. On storage failure, an error is reported back to the course author.
  7. The asset obj_name is stored as the file identifier.
    1. The asset metadata would be stored as a Django model in the DB.
    2. The code should store the asset metadata in the same place regardless of where the assets are stored - to simplify things.
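Steps 3-4 above can be sketched in a few lines (a minimal sketch; the exact <filename>/<hash> layout comes from the outline and may change):

```python
import hashlib

def make_obj_name(filename, data):
    """Derive the internal, content-addressed asset name (steps 3-4)."""
    md5 = hashlib.md5(data).hexdigest()
    # Identical content always yields the same obj_name, so re-uploads of
    # an unchanged file map to the same stored object.
    return "%s/%s" % (filename, md5)

print(make_obj_name("syllabus.pdf", b"example asset bytes"))
```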

Download Asset Steps

  1. An edX user requests an asset via GET.
  2. The LMS redirects the user to a different download link.
    1. The redirect URL depends on the storage used, the asset lock status, and the user's permissions.
  3. Download completes with success or failure.
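The redirect decision in step 2 could look like this hedged sketch (all names are hypothetical; the real logic would live in the contentserver middleware):

```python
# Hedged sketch of the redirect decision in step 2. All names are
# hypothetical; in edx-platform this logic would live in the
# contentserver middleware.

def redirect_url(asset, user_enrolled, make_temp_url, make_public_url):
    """Return a redirect target for the asset, or None to deny access."""
    if asset["locked"]:
        if not user_enrolled:
            return None  # the middleware would answer 403 here
        # Locked assets get an expiring URL (e.g. S3's generate_url()).
        return make_temp_url(asset["internal_name"])
    # Unlocked assets can use a permanent (possibly CDN-fronted) URL.
    return make_public_url(asset["internal_name"])

url = redirect_url(
    {"internal_name": "intro.mp4/9e107d", "locked": True},
    user_enrolled=True,
    make_temp_url=lambda name: "https://s3.example/signed/" + name,
    make_public_url=lambda name: "https://cdn.example/" + name,
)
print(url)  # https://s3.example/signed/intro.mp4/9e107d
```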

New AssetStore API

AssetStore will parallel the current contentstore and will be a top-level, storage-agnostic interface for storing asset data and metadata. Different storage implementations, such as S3, GridFS, and filesystem, will plug in behind this common interface.

Below is a sketch of the API:

Storage By Author

Download By User
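Pulling the author-side storage and user-side download halves together, one hypothetical shape for the interface (method names are assumptions, not the final API):

```python
# One hypothetical shape for the storage-agnostic interface; method
# names are assumptions, not the final API.
from abc import ABC, abstractmethod

class AssetStore(ABC):
    @abstractmethod
    def save(self, obj_name, data):
        """Author side: store asset bytes under the internal obj_name."""

    @abstractmethod
    def url_for(self, obj_name, locked=False):
        """User side: return a (possibly expiring) download URL."""

class InMemoryStore(AssetStore):
    """Toy implementation for illustration; S3Store, FileStore, and
    MongoStore would plug in behind the same interface."""

    def __init__(self):
        self._blobs = {}

    def save(self, obj_name, data):
        self._blobs[obj_name] = data

    def url_for(self, obj_name, locked=False):
        return "memory://" + obj_name

store = InMemoryStore()
store.save("intro.mp4/9e107d", b"...")
print(store.url_for("intro.mp4/9e107d"))  # memory://intro.mp4/9e107d
```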

New Storage Implementations

S3Store

Summary

Amazon's S3 is a web service that stores files and allows their download via HTTP. To use the service, an AWS user creates one or more buckets in which objects are stored. Objects are referenced by name. For tons of info on S3, see the API reference. edX will use the AWS boto Python module for access to S3 - the module documentation is here.

Since the S3 storage is external and publicly reachable, an asset download URL given to one user would work for any Internet user. To avoid locked-asset access by non-participating users, the code would generate temporary URLs which expire after a configurable number of seconds. The boto.s3.key.Key class has a generate_url() method that supports temporary URL generation.

One requirement for asset download is that the downloaded asset is saved to disk with the original filename. With S3, we can specify the filename directly by overriding the Content-Disposition header. The code would pass back the generated URL for the client to GET with the following parameters:

http://s3.amazonaws.com/<generated_url>?response-content-disposition=attachment%3B%20filename%3D<original_filename>
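The percent-encoding in that URL can be produced with the standard library alone, as in this sketch (with boto, the same value can be passed to Key.generate_url() via its response_headers argument):

```python
# How the Content-Disposition override in the URL above is built.
# Stdlib-only sketch; the parameter name is the one S3 documents for
# response header overrides on GET Object.
from urllib.parse import quote

def disposition_param(original_filename):
    value = "attachment; filename=%s" % original_filename
    return "response-content-disposition=" + quote(value, safe="")

print(disposition_param("syllabus.pdf"))
# response-content-disposition=attachment%3B%20filename%3Dsyllabus.pdf
```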

S3 can encrypt any objects you upload to a bucket. But we will not encrypt the assets - that security level is not a requirement and would add needless complexity. All asset files will be uploaded with a "public-read" canned ACL.

Here's the S3 API details:

Direct Client Upload vs. Upload via edX

Since the assets are stored externally in S3, there's technically no need to upload an asset to Studio and then have Studio upload the asset to S3. Direct client upload is supported by S3 - here's an article specifying how to do this direct upload. And here's a Django application that assists in direct S3 upload. There are pros and cons to each method - here are the cons:

  1. Course author asset uploads will occur much less often than course asset downloads - and long uploads will therefore take up fewer application connections than long direct downloads (which is the actual current problem).
  2. Common client code will become more complex - both for S3 storage and non-S3 storage implementations that are still supported.
  3. An S3 asset naming scheme which uses an MD5 hash in the storage path will not be possible, since the server can't generate a hash before the S3 upload. The client might be capable of generating a hash, but trusting the client to generate a proper hash is a security problem.
  4. Thumbnails can't be generated for assets sent directly to S3 from the client. Again, the client might be able to generate a thumbnail, but that's also a security risk.

And the pros:

  1. Faster asset upload completion - one long hop to S3 (client->S3) instead of one long hop and one short/medium hop (client->edX->S3).
  2. No worker threads tied up in long asset uploads.

Upload via edX w/Job-Based S3 Upload

There's also another option - upload via edX with job-based S3 upload. In this method, an author uploads an asset to Studio and then Studio queues up an upload job to S3 via Celery. The asset then enters an "in_progress" state. After job completion, the asset would show up as a normal course asset on success. On failure, the job would retry the upload a configurable number of times. If the S3 upload still failed after retries, the author would be notified somehow. This method has some advantages over regular upload via edX, like not tying up a gunicorn worker thread during the S3 upload. But, unfortunately, it is of equal or higher complexity than direct client upload, due to:

FileStore

Django can store files on and serve them from the local filesystem; this implementation would use that capability.

Here's the FileStore API details:

In a multi-webserver environment, the asset root would need to point to shared storage in order for the saved asset files to be served from all webservers.
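As a minimal sketch of what such a filesystem-backed store could look like (names and layout are assumptions, not the actual Django storage API):

```python
# Minimal sketch of a filesystem-backed store; names and layout are
# assumptions, not the actual implementation. asset_root would point at
# shared storage (e.g. NFS) in a multi-webserver deployment.
import os
import tempfile

class FileStore:
    def __init__(self, asset_root):
        self.asset_root = asset_root

    def save(self, obj_name, data):
        # obj_name may contain "/" (e.g. <filename>/<hash>), so create
        # any intermediate directories first.
        path = os.path.join(self.asset_root, obj_name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        return path

    def open(self, obj_name):
        with open(os.path.join(self.asset_root, obj_name), "rb") as f:
            return f.read()

store = FileStore(tempfile.mkdtemp())
store.save("intro.mp4/9e107d", b"asset bytes")
print(store.open("intro.mp4/9e107d"))  # b'asset bytes'
```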

MongoStore

The GridFS implementation is wrapped behind the current contentstore() singleton. It stores the actual assets and the metadata about assets in two different collections. Ideally, we'd re-use the contentstore() code and wrap it with the new AssetStore API. The big difference is that the asset metadata will no longer be stored in the contentstore - it'll instead be stored in the appropriate modulestore for each course.

Here's the MongoStore API details:

Course Storage Implementation Configuration

A created course's asset storage implementation will default to either GridFS or filesystem upon course creation. All assets for a particular org/course will be stored in the same location. The configuration which determines which asset storage is used for a course will live in the same place in the modulestore as the asset metadata. To switch a course's asset store to S3 (or back to GridFS or filesystem), a privileged site user (not the course author!) will ideally interact with a new Django admin interface. As a preliminary step, we'd implement a manage.py option to switch a course between asset stores. A course author is unlikely to care where their course's assets are stored and served from - DevOps will be the primary user of this interface. Also, changing the asset store implementation should likely clear out all the asset information, in preparation for the course's first assets to be uploaded, or for all the course's assets to be imported, to the newly chosen location.

Course Import & Export

When a course author exports an edX course, it's designed to be a self-contained package of all the course content, including all course assets. If the exported course package is then imported in a different edX instance, all course content including assets should be available to course participants. This requirement implies that all course assets need to be contained in the exported course package, regardless of the storage location of the course assets.

Course Export

According to the new design, there are three different locations where assets can be stored. If a course stores its assets in GridFS or on the local filesystem, then the assets can be quickly retrieved and added to the course tarball for download. If a course stores its assets on S3, then the edX app will need to download each asset from S3 in order to add all assets to the tarball for download. Depending on the total size of all course assets, this download could potentially take a few minutes. For AWS-hosted edX instances, this download time would hopefully be minimal, as the file transfers would happen internal to AWS. Course export time might be increased when using S3 storage - meaning a greater need to shift course export to a background task. NOTE: According to edX DevOps, the largest total size of course assets is currently around 1GB. But most courses have a total asset size of tens of MB.

If course assets were imported under a particular filepath, then those assets will retain the same filepath upon export. This proper placement will be achieved using the asset metadata, which will store the original filepath along with other metadata.

Course Import

Before a course is imported, an asset storage configuration must be chosen from the three available storage implementations - filesystem, GridFS, or S3. Upon course import, the import process saves each asset to the configured storage. If the chosen storage is filesystem or GridFS, then the assets can be stored in a relatively short time. If the chosen storage is S3, then each asset will need to be uploaded to S3 (if not already present). The S3 upload could potentially take a significant amount of time, depending on the total size of all course assets. edX has near-term plans to shift course import to a background task to help with issues around a lengthy course import.

If course assets exist under a particular filepath in the import tree, then the filepath will be captured in the asset metadata and preserved if the asset is later exported.

Migration Plan

A method for migrating a course from using one kind of storage implementation to a different one is also needed. To accomplish this migration, here's the proposed procedure:

  1. Export the course to local disk.
  2. Change the storage implementation configuration for the course.
  3. Import the exported course.

Using export/import for asset storage migration would avoid the need for a separate migration script that'd need to be developed and maintained separately. Course import and export are required features that will be maintained - migration would simply use the same mechanism.

User Stories and Rollout Plan

The GridFS replacement work will be tracked using the JIRA Epic tasks listed below:

Here are the user stories written for this work, which will be added to the Epic stories above - some of these will be converted to Acceptance Criteria instead:

Here's a high-level rollout plan for the work: