GridFS Replacement

Background

Open edX allows course authors to upload assets that are then served to students during the course. These assets can be nearly anything - text files, PDFs, audio/video clips, tarballs, etc. The edx-platform currently takes these uploads and stores them in the "contentstore". The contentstore is a MongoDB-backed GridFS instance. GridFS stores file data in chunks within a MongoDB instance, can store files greater than 16 MB in size, and can serve portions of large files instead of the whole file. An asset can be uploaded as "locked" or "unlocked". A locked asset is only available to students taking a particular course - the edx-platform checks the user's role before serving the file. Unlocked assets are served to any user upon request. When a student in a course requests an asset, the entire asset is served from GridFS.

However, since these assets are served by edx.org webservers, if many students using low bandwidth connections attempt asset downloads, a large number of available connections are used up for a long period of time by the asset downloads. This situation leaves only a limited number of remaining connections for students taking courses - and has caused the edx.org website to become unresponsive in the past. The workaround for this situation has been to move frequently-requested, mostly large assets out of GridFS and instead serve them from external servers via URLs. This workaround causes locked assets to become unlocked and creates a special, non-standard situation for certain assets. Clearly, a better solution is needed.

Plan

The plan is to change edx-platform to serve all static course assets from external storage using external webservers. On course asset upload, edx-platform will transfer the uploaded asset to external storage and store a URL to the asset instead of the asset itself. On course asset download, the URL to the asset will be served to the client, which will then download the asset from the external storage directly. This change will allow edx.org webservers and their application to dedicate their connections and processing to serving course content and exercise-taking logic instead of serving assets.

Other parts of the edx-platform already use Amazon S3 via the Python S3 boto module. We plan to use S3 for the external storage and static asset serving as well. Using S3 has the added advantage of possibly using the Amazon CloudFront CDN to cache the assets close to the users requesting them, thereby improving their course-taking experience.

Requirements

Below are the requirements of the new asset-serving design:

Below are the "optional" requirements of the design and implementation - these would be nice to have:

Relevant Existing Code

There's a singleton contentstore() that's used whenever an asset needs to be stored to / retrieved from GridFS. The contentstore code which connects to MongoDB and saves/finds/deletes files is here:

edx-platform/common/lib/xmodule/xmodule/contentstore/mongo.py

The contentstore is lazily created by code here:

edx-platform/common/lib/xmodule/xmodule/contentstore/django.py

The Mongo connection details are read from an appropriate environment PY file and are stored in the CONTENTSTORE variable. Those environment files are different for Studio and LMS:

The code which saves assets to the contentstore is in the _upload_asset() method in this file:

edx-platform/cms/djangoapps/contentstore/views/assets.py

The code which checks a user's role and serves them an asset is in the StaticContentServer().process_request() method in this file:

edx-platform/common/djangoapps/contentserver/middleware.py

An existing example of S3 upload in the edX codebase is here:

edx-platform/common/lib/xmodule/xmodule/open_ended_grading_classes/openendedchild.py

Questions and Considerations

1) The course author will upload a file to the edx-platform. Then the edx-platform will upload the file to Amazon S3. In the case of a large file, it's unclear both:

In the case of a failure to upload to S3, would we want to re-try the upload later?

2) We'll likely want a method of external storage in a test environment that is not S3. For example, a developer using a devstack instance shouldn't be required to set up an S3 account to get Studio running.

3) Mid-course updates

4) Locked asset security

5) CloudFront

6) Different modulestores

7) Legal copyright concerns

Course Versioning and Asset Metadata

Split supports course versioning, but Old Mongo and XML do not. A particular Split course version is stored in a "structure" document in the Split MongoDB collection. If all the asset metadata is stored in the structure document as well, then a new course version could refer to updated versions of the course assets, or to new assets altogether. Since supporting this asset versioning in Split is both possible and not difficult, it seems desirable to do so.

Split versioning suggests that the Split modulestore should be responsible for storing asset metadata. And if Split does its own asset metadata storage, then Old Mongo should provide its own asset metadata storage as well. So this design will make each modulestore responsible for storing its own asset metadata.

Each modulestore would need to implement methods equivalent to the current contentstore for the querying of course asset metadata. Those methods include:

Split

Inside each course "structure" snapshot, fields would be added to record the mapping of external filename to internal name, whether the asset is locked, and version information for each asset. The new data would be similar to this snippet:


{_id: snapshot_version_id,
 ...
 storage: storage_location,
 assets: [
    {filename: external_filename, filepath: import_filepath, internal_name: internal_name, locked: bool, edit_info: {update_version, previous_version, edited_on, edited_by}},
    ...
  ],
 thumbnails: [
    {filename: external_filename, internal_name: internal_name}
  ],
 ...
}


This additional data will allow multiple course runs to share assets and will provide asset history. It also provides a quick way to get all of the asset ids for a course without asking the contentstore.
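Given a structure document shaped like the snippet above, that quick lookup could be as simple as the following sketch (field names follow the snippet; the helper functions themselves are hypothetical):

```python
# Hypothetical helpers over a Split "structure" document shaped like the
# snippet above. Field names follow the sketch; nothing here is shipping code.

def asset_ids(structure):
    """Return every asset filename for a course version without asking
    the contentstore."""
    return [asset["filename"] for asset in structure.get("assets", [])]

def find_asset(structure, filename):
    """Look up one asset's metadata (internal name, lock status, ...)."""
    for asset in structure.get("assets", []):
        if asset["filename"] == filename:
            return asset
    return None

structure = {
    "_id": "snapshot-v2",
    "assets": [
        {"filename": "syllabus.pdf", "internal_name": "syllabus.pdf/d41d8c",
         "locked": True},
        {"filename": "intro.mp4", "internal_name": "intro.mp4/9e107d",
         "locked": False},
    ],
}

print(asset_ids(structure))                          # ['syllabus.pdf', 'intro.mp4']
print(find_asset(structure, "intro.mp4")["locked"])  # False
```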

Old Mongo

The asset metadata would be stored in a separate collection in the modulestore MongoDB instance. Each course's assets would be contained in a separate document in the collection, keyed back to the course by the course_id. The asset documents would look similar to this snippet:

{course_id: <_id from course collection>,
 storage: storage_location,
 assets: [
    {filename: external_filename, filepath: import_filepath, internal_name: internal_name, locked: bool, edit_info: {update_version, previous_version, edited_on, edited_by}},
    ...
  ],
 thumbnails: [
    {filename: external_filename, internal_name: internal_name}
  ],
 ...
}


XML

XML courses are fully contained in a disk-based tree. The course assets also live in this tree and are served from disk. Adding assets to an XML course is a manual process in which the course tree gets updated with the new assets and module references to those assets. So XML course assets aren't in GridFS at all and are served by nginx via /static links. The work required to move all the XML course assets off disk and into S3 in order to serve them externally is outside the scope of this proposed work.

High-Level Design

Below is the sequence of events that'll happen during asset upload and download.

Upload Asset Steps

(These steps are for client->edX->S3 upload.)

  1. The course author POSTs an asset via the Files & Uploads page in Studio.
  2. A thumbnail is generated (if appropriate).
  3. An MD5 hash is generated for the asset.
  4. An internal unique asset name (obj_name) is generated for the asset using the MD5 hash.
    1. This obj_name will be used to store and retrieve the asset internally.
    2. Something like <filename>/<hash>.
  5. The asset (& thumbnail, if present) is stored in the appropriate storage for the environment.
    1. For prod, the storage would be S3.
    2. For other environments, the storage might be filesystem, GridFS, or some other simple store.
  6. On storage failure, an error is reported back to the course author.
  7. The asset obj_name is stored as the file identifier.
    1. The asset metadata would be stored as a Django model in the DB.
    2. The code should store the asset metadata in the same place regardless of where the assets are stored - to simplify things.
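Steps 3-4 above can be sketched in a few lines (a minimal sketch; the exact <filename>/<hash> layout comes from the outline and may change):

```python
import hashlib

def make_obj_name(filename, data):
    """Derive the internal, content-addressed asset name (steps 3-4)."""
    md5 = hashlib.md5(data).hexdigest()
    # Identical content always yields the same obj_name, so re-uploads of
    # an unchanged file map to the same stored object.
    return "%s/%s" % (filename, md5)

print(make_obj_name("syllabus.pdf", b"example asset bytes"))
```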

Download Asset Steps

  1. An edX user requests an asset via GET.
  2. The LMS redirects the user to a different download link.
    1. The redirect URL depends on the storage used, the asset lock status, and the user's permissions.
  3. Download completes with success or failure.
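The redirect decision in step 2 could look like this hedged sketch (all names are hypothetical; the real logic would live in the contentserver middleware):

```python
# Hedged sketch of the redirect decision in step 2. All names are
# hypothetical; in edx-platform this logic would live in the
# contentserver middleware.

def redirect_url(asset, user_enrolled, make_temp_url, make_public_url):
    """Return a redirect target for the asset, or None to deny access."""
    if asset["locked"]:
        if not user_enrolled:
            return None  # the middleware would answer 403 here
        # Locked assets get an expiring URL (e.g. S3's generate_url()).
        return make_temp_url(asset["internal_name"])
    # Unlocked assets can use a permanent (possibly CDN-fronted) URL.
    return make_public_url(asset["internal_name"])

url = redirect_url(
    {"internal_name": "intro.mp4/9e107d", "locked": True},
    user_enrolled=True,
    make_temp_url=lambda name: "https://s3.example/signed/" + name,
    make_public_url=lambda name: "https://cdn.example/" + name,
)
print(url)  # https://s3.example/signed/intro.mp4/9e107d
```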

New AssetStore API

AssetStore will parallel the current contentstore and will be a top-level, storage-agnostic interface for storing asset data and metadata. Different storage implementations, such as S3, GridFS, and filesystem, will plug in behind this common interface.

Below is a sketch of the API:

Storage By Author

Download By User
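Pulling the author-side storage and user-side download halves together, one hypothetical shape for the interface (method names are assumptions, not the final API):

```python
# One hypothetical shape for the storage-agnostic interface; method
# names are assumptions, not the final API.
from abc import ABC, abstractmethod

class AssetStore(ABC):
    @abstractmethod
    def save(self, obj_name, data):
        """Author side: store asset bytes under the internal obj_name."""

    @abstractmethod
    def url_for(self, obj_name, locked=False):
        """User side: return a (possibly expiring) download URL."""

class InMemoryStore(AssetStore):
    """Toy implementation for illustration; S3Store, FileStore, and
    MongoStore would plug in behind the same interface."""

    def __init__(self):
        self._blobs = {}

    def save(self, obj_name, data):
        self._blobs[obj_name] = data

    def url_for(self, obj_name, locked=False):
        return "memory://" + obj_name

store = InMemoryStore()
store.save("intro.mp4/9e107d", b"...")
print(store.url_for("intro.mp4/9e107d"))  # memory://intro.mp4/9e107d
```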

New Storage Implementations

S3Store

Summary

Amazon's S3 is a web service that stores files and allows their download via HTTP. To use the service, an AWS user creates one or more buckets in which objects are stored. Objects are referenced by name. For tons of info on S3, see the API reference. edX will use the AWS boto Python module for access to S3 - the module documentation is here.

Since the S3 storage is external and publicly reachable, an asset download URL given to one user would work for any Internet user. To avoid locked-asset access by non-participating users, the code would generate temporary URLs which expire after a configurable number of seconds. The boto.s3.key.Key class has a generate_url() method that supports temporary URL generation.

One requirement for asset download is that the downloaded asset is saved to disk with the original filename. With S3, we can specify the filename directly by overriding the Content-Disposition header. The code would pass back the generated URL for the client to GET with the following parameters:

http://s3.amazonaws.com/<generated_url>?response-content-disposition=attachment%3B%20filename%3D<original_filename>
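The percent-encoding in that URL can be produced with the standard library alone, as in this sketch (with boto, the same value can be passed to Key.generate_url() via its response_headers argument):

```python
# How the Content-Disposition override in the URL above is built.
# Stdlib-only sketch; the parameter name is the one S3 documents for
# response header overrides on GET Object.
from urllib.parse import quote

def disposition_param(original_filename):
    value = "attachment; filename=%s" % original_filename
    return "response-content-disposition=" + quote(value, safe="")

print(disposition_param("syllabus.pdf"))
# response-content-disposition=attachment%3B%20filename%3Dsyllabus.pdf
```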

S3 can encrypt any objects you upload to a bucket. But we will not encrypt the assets - that security level is not a requirement and would add needless complexity. All asset files will be uploaded with a "public-read" canned ACL.

Here's the S3 API details:

Direct Client Upload vs. Upload via edX

Since the assets are stored externally in S3, there's technically no need to upload an asset to Studio and then have Studio upload the asset to S3. Direct client upload is supported by S3 - here's an article specifying how to do this direct upload. And here's a Django application that assists in direct S3 upload. There are pros and cons to each method - here are the cons:

  1. Course author asset uploads will occur much less often than course asset downloads - and long uploads will therefore take up fewer application connections than long direct downloads (which is the actual current problem).
  2. Common client code will become more complex - both for S3 storage and non-S3 storage implementations that are still supported.
  3. An S3 asset naming scheme which uses an MD5 hash in the storage path will not be possible, since the server can't generate a hash before the S3 upload. The client might be capable of generating a hash, but trusting the client to generate a proper hash is a security problem.
  4. Thumbnails can't be generated for assets sent directly to S3 from the client. Again, the client might be able to generate a thumbnail, but that's also a security risk.

And the pros:

  1. Faster asset upload completion - one long hop to S3 (client->S3) instead of one long hop and one short/medium hop (client->edX->S3).
  2. No worker threads tied up in long asset uploads.

Upload via edX w/Job-Based S3 Upload

There's also another option - upload via edX with job-based S3 upload. In this method, an author uploads an asset to Studio and then Studio queues up an upload job to S3 via Celery. The asset then enters an "in_progress" state. After job completion, the asset would show up as a normal course asset on success. On failure, the job would retry the upload a configurable number of times. If the S3 upload still failed after retries, the author would be notified somehow. This method has some advantages over regular upload via edX, like not tying up a gunicorn worker thread during the S3 upload. But, unfortunately, it is of equal or higher complexity than direct client upload, due to:

FileStore

Django can store files on and serve them from the local filesystem; this implementation would use that capability.

Here's the FileStore API details:

In a multi-webserver environment, the asset root would need to point to shared storage in order for the saved asset files to be served from all webservers.
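As a minimal sketch of what such a filesystem-backed store could look like (names and layout are assumptions, not the actual Django storage API):

```python
# Minimal sketch of a filesystem-backed store; names and layout are
# assumptions, not the actual implementation. asset_root would point at
# shared storage (e.g. NFS) in a multi-webserver deployment.
import os
import tempfile

class FileStore:
    def __init__(self, asset_root):
        self.asset_root = asset_root

    def save(self, obj_name, data):
        # obj_name may contain "/" (e.g. <filename>/<hash>), so create
        # any intermediate directories first.
        path = os.path.join(self.asset_root, obj_name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        return path

    def open(self, obj_name):
        with open(os.path.join(self.asset_root, obj_name), "rb") as f:
            return f.read()

store = FileStore(tempfile.mkdtemp())
store.save("intro.mp4/9e107d", b"asset bytes")
print(store.open("intro.mp4/9e107d"))  # b'asset bytes'
```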

MongoStore

The GridFS implementation is wrapped behind the current contentstore() singleton. It stores the actual assets and the metadata about assets in two different collections. Ideally, we'd re-use the contentstore() code and wrap it with the new AssetStore API. The big difference is that the asset metadata will no longer be stored in the contentstore - it'll instead be stored in the appropriate modulestore for each course.

Here's the MongoStore API details:

Course Storage Implementation Configuration

A created course's asset storage implementation will default to either GridFS or filesystem upon course creation. All assets for a particular org/course will be stored in the same location. The configuration which determines which asset storage is used for a course will live in the same place in the modulestore as the asset metadata. To switch a course's asset store to S3 (or back to GridFS or filesystem), a privileged site user (not the course author!) will ideally interact with a new Django admin interface. As a preliminary step, we'd implement a manage.py option to switch a course between asset stores. A course author is unlikely to care where their course's assets are stored and served from - DevOps will be the primary user of this interface. Also, changing the asset store implementation should likely clear out all the asset information, in preparation for the course's first assets to be uploaded, or for all the course's assets to be imported, to the newly chosen location.

Course Import & Export

When a course author exports an edX course, it's designed to be a self-contained package of all the course content, including all course assets. If the exported course package is then imported in a different edX instance, all course content including assets should be available to course participants. This requirement implies that all course assets need to be contained in the exported course package, regardless of the storage location of the course assets.

Course Export

According to the new design, there are three different locations where assets can be stored. If a course stores its assets in GridFS or on the local filesystem, then the assets can be quickly retrieved and added to the course tarball for download. If a course stores its assets on S3, then the edX app will need to download each asset from S3 in order to add all assets to the tarball for download. Depending on the total size of all course assets, this download could potentially take a few minutes. For AWS-hosted edX instances, this download time would hopefully be minimal, as the file transfers would happen internal to AWS. Course export time might be increased when using S3 storage - meaning a greater need to shift course export to a background task. NOTE: According to edX DevOps, the largest total size of course assets is currently around 1GB. But most courses have a total asset size of tens of MB.

If course assets were imported under a particular filepath, then those assets will retain the same filepath upon export. This proper placement will be achieved using the asset metadata, which will store the original filepath along with other metadata.

Course Import

Before a course is imported, an asset storage configuration must be chosen from the three available storage implementations - filesystem, GridFS, or S3. Upon course import, the import process saves each asset to the configured storage. If the chosen storage is filesystem or GridFS, then the assets can be stored in a relatively short time. If the chosen storage is S3, then each asset will need to be uploaded to S3 (if not already present). The S3 upload could potentially take a significant amount of time, depending on the total size of all course assets. edX has near-term plans to shift course import to a background task to help with issues around a lengthy course import.

If course assets exist under a particular filepath in the import tree, then the filepath will be captured in the asset metadata and preserved if the asset is later exported.

Migration Plan

A method for migrating a course from using one kind of storage implementation to a different one is also needed. To accomplish this migration, here's the proposed procedure:

  1. Export the course to local disk.
  2. Change the storage implementation configuration for the course.
  3. Import the exported course.

Using export/import for asset storage migration would avoid the need for a separate migration script that'd need to be developed and maintained separately. Course import and export are required features that will be maintained - migration would simply use the same mechanism.

User Stories and Rollout Plan

The GridFS replacement work will be tracked using the JIRA Epic tasks listed below:

Here are the user stories written for this work, which will be added to the Epic stories above - some of these will be converted to Acceptance Criteria instead:

Here's a high-level rollout plan for the work: