Migration from contentstore to S3

Objective:

Migrate transcript data from contentstore to S3 in a live production instance with negligible downtime.

Strategy:

The transcripts on S3 will be made available to users one course at a time, once the transcript data for that course has been migrated. A management script will be written with course-level progress checkpoints and restartability, while ensuring ACID properties for the migration transaction.

Atomicity

Migration atomicity will be at the video level, i.e. a video's transcripts are migrated in their entirety or not at all. If an error occurs for any reason (a dropped connection, for example) and some or all of a video's transcripts are not migrated, the migration transaction for that video must be rolled back. Rollback means restoring S3 to its pre-transaction state.
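
The rollback rule can be sketched as follows. This is a minimal illustration, not the migration script itself: InMemoryStore is a hypothetical stand-in for the S3 bucket, migrate_video and the "video_id/language" key layout are invented for the example, and it assumes the transaction only writes keys that did not exist beforehand, so deleting them restores the pre-transaction state.

```python
class InMemoryStore:
    """Hypothetical stand-in for an S3 bucket: maps key -> bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)


def migrate_video(store, video_id, transcripts):
    """Upload every transcript of one video, or leave the store untouched.

    `transcripts` maps a language code to transcript content (bytes).
    Returns the list of keys written by this transaction.
    """
    uploaded = []  # keys created during this transaction
    try:
        for language, content in transcripts.items():
            if content is None:
                # Simulates any per-transcript failure (read error, timeout, ...).
                raise ValueError(f"missing content for {video_id}/{language}")
            key = f"{video_id}/{language}"  # assumed key layout
            store.put(key, content)
            uploaded.append(key)
    except Exception:
        # Roll back: remove only the keys this transaction created.
        for key in uploaded:
            store.delete(key)
        raise
    return uploaded
```

If any transcript of a video fails, the keys already written for that video are removed and the exception propagates to the caller, so a later run sees the video as not migrated.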

Consistency

At the end of a video's transcript migration run, the system should be in a valid state. Our measure of consistency is the video's transcript count: after migration, the count in contentstore should be less than or equal to the count on S3. We cannot use the total transcript count in the system as a measure, because in a live system end users may be making many updates (adding or updating transcript files) while the migration runs.
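
As a rough sketch of this check, assuming each side can be reduced to a list of transcript language codes for one video (the function names below are hypothetical):

```python
def is_consistent(contentstore_languages, s3_languages):
    """Count-based check: contentstore count must not exceed the S3 count."""
    return len(set(contentstore_languages)) <= len(set(s3_languages))


def missing_languages(contentstore_languages, s3_languages):
    """Languages present in contentstore but absent from S3 (for diagnostics)."""
    return sorted(set(contentstore_languages) - set(s3_languages))
```

The count comparison is the measure stated above. The second helper is an optional, stricter diagnostic: it names the transcripts that are missing rather than only reporting that the counts differ.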

Isolation

The migration will run in a live Production instance. Hence, while the migration for a video is running, end users should not be able to upload transcripts for that video to the contentstore. This will be accomplished by releasing code to Production that writes new and edited video transcripts to edx-val/S3 instead of the contentstore.

This means that during the migration period, a single course could simultaneously have some video transcripts in the contentstore and some in S3.
This way, we do not have to coordinate with course teams and block their edits to courses.

Durability

Once the transcripts for a video have been successfully migrated, all of the changes made to S3 are permanent. Because the transcripts are by then available to end users, a subsequent run of the script must not overwrite the S3 data. Each course migration event will be logged via the PersistOnFailureTask mixin. The script will be idempotent, so it can be re-run after a failed previous run.
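
One way to picture the no-overwrite and idempotency requirements, as a sketch only: plain dicts stand in for the contentstore and for S3, and both function names are hypothetical.

```python
def transcripts_to_migrate(source, already_migrated):
    """Return only the transcripts that are not yet in the destination."""
    return {
        language: content
        for language, content in source.items()
        if language not in already_migrated
    }


def run_once(source, destination):
    """Copy pending transcripts; never touch what the destination already has."""
    pending = transcripts_to_migrate(source, destination.keys())
    destination.update(pending)
    return sorted(pending)  # languages migrated by this run
```

A first run copies only what is missing, and a second run over the same data copies nothing, so a transcript that a user has since edited on S3 is left as it is.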

Implementation

A Django management script will be written that traverses the contentstore and, for each course, finds all the video objects. Any transcripts related to those video objects will be pushed to S3 as an atomic task. Within the transaction, a video's transcript metadata will be migrated from the video component to edx-val, and the corresponding transcript content will be migrated from contentstore to S3.

Pseudo-Code

Happy Path:

Search for all courses in modulestore()

For each course id, create a celery task with the PersistOnFailureTask mixin which will:

    Search for videos
    For each video, create an atomic transaction
        Try
            Do a diff between the video transcripts in the course (modulestore) and the video transcripts in edx-val.
            Push to S3 only those transcripts that are not already in edx-val.
            Commit the edx-val table updates only after all transcripts for that video are uploaded to S3.
        Catch
            Raise the exception and let it be caught by the PersistOnFailureTask
    If all the videos of a course have been processed
        Update the Feature Flag to switch the user to S3 transcripts
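
The ordering in the happy path (upload first, record metadata second, flip the flag last) can be illustrated with the sketch below. It is not the real task: plain dicts and sets stand in for the modulestore, S3, the edx-val tables and the feature flag, and migrate_course and all of its parameters are hypothetical names.

```python
def migrate_course(course_id, course_videos, s3, val_records, feature_flags):
    """Migrate every video of one course, then switch the course to S3.

    course_videos: {video_id: {language: content}} as found in the modulestore.
    s3:            dict standing in for the bucket, keyed by (video_id, language).
    val_records:   {video_id: set of languages} standing in for edx-val tables.
    feature_flags: {course_id: bool}; True means users are served from S3.
    """
    for video_id, transcripts in course_videos.items():
        known = val_records.get(video_id, set())
        # Diff: only transcripts that edx-val does not know about yet.
        pending = {
            language: content
            for language, content in transcripts.items()
            if language not in known
        }
        for language, content in pending.items():
            s3[(video_id, language)] = content
        # Metadata is recorded only after every upload for this video succeeded.
        val_records[video_id] = known | set(pending)
    # Reached only if no video raised; an exception propagates to the task.
    feature_flags[course_id] = True
```

Because the flag is set on the last line, a failure in any video leaves the course on contentstore transcripts until a later run completes.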

Exception Path:

In case of a task-level (course-level) exception

    Retry after 2 seconds, with a maximum of 3 retries
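
The retry policy is shown below as a plain-Python sketch so that it runs without a Celery worker. run_with_retries is a hypothetical helper; in Celery the same policy is normally expressed through the task options max_retries and default_retry_delay together with a call to the task's retry() method.

```python
import time


def run_with_retries(task, max_retries=3, delay_seconds=2, sleep=time.sleep):
    """Call `task()`; on an exception, retry up to `max_retries` more times.

    `sleep` is injectable so the delay can be skipped in tests.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(delay_seconds)
```

With the defaults, a course task is attempted at most four times (one initial attempt plus three retries), with a two-second pause before each retry.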

Rollout

Mock Runs

There should be at least two mock runs of the script on a refreshed staging instance. An issue-fix cycle will follow each mock run.

Validation

Migration validation will be at the video level. Before the feature flag is updated to enable Phase II for a course, the transcript count for each video will be compared between the modulestore and edx-val.

End User Communication

The migration window will be scheduled for a period of least activity.

Rollout and Post Rollout Support

Team Mallow will provide pre- and post-rollout support.

Open Issues:

Where should the script search for transcripts … in draft, published or both? (Resolved: both will be migrated.)
Should we delete any partially migrated data for a “failed migration” course from S3? Alternatively, we could overwrite the data in a subsequent run of the script. (Resolved: transcripts will not be overwritten. See section 4b in comments by Nimisha.)

Architecture and Engineering