Initial plan is to migrate transcripts to S3 in batches after releasing the 3rd Party transcripts feature.
edX Platform
There will be a management command which will retrieve each course's active_versions and traverse them to fetch their video components. It will collect transcript metadata from the retrieved components and use that metadata to query the transcript content from the content store. If transcript content is found, it will be uploaded to S3 and its metadata added to the Transcript model (i.e. using edx_video_id, or the prioritized external_video_id). The management command will have an overwrite flag which decides whether we are going to overwrite the existing transcripts in S3 with the ones from Contentstore.
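The per-video migration step could be sketched as below. This is a minimal illustration, not the actual platform code: the `contentstore`, `s3`, and `metadata_store` parameters stand in for the real content store, S3 client, and Transcript model, and `migrate_video_transcripts` is a hypothetical helper name.

```python
def migrate_video_transcripts(video, contentstore, s3, metadata_store, overwrite=False):
    """Copy each transcript of one video component from the content store to S3.

    Skips transcripts already present in S3 unless `overwrite` is set,
    mirroring the management command's overwrite flag.
    Returns the number of transcripts migrated.
    """
    migrated = 0
    for lang, filename in video["transcripts"].items():
        content = contentstore.get(filename)
        if content is None:
            continue  # no transcript content found for this entry
        if filename in s3 and not overwrite:
            continue  # leave the existing S3 transcript untouched
        s3[filename] = content
        # Record the transcript metadata keyed by edx_video_id and language
        metadata_store[(video["edx_video_id"], lang)] = filename
        migrated += 1
    return migrated
```

Here `s3` and the stores are modeled as plain dicts so the control flow (skip vs. overwrite) is easy to follow; the real command would call the S3 client and the Transcript model instead.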
Open edX Platforms
...
Objective:
Migrate transcript data from contentstore to S3 in a live production instance with negligible downtime.
Strategy:
The transcripts on S3 will be made available to the users ‘one course at a time’ after the transcript data has been migrated for that course. A management script will be written which will have course level progress checkpoints and restartability features while ensuring ACID properties for the migration transaction.
Atomicity
The migration transaction atomicity will be at the course level i.e. course transcripts are migrated in their entirety. If for any reason (connection etc.), an error occurs and all or some of the transcripts are not migrated for a course, then the migration transaction for that course must be rolled back. Rollback would involve bringing S3 to the pre-transaction state.
Consistency
At the end of a course transcript migration run, the system should be in a valid state. Our measure of consistency will be comparing the course's transcript count in contentstore before migration with its count on S3 after migration. We cannot use the ‘total transcript count in the system’ as a measure in a live system where end users are potentially making many updates (adding or updating transcript files).
Isolation
The migration will run in a live Production instance. Hence, while the migration for a course is being run, end users should not be able to update the transcripts for videos in the course. Otherwise, there could be a case where an already migrated transcript is updated on the contentstore and subsequently the non-updated S3 version is presented to the user. We feel that the overhead to implement transcript locks may be overkill in this case. As a workaround, the end users could be sent a mass communication so that they avoid uploading or updating transcripts during the planned time window for running the migration script.
Durability
Once the transcripts for a course have been successfully migrated, all of the changes made to S3 are permanent. As the transcripts have been made available to the end users, it must be ensured that a subsequent run of the script does not overwrite the S3 data. Each course migration event will be logged in a migration status table.
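The durability and restartability requirements above amount to a small state machine over the migration status table. The sketch below is an in-memory stand-in (the real table would be a Django model); the class name, state strings, and `pending` helper are illustrative assumptions, except for the four statuses, which come from the pseudo-code below.

```python
# Hypothetical in-memory model of the migration status table.
# 'Migrated' is terminal: a subsequent run must never overwrite
# already-served S3 data (durability); 'Failed' courses may be retried
# on a restarted run (restartability).
VALID_TRANSITIONS = {
    "Not-Migrated": {"In-Progress"},
    "In-Progress": {"Migrated", "Failed"},
    "Failed": {"In-Progress"},
    "Migrated": set(),
}

class MigrationStatusTable:
    def __init__(self, course_ids):
        self.status = {cid: "Not-Migrated" for cid in course_ids}

    def transition(self, course_id, new_status):
        current = self.status[course_id]
        if new_status not in VALID_TRANSITIONS[current]:
            raise ValueError(
                f"illegal transition {current} -> {new_status} for {course_id}"
            )
        self.status[course_id] = new_status

    def pending(self):
        # Courses a (re)started run should still process.
        return [c for c, s in self.status.items() if s in ("Not-Migrated", "Failed")]
```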
Implementation
A Django management script will be written that will traverse the contentstore and, for each course, find all the video objects. Any transcripts related to the found video objects will be pushed to S3.
Pseudo-Code
Happy Path:
Search for all courses in ContentStore
Put the Course ids in the migration status table with status as ‘Not-Migrated’
Run the following code in a Try-Catch
For each course id in the migration status table
Update the migration status for that course from ‘Not-Migrated’ to ‘In-Progress’
Search for videos
For each video, search for transcripts
For each transcript
push data to S3
If all the videos of a course have been processed, update course migration status in the migration status table from ‘In-Progress’ to ‘Migrated’
Update the Feature Flag to switch the user to S3 transcripts
Exception Path:
In case of any Exception
Log the exception with Course Id in an error-log file
Update the migration status for that course from ‘In-Progress’ to ‘Failed’
Remove all transcripts of the Course from S3
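The happy and exception paths for a single course can be sketched as one function. This is a hedged illustration, not the real script: `find_videos`, `find_transcripts`, `push_to_s3`, `remove_course_from_s3`, `update_status`, and `log_error` are all assumed helpers injected as parameters, not actual platform APIs.

```python
def migrate_course(course_id, find_videos, find_transcripts, push_to_s3,
                   remove_course_from_s3, update_status, log_error):
    """Migrate one course's transcripts, per the pseudo-code above.

    On any exception the course is marked 'Failed' and its S3 data is
    removed, restoring the pre-transaction state (course-level atomicity).
    Returns True on success, False on rollback.
    """
    update_status(course_id, "In-Progress")
    try:
        for video in find_videos(course_id):
            for transcript in find_transcripts(video):
                push_to_s3(course_id, transcript)
        update_status(course_id, "Migrated")
        return True
    except Exception as exc:
        log_error(course_id, exc)
        update_status(course_id, "Failed")
        # Rollback: remove all transcripts of the course from S3
        remove_course_from_s3(course_id)
        return False
```

A driver loop would call this for each course id in the migration status table, then flip the feature flag only for courses that return True.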
Rollout
Mock Runs
There should be at least two Mock runs of the script on a refreshed staging instance. An issue-fix cycle will follow each Mock run.
Validation
Migration validation will be at the course level. Before updating the feature flag to enable Phase II for a course, the transcript count for the course will be compared between contentstore and S3.
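The validation gate is a simple per-course count comparison; a minimal sketch is below, where `count_contentstore_transcripts` and `count_s3_transcripts` are assumed helper callables, not existing APIs.

```python
def validate_course_migration(course_id, count_contentstore_transcripts,
                              count_s3_transcripts):
    """Return True only if the per-course transcript counts match.

    Gates the feature-flag flip: a course is switched to S3 transcripts
    only when its contentstore count equals its S3 count.
    """
    return count_contentstore_transcripts(course_id) == count_s3_transcripts(course_id)
```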
End User Communication
The migration time window will be chosen as a time of least activity, and users will be informed in advance to refrain from adding or updating transcripts during the time window.
Rollout and Post Rollout Support
Team Mallow will provide any pre- and post-rollout support.
Open Issues:
- Where should the script search for transcripts … in draft, published or both
- Should we delete any partially migrated data for a “failed migration” course from S3? Alternatively, we could overwrite the data in a subsequent run of the script