...

Migration transaction atomicity will be at the course video level, i.e. the transcripts of a course video are migrated in their entirety. If for any reason (a connection failure, etc.) an error occurs and all or some of the transcripts are not migrated for a course video, then the migration transaction for that course video must be rolled back. Rollback involves restoring S3 to its pre-transaction state.
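A minimal sketch of this all-or-nothing behaviour for a single course video. It assumes a plain dictionary stands in for the S3 bucket; the function name `migrate_video_transcripts` and the `fail_on` parameter are hypothetical and exist only to illustrate the rollback, not to describe the actual script.

```python
_MISSING = object()  # marks keys that did not exist before the transaction


def migrate_video_transcripts(transcripts, bucket, fail_on=None):
    """Upload every transcript of one course video, or none of them.

    `transcripts` maps a storage key to file content. `bucket` is a dict
    standing in for S3 (an assumption for illustration). `fail_on` names a
    key whose upload should fail, to simulate a connection error.
    """
    previous = {}  # key -> value held before this transaction touched it
    try:
        for key, content in transcripts.items():
            if key == fail_on:
                raise ConnectionError("simulated upload failure for %s" % key)
            previous[key] = bucket.get(key, _MISSING)
            bucket[key] = content
    except Exception:
        # Roll back: restore the bucket to its pre-transaction state.
        for key, old in previous.items():
            if old is _MISSING:
                del bucket[key]
            else:
                bucket[key] = old
        raise
    return list(previous)
```

If the second of two uploads fails, the first is removed again and the exception propagates, so the caller sees either a fully migrated video or an untouched bucket.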

Consistency

At the end of a course video transcript migration run, the system should be in a valid state. Our measure of consistency will be the transcript count of the course video: the count in the contentstore before migration should be equal to or less than the count on S3 after migration. We cannot use the ‘total transcript count in the system’ as a measure, because in a live system end users may be making many updates (adding or updating transcript files).
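The consistency rule can be expressed as a one-line check. This is only a sketch; `is_consistent` and its parameter names are hypothetical, and how the two counts are obtained is left open.

```python
def is_consistent(contentstore_count_before, s3_count_after):
    """Return True when S3 holds at least as many transcripts for the
    course video as the contentstore held before the migration."""
    return contentstore_count_before <= s3_count_after
```

The check is deliberately an inequality rather than an equality, since transcripts may be added on the S3 side while the migration is running.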

...

The migration will run in a live Production instance. Hence, while the migration for a course video is being run, end users should not be able to upload or update the transcripts for the videos in the contentstore. Otherwise, there could be a case where an already migrated transcript is updated in the contentstore and the non-updated S3 version is subsequently presented to the user. We feel that the overhead of implementing transcript locks would be overkill in this case. As a workaround, end users could be sent a mass communication asking them to avoid uploading or updating transcripts during the planned time window for running the migration script. Isolation will be accomplished by releasing code to Prod so that new and edited video transcripts are put into edx-val/S3 instead of the contentstore.

  1. This means that during the migration period, a single course could simultaneously have some video transcripts in the contentstore and some in S3.
  2. This way, we do not have to coordinate with course teams and block their edits to courses

Durability

Once the transcripts for a course video have been successfully migrated, all of the changes made to S3 are permanent. Because the transcripts have been made available to end users, a subsequent run of the script must not overwrite the S3 data. Each course migration event will be logged in a migration status table via the PersistOnFailureTask mixin. The script will be idempotent, so it can be re-run even if an earlier run failed midway.
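One way to get this idempotence is to upload only what is not already present. The sketch below assumes transcripts are identified by a key and that the set of keys already recorded in edx-val can be listed; `push_missing_transcripts` and the `upload` callable are hypothetical names used for illustration.

```python
def push_missing_transcripts(course_transcripts, already_in_val, upload):
    """Upload only the transcripts that are absent from edx-val.

    `course_transcripts` maps a transcript key to its content,
    `already_in_val` is any collection of keys already migrated, and
    `upload(key, content)` performs the actual write. Returns the keys
    uploaded by this call.
    """
    missing = sorted(set(course_transcripts) - set(already_in_val))
    for key in missing:
        upload(key, course_transcripts[key])
    return missing
```

Running the function a second time against the same store uploads nothing, so data already on S3 is never overwritten by a re-run.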

Implementation

A Django management script will be written that will traverse the contentstore and for each course will find all the video objects. Any transcripts related to the found video objects will be pushed to S3 as an atomic task.

Pseudo-Code

Happy Path:

Search for all courses in modulestore()

Put the course ids in the migration status table with status ‘Not-Migrated’; do not overwrite course ids that are already present
Run the following code in a Try-Catch

For each ‘Not-Migrated’ or ‘Failed’ course id in the migration status table, create a celery task with the PersistOnFailureTask mixin which will:

    Update the migration status for that course from ‘Not-Migrated’ to ‘In-Progress’

    Search for videos

        For each video, search for transcripts and create an atomic transaction

            Try

                Do a diff between the video transcripts in the course (modulestore) and the video transcripts in edx-val.

                Push to S3 only those transcripts that are not already in edx-val.

                edx-val table updates get committed only after all transcripts for that video are uploaded to S3

            Catch

                Raise the exception and let it be caught by the PersistOnFailureTask

    If all the videos of a course have been processed, update the course migration status in the migration status table from ‘In-Progress’ to ‘Migrated’

        Update the Feature Flag to switch the user to S3 transcripts


Exception Path:

In case of any task (course) level exception

    Log the exception with the course id in an error-log file

    Retry after 2 sec, with a maximum of 3 retries

    After 3 retries, update the migration status for that course from ‘In-Progress’ to ‘Failed’ in the migration status table

    Remove all transcripts of the course from S3
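The status transitions and retry policy in the pseudo-code above can be sketched in plain Python. This is an illustration only: it uses an ordinary loop instead of celery's retry machinery, a dictionary instead of the migration status table, and the names `run_course_migration`, `migrate_course` and `cleanup` are hypothetical.

```python
import logging

NOT_MIGRATED, IN_PROGRESS = "Not-Migrated", "In-Progress"
MIGRATED, FAILED = "Migrated", "Failed"
MAX_RETRIES = 3
RETRY_DELAY_SECONDS = 2

log = logging.getLogger(__name__)


def run_course_migration(course_id, status_table, migrate_course, cleanup,
                         sleep=lambda seconds: None):
    """Apply the happy-path and exception-path transitions for one course.

    `status_table` is a dict standing in for the migration status table.
    `migrate_course(course_id)` does the work and raises on error.
    `cleanup(course_id)` removes the course's transcripts from S3.
    `sleep` is injectable so the sketch can run without real delays.
    """
    current = status_table.get(course_id)
    if current not in (NOT_MIGRATED, FAILED):
        return current  # already migrated or in progress: leave untouched
    status_table[course_id] = IN_PROGRESS
    for attempt in range(1 + MAX_RETRIES):  # first try plus up to 3 retries
        try:
            migrate_course(course_id)
        except Exception:
            log.exception("Migration failed for course %s", course_id)
            if attempt < MAX_RETRIES:
                sleep(RETRY_DELAY_SECONDS)
            continue
        status_table[course_id] = MIGRATED
        return MIGRATED
    # All retries exhausted: mark the course and remove its partial data.
    status_table[course_id] = FAILED
    cleanup(course_id)
    return FAILED
```

A course that ends up ‘Failed’ is picked up again on the next run, because the function accepts both ‘Not-Migrated’ and ‘Failed’ as starting states.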

Rollout

Mock Runs

There should be at least two mock runs of the script on a refreshed staging instance. An issue-fix cycle will follow each mock run.

...

Migration validation will be at the course video level. Before updating the feature flag to enable Phase II for a course, the transcript count for each course video will be compared between the contentstore and S3.

...

The migration time window will be chosen as a time of least activity, and users will be informed in advance to refrain from adding or updating transcripts during that window.

Rollout and Post Rollout Support

...

  1. Where should the script search for transcripts: in draft, published, or both?
  2. Should we delete any partially migrated data for a “failed migration” course from S3? Alternatively, we could overwrite the data in a subsequent run of the script. (Resolved. See section 4.2 in comments by Nimisha)