Transcripts Migration Statistics

Transcripts were migrated from the content store to S3 for 7767 courses.

Date    Job Run #    Migrated Courses
        1            0
        2            0
        3            2
        4            2
        5            100
        6            500
        7            500
        8            1000
        9            1000
        10           1000
        11           1000
        12           1000
        13           1000
        14           1000
        15           1000
        16           663

  • The 1st and 2nd job runs were dry runs; they were not supposed to migrate anything.
  • The 13th, 14th and 15th job runs migrated the same 1000 courses repeatedly, because the 13th and 14th runs were unable to track their migrated courses due to an uncaught exception. The issue was fixed before the 15th run. Because the job is idempotent, repeating the transcript migrations for the same set of courses had no side effects.
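
As a quick sanity check, the 7767 course total can be reproduced from the per-run counts above. This is a minimal sketch that assumes runs 13, 14 and 15 covered exactly the same 1000 courses, so only one of those three runs contributes new courses; the variable names are illustrative only.

```python
# Migrated courses per job run (runs 1..16), copied from the table above.
migrated_per_run = [0, 0, 2, 2, 100, 500, 500] + [1000] * 8 + [663]

# Assumption: runs 13, 14 and 15 processed the same 1000 courses,
# so runs 14 and 15 are counted as repeats of run 13.
repeat_runs = (14, 15)

raw_total = sum(migrated_per_run)
repeated = sum(migrated_per_run[run - 1] for run in repeat_runs)
unique_courses = raw_total - repeated

print(raw_total, unique_courses)  # 9767 7767
```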


The figures above can be found in Splunk. An example query looks like the following, where run specifies the job run.

index=prod-edx "Transcript Migration" "run=12" "video-transcripts-migration-process-started-for-course"
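
To collect the same figure for every run, the query string can be generated rather than edited by hand. The sketch below only builds the search strings; the helper name build_query is hypothetical, and it assumes every query on this page follows the same pattern (index, the "Transcript Migration" marker, the run, and a log event name).

```python
def build_query(run, event, index="prod-edx"):
    """Return a Splunk search string for one job run and one log event name."""
    return f'index={index} "Transcript Migration" "run={run}" "{event}"'


# One query per job run (1..16) for the course-level "started" event.
event = "video-transcripts-migration-process-started-for-course"
queries = [build_query(run, event) for run in range(1, 17)]

for query in queries:
    print(query)
```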

Video-wise Migration Stats

Transcripts have been successfully migrated for ~756,201 videos, and ~161,944 external videos have been created in edxval from the corresponding video components. Below are the results gathered in Splunk from the logs emitted by the transcripts migration job.

Run #   Videos submitted for     Videos with no   Videos completed         Number of
        Transcripts Migration    transcripts      Transcripts Migration    External Videos
1       0                        0                0                        0
2       0                        0                0                        0
3       278                      2                276                      216
4       250                      6                244                      1
5       11,155                   723              10,432                   3,072
6       50,555                   5,611            45,029                   10,346
7       50,651                   5,394            45,334                   9,865
8       97,303                   8,684            88,807                   18,547
9       107,406                  11,073           96,333                   19,764
10      101,718                  13,636           88,071                   18,946
11      103,437                  10,053           93,630                   20,023
12      98,299                   11,585           86,895                   17,756
13      101,329                  10,037           91,512                   19,799
14      99,473                   9,902            89,783                   9,100
15      102,728                  9,382            91,150                   3,674
16      64,315                   7,964            56,278                   10,835


Below are the queries for the figures above; "run" can be adjusted to the desired job run:

Splunk Queries
# Videos submitted for the migration, excluding videos that have no transcripts
index=prod-edx "Transcript Migration" "run=16" "transcripts-migration-tasks-submitted" 

# Videos without transcripts
index=prod-edx "Transcript Migration" "run=16" "transcripts-migration-tasks-submitted" 

# Videos completed Transcripts Migration 
index=prod-edx "Transcript Migration" "run=16" "video-transcripts-migration-complete-for-a-video" 

# Number of created External Videos (i.e. videos that do not have edx_video_id set)
index=prod-edx "Transcript Migration" "run=16" "generated-edx-video-id" 

Transcripts Migration Stats

A total of ~990,378 transcripts were successfully migrated from the content store to S3. There was also a small number of migration failures, shown below: 2,806 transcripts, which is 0.283% of the successfully migrated transcripts.

Run #   Number of successfully    Number of transcripts     Transcripts with no
        migrated transcripts      whose migration failed    content to migrate
1       0                         0                         0
2       0                         0                         0
3       276                       0                         0
4       242                       0                         2
5       11,791                    81                        0
6       54,688                    4                         54
7       50,213                    3                         65
8       98,717                    8                         55
9       109,211                   2                         158
10      99,990                    3                         75
11      103,298                   5                         73
12      96,878                    9                         119
13      101,484                   5                         115
14      99,206                    0                         100
15      101,217                   2,495                     97
16      63,167                    191                       52
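
The totals and the 0.283% failure rate quoted above follow directly from the per-run figures in this table. A short check, with illustrative variable names:

```python
# Per-run figures (runs 1..16), copied from the table above.
succeeded = [0, 0, 276, 242, 11791, 54688, 50213, 98717,
             109211, 99990, 103298, 96878, 101484, 99206, 101217, 63167]
failed = [0, 0, 0, 0, 81, 4, 3, 8, 2, 3, 5, 9, 5, 0, 2495, 191]

total_succeeded = sum(succeeded)
total_failed = sum(failed)

# Failure rate expressed relative to the successfully migrated transcripts,
# which is how the percentage is stated in the text.
failure_pct = 100 * total_failed / total_succeeded

# Share of all failures that came from run 15 alone.
run15_share = failed[14] / total_failed

print(total_succeeded, total_failed, f"{failure_pct:.3f}%")  # 990378 2806 0.283%
```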


Below are the queries for the figures above; "run" can be adjusted to the desired job run:

Splunk Queries
#  Number of successfully migrated transcripts
index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-succeeded-for-a-video"

#  Number of transcripts whose migration failed
index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-failed-with-unknown-exc"

# Transcript with no content to migrate
index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-failed-with-known-exc" 

Run #   Courses   Videos   Transcripts   Failures
1       0         0        0             0
2       0         0        0             0
3       2         276      276           0
4       2         244      242           0
5       100       10432    11791         81
6       500       45029    54688         4
7       500       45334    50213         3
8       1000      88807    98717         8
9       1000      96333    109211        2
10      1000      88071    99990         3
11      1000      93630    103298        5
12      1000      86895    96878         9
13      1000      91512    101484        5
14      1000      89783    99206         0
15      1000      91150    101217        2495
16      663       56278    63167         191

Regarding the 0.283% failures (i.e. 2,806 transcripts): this is taken as the maximum number of transcripts that might not have been migrated due to the exceptions that occurred. The following is what I observed while manually verifying the errors:

Migration failures falling into the categories below can be ignored:

  1. Integrity error raised due to race conditions
  2. For a few legacy courses, transcripts are stored in a different format than their actual content-type (e.g. sjson content in an SRT file)
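
To illustrate the second category, a mislabeled transcript can be spotted by checking whether the supposed SRT content actually parses as sjson. This is a heuristic sketch, not the migration job's own logic; it assumes sjson is a JSON object with "start", "end" and "text" lists, and the helper name looks_like_sjson is hypothetical.

```python
import json


def looks_like_sjson(content):
    """Heuristically check whether text is sjson rather than SRT.

    Assumption: sjson is a JSON object carrying "start", "end" and "text" keys.
    """
    try:
        data = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(data, dict) and {"start", "end", "text"}.issubset(data)


# A genuine SRT cue is not valid JSON, so it is not flagged.
srt_text = "1\n00:00:00,000 --> 00:00:02,000\nHello\n"

# sjson content that was stored under an .srt file name would be flagged.
sjson_text = json.dumps({"start": [0], "end": [2000], "text": ["Hello"]})

print(looks_like_sjson(srt_text), looks_like_sjson(sjson_text))  # False True
```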


Also, the 15th run has a considerably larger number of failures than the other runs; most of these are for non-English transcript languages that failed on decoding with utf-8-sig.
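
The decoding failure is easy to reproduce in isolation. The sketch below assumes the failing transcripts were stored in a legacy single-byte encoding; cp1251 is used purely as an illustration, since the actual encodings of the failed transcripts are not recorded here, and decode_transcript is a hypothetical helper.

```python
import codecs


def decode_transcript(raw):
    """Decode transcript bytes as UTF-8, stripping a leading BOM if present."""
    return raw.decode("utf-8-sig")


# UTF-8 content with a byte-order mark decodes cleanly and the BOM is dropped.
with_bom = codecs.BOM_UTF8 + "Привет".encode("utf-8")
decoded = decode_transcript(with_bom)

# The same text in a legacy encoding (cp1251, as an example) is not valid
# UTF-8, so decoding with utf-8-sig raises UnicodeDecodeError.
legacy = "Привет".encode("cp1251")
try:
    decode_transcript(legacy)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decoded, decode_failed)  # Привет True
```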