Transcripts Migration Statistics
There were 7767 courses for which transcripts were migrated from content-store to S3.
Job Run # | Migrated Courses | Date |
---|---|---|
1 | 0 | |
2 | 0 | |
3 | 2 | |
4 | 2 | |
5 | 100 | |
6 | 500 | |
7 | 500 | |
8 | 1000 | |
9 | 1000 | |
10 | 1000 | |
11 | 1000 | |
12 | 1000 | |
13 | 1000 | |
14 | 1000 | |
15 | 1000 | |
16 | 663 |
- The 1st and 2nd job runs were dry runs; they weren't supposed to migrate anything.
- The 13th, 14th and 15th job runs migrated the same 1000 courses repeatedly, because the 13th and 14th runs were unable to track their migrated courses due to an uncaught exception. The issue was fixed before the 15th run. Because the job is idempotent, repeating the transcript migrations for the same set of courses had no side effects.
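The relationship between the per-run counts and the 7767 figure can be checked with a few lines of Python. This is only a sketch: the dictionary is copied by hand from the table above, and the variable names are made up for illustration.

```python
# Courses migrated per job run, copied from the table above (run -> count).
courses_per_run = {
    1: 0, 2: 0, 3: 2, 4: 2, 5: 100, 6: 500, 7: 500, 8: 1000,
    9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000,
    15: 1000, 16: 663,
}

# Runs 13, 14 and 15 processed the same 1000 courses, so that batch is
# counted once. Runs 13 and 14 are the ones whose progress was not tracked.
repeated_runs = {13, 14}

raw_total = sum(courses_per_run.values())
distinct_total = sum(
    count for run, count in courses_per_run.items() if run not in repeated_runs
)

print(raw_total)       # 9767
print(distinct_total)  # 7767
```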
The per-run course counts above can be found on Splunk. An example query looks like the following, where "run" specifies the job run:
index=prod-edx "Transcript Migration" "run=12" "video-transcripts-migration-process-started-for-course"
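Since every query on this page has the same shape and differs only in the run number and the log event name, the search string can be generated instead of edited by hand. The helper below is a hypothetical convenience, not part of the migration job; it only builds the string and does not talk to Splunk.

```python
def build_splunk_query(run, event, index="prod-edx"):
    """Return a Splunk search string for one job run and one log event name.

    Hypothetical helper: it mirrors the query format used on this page.
    """
    terms = ["Transcript Migration", f"run={run}", event]
    quoted = " ".join(f'"{term}"' for term in terms)
    return f"index={index} {quoted}"


query = build_splunk_query(
    12, "video-transcripts-migration-process-started-for-course"
)
print(query)
```

Calling it with a different run number or event name yields the other queries listed on this page.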
Video-wise Migration Stats
Transcripts have been successfully migrated for ~756,201 videos, and ~161,944 external videos have been created in edxval from the corresponding video components. Below are the results gathered from the logs emitted by the transcripts migration job on Splunk.
Run # | Videos submitted for Transcripts Migration | Videos with no transcripts | Videos completed Transcripts Migration | Number of External Videos |
---|---|---|---|---|
1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 |
3 | 278 | 2 | 276 | 216 |
4 | 250 | 6 | 244 | 1 |
5 | 11,155 | 723 | 10,432 | 3,072 |
6 | 50,555 | 5,611 | 45,029 | 10,346 |
7 | 50,651 | 5,394 | 45,334 | 9,865 |
8 | 97,303 | 8,684 | 88,807 | 18,547 |
9 | 107,406 | 11,073 | 96,333 | 19,764 |
10 | 101,718 | 13,636 | 88,071 | 18,946 |
11 | 103,437 | 10,053 | 93,630 | 20,023 |
12 | 98,299 | 11,585 | 86,895 | 17,756 |
13 | 101,329 | 10,037 | 91,512 | 19,799 |
14 | 99,473 | 9,902 | 89,783 | 9,100 |
15 | 102,728 | 9,382 | 91,150 | 3,674 |
16 | 64,315 | 7,964 | 56,278 | 10,835 |
Below are the queries for the above-mentioned artifacts; "run" can be adjusted to the desired job run:
    # Videos submitted for the migration, excluding videos that have no transcripts
    index=prod-edx "Transcript Migration" "run=16" "transcripts-migration-tasks-submitted"
    # Videos without transcripts
    index=prod-edx "Transcript Migration" "run=16" "transcripts-migration-tasks-submitted"
    # Videos that completed transcripts migration
    index=prod-edx "Transcript Migration" "run=16" "video-transcripts-migration-complete-for-a-video"
    # Number of created external videos (i.e. videos that do not have edx_video_id set)
    index=prod-edx "Transcript Migration" "run=16" "generated-edx-video-id"
Transcripts Migration Stats
A total of ~990,378 transcripts were successfully migrated from content-store to S3. A small number of migrations also failed: 2,806 transcripts, which is 0.283% of the successfully migrated count. The per-run breakdown is below.
Run # | Number of successfully migrated transcripts | Number of transcripts whose migration failed | Transcripts with no content to migrate |
---|---|---|---|
1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 |
3 | 276 | 0 | 0 |
4 | 242 | 0 | 2 |
5 | 11,791 | 81 | 0 |
6 | 54,688 | 4 | 54 |
7 | 50,213 | 3 | 65 |
8 | 98,717 | 8 | 55 |
9 | 109,211 | 2 | 158 |
10 | 99,990 | 3 | 75 |
11 | 103,298 | 5 | 73 |
12 | 96,878 | 9 | 119 |
13 | 101,484 | 5 | 115 |
14 | 99,206 | 0 | 100 |
15 | 101,217 | 2,495 | 97 |
16 | 63,167 | 191 | 52 |
Below are the queries for the above-mentioned artifacts; "run" can be adjusted to the desired job run:
    # Number of successfully migrated transcripts
    index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-succeeded-for-a-video"
    # Number of transcripts whose migration failed
    index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-failed-with-unknown-exc"
    # Transcripts with no content to migrate
    index=prod-edx "Transcript Migration" "run=16" "video-transcript-migration-failed-with-known-exc"
Run # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Courses | 0 | 0 | 2 | 2 | 100 | 500 | 500 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 663 |
Videos | 0 | 0 | 276 | 244 | 10432 | 45029 | 45334 | 88807 | 96333 | 88071 | 93630 | 86895 | 91512 | 89783 | 91150 | 56278 |
Transcripts | 0 | 0 | 276 | 242 | 11791 | 54688 | 50213 | 98717 | 109211 | 99990 | 103298 | 96878 | 101484 | 99206 | 101217 | 63167 |
Failures | 0 | 0 | 0 | 0 | 81 | 4 | 3 | 8 | 2 | 3 | 5 | 9 | 5 | 0 | 2495 | 191 |
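The transcript totals and the failure percentage quoted in this section can be re-derived from the Transcripts and Failures rows above. The snippet below is a sketch of that arithmetic; the lists are copied by hand from the table and the variable names are illustrative.

```python
# Per-run values for runs 1..16, copied from the summary table above.
transcripts = [0, 0, 276, 242, 11791, 54688, 50213, 98717, 109211,
               99990, 103298, 96878, 101484, 99206, 101217, 63167]
failures = [0, 0, 0, 0, 81, 4, 3, 8, 2, 3, 5, 9, 5, 0, 2495, 191]

total_migrated = sum(transcripts)
total_failed = sum(failures)

# Failures expressed as a percentage of successfully migrated transcripts.
failure_pct = 100 * total_failed / total_migrated

# Share of all failures that came from run 15 (index 14 in the list).
run_15_share = 100 * failures[14] / total_failed

print(total_migrated)            # 990378
print(total_failed)              # 2806
print(f"{failure_pct:.3f}%")     # 0.283%
print(f"{run_15_share:.0f}% of failures came from run 15")
```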
Let's talk about the 0.283% failures (i.e. 2,806 transcripts). This is taken as the maximum number of transcripts that might not have been migrated due to the exceptions that occurred. The following is what I observed while manually verifying the errors:
Migration failures falling into the categories below can be ignored:
- Integrity error raised due to race conditions
- For a few legacy courses, transcripts are stored in a different format than their actual content-type (e.g. sjson content in SRT file)
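The second category (sjson content stored in a file that claims to be SRT) can be spotted with a simple heuristic. The sketch below is not the migration job's actual handling. It assumes that sjson is a JSON object with "start", "end" and "text" lists, and the function name is hypothetical.

```python
import json


def looks_like_sjson(content):
    """Heuristically check whether transcript text is sjson rather than SRT.

    Assumption: sjson is a JSON object carrying "start", "end" and "text" keys.
    """
    try:
        data = json.loads(content)
    except ValueError:
        # Not valid JSON at all, so it cannot be sjson.
        return False
    return isinstance(data, dict) and {"start", "end", "text"} <= set(data)


srt_text = "1\n00:00:00,000 --> 00:00:02,000\nHello\n"
sjson_text = '{"start": [0], "end": [2000], "text": ["Hello"]}'

print(looks_like_sjson(srt_text))    # False
print(looks_like_sjson(sjson_text))  # True
```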
Also, the 15th run has significantly more failures than the other runs; most of these are for non-English transcript languages that failed on decoding with utf-8-sig.
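To illustrate how a utf-8-sig decode can fail for non-English text, the sketch below decodes the same phrase from two byte encodings. The source does not say which encodings the failing transcripts actually used; Latin-1 is an assumed example.

```python
import codecs

text = "Ol\u00e1 mundo"  # "Olá mundo"

# UTF-8 bytes with a leading BOM decode cleanly; utf-8-sig strips the BOM.
utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
decoded = utf8_with_bom.decode("utf-8-sig")
print(decoded == text)  # True

# Assumed example: the same text saved as Latin-1 is not valid UTF-8,
# so decoding it with utf-8-sig raises UnicodeDecodeError.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8-sig")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"decode failed: {exc.reason}")
```

ASCII-only transcripts are byte-identical in Latin-1 and UTF-8, so they decode fine either way. That is consistent with the failures clustering in non-English languages.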