S3 Transcripts Metadata

Note

The finalized design decisions are recorded in the Video Transcript Design document; the approaches below were the initial draft proposals.

Multiple Video Sources

A Video Component with multiple video sources should not have different transcript content for each source in the same language. For instance, it does not make sense for a video component to have multiple English transcripts, each with different content.

Advanced Settings

Users must add a video source before adding any transcript from the Video Component's Advanced Settings, because S3 transcripts are specific to a video source. A video source needs to exist before any transcript can be linked to it.

Video Component Advanced Settings Input Fields	Internal/Technical Field Name
"Default Timed Transcript"	`sub`
"Transcript Languages"	`transcripts`

Background

The video component has sub and transcripts fields. The sub field stores English transcript metadata, i.e. the filename of the en transcript. The transcripts field is a dict that holds non-English transcript metadata, i.e. each transcript's filename keyed by its language, e.g. {'es': 'transcript_file_name.srt'}. The transcript content for 'transcript_file_name.srt' is stored in the Content-store.
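The shape of these two fields can be sketched as follows. This is a minimal illustration only: the plain dict stands in for the video component, and the `transcript_filenames` helper and the English filename are hypothetical, not part of the platform code.

```python
# Hypothetical stand-in for a video component's transcript-related fields.
component = {
    "sub": "en_transcript_file_name",                   # English transcript (assumed name)
    "transcripts": {"es": "transcript_file_name.srt"},  # non-English, keyed by language
}


def transcript_filenames(component):
    """Return {language: filename}, combining `sub` (English) with `transcripts`."""
    result = dict(component.get("transcripts", {}))
    if component.get("sub"):
        result["en"] = component["sub"]
    return result


all_transcripts = transcript_filenames(component)
```

The point of the sketch is that English and non-English metadata live in two separate fields, so any code that needs "all transcripts" has to read both.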

Moving to S3 from Content-Store

We are moving transcript storage from the Mongo Content-store to S3. The transcript metadata (i.e. transcript language, the video id to which the transcript is linked, etc.) now lives in edx-val's VideoTranscript data model. Previously, it was kept on the Video Component in the sub and transcripts fields mentioned above.

Approaches:

Approach # 1

Request transcripts from VAL's VideoTranscript data model every time the video component loads, and do not store them as sub and transcripts on the Video Component itself.

Pros:

We are also planning to upload transcripts for VAL videos directly from the Video Uploads Page. With this approach, a transcript uploaded there for a VAL video will show up on every Video Component that uses that edx_video_id.
For auto-generated transcripts, the generated transcript will automatically show up on every Video Component that uses that edx_video_id.
The VideoTranscript data model becomes the single source of truth; we will not manage transcript metadata in multiple places (i.e. VAL VideoTranscript and Video Component).

Cons:

Scenario 1 from "Video Component Scenarios with Contentstore (transcript-related)" will not work
Scenario 6 from "Video Component Scenarios with Contentstore (transcript-related)" will not work
Scenario 7 from "Video Component Scenarios with Contentstore (transcript-related)" will not work
We won't be able to delete any transcript from the Video Component, because all deletion-related UI is strictly bound to Video Component metadata (i.e. sub and transcripts). For this reason, we should go with approach # 2.

Implementation:

On loading a video component, we will retrieve its transcripts from edx-val and include them in the transcripts field's context (which is rendered on the Video Advanced Settings tab). Whenever a video component is saved, we will look for the transcripts that came from VAL and discard them from the transcripts field so that they are not persisted on the Video Component.
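The load and save steps described above could look roughly like this. It is a sketch under stated assumptions: `VAL_TRANSCRIPTS` and `val_transcripts_for` are in-memory stand-ins for a lookup against edx-val's VideoTranscript model (not its real API), the component is a plain dict, and letting component entries win over VAL entries for the same language is an assumed policy.

```python
# Hypothetical in-memory stand-in for VideoTranscript records:
# {edx_video_id: {language: transcript name}}
VAL_TRANSCRIPTS = {"video-123": {"en": "val_en.srt", "fr": "val_fr.srt"}}


def val_transcripts_for(edx_video_id):
    """Stand-in for asking VAL which transcripts exist for a video."""
    return dict(VAL_TRANSCRIPTS.get(edx_video_id, {}))


def build_editor_context(component):
    """On load: merge VAL transcripts into the context, leaving the component untouched."""
    context = dict(component["transcripts"])
    for language, name in val_transcripts_for(component["edx_video_id"]).items():
        # Assumed policy: an entry already on the component takes precedence.
        context.setdefault(language, name)
    return context


def strip_val_transcripts(component, submitted):
    """On save: drop entries that came from VAL so they are not persisted."""
    from_val = val_transcripts_for(component["edx_video_id"])
    return {
        language: name
        for language, name in submitted.items()
        if from_val.get(language) != name
    }


component = {"edx_video_id": "video-123", "transcripts": {"es": "local_es.srt"}}
context = build_editor_context(component)
persisted = strip_val_transcripts(component, context)
```

Here `context` contains the component's own entry plus the VAL entries, while `persisted` contains only the component's own entry, which is what keeps VAL as the single source of truth.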

Approach # 2

Copy transcript metadata from VideoTranscript to the Video Component (e.g. on adding a new source) and use it from then on, i.e. do not ask the VideoTranscript data model for transcript metadata every time. When we delete a transcript from the Video Component (i.e. from the Advanced Settings "Default Timed Transcript" or "Transcript Languages" fields), it will not be deleted from VAL (S3); instead, it will only be unlinked from the Video Component.

Why soft-delete?

Suppose deleting a transcript from a video component also removed the corresponding transcript from the VideoTranscript data model and S3. Then, if a transcript were deleted from one video component's advanced settings, the other video components using the same video source(s) would go stale, because they also rely on the removed transcript.
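The soft-delete behaviour can be illustrated with two components that share one video source. This is a sketch only: `VAL_TRANSCRIPTS` is an in-memory stand-in for the VideoTranscript data model, the components are plain dicts, and `unlink_transcript` is a hypothetical helper name.

```python
# Hypothetical stand-in for VideoTranscript records shared by several components.
VAL_TRANSCRIPTS = {"video-123": {"en": "val_en.srt", "es": "val_es.srt"}}


def unlink_transcript(component, language):
    """Soft-delete: remove the language from this component's metadata only.

    Nothing is removed from VAL_TRANSCRIPTS (or, by extension, from S3).
    """
    if language == "en":
        component["sub"] = ""
    else:
        component["transcripts"].pop(language, None)


# Two components using the same edx_video_id, each with its own copy of the metadata.
component_a = {"edx_video_id": "video-123", "sub": "val_en.srt",
               "transcripts": {"es": "val_es.srt"}}
component_b = {"edx_video_id": "video-123", "sub": "val_en.srt",
               "transcripts": {"es": "val_es.srt"}}

unlink_transcript(component_a, "es")
```

After the call, `component_a` no longer lists the Spanish transcript, while `component_b` and the shared record are unchanged, so the second component does not go stale.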

Pros:

Once a source is added to a video component, all of its transcripts become Video Component specific in terms of deletion.
Deleting a transcript from a video component does not remove it from the VideoTranscript data model or from S3; it is only unlinked from this video component (i.e. soft-deleted for this video component).
Covers all the documented scenarios in "Video Component Scenarios with Contentstore (transcript-related)".

Cons:

If we store S3 transcript metadata on the video component as well as in the VideoTranscript data model, then:

The video component may have different transcript metadata than the VideoTranscript data model for the same video source.
For our plan to upload transcripts for VAL videos directly from the Video Uploads Page, a transcript uploaded there for a VAL video will not automatically show up on Video Components that were already using that edx_video_id.
For auto-generated transcripts, the generated transcript will not automatically show up on Video Components that were already using that edx_video_id.

Implementation:

Every added source will bring its transcript metadata from VideoTranscript data model to the Video Component's sub and transcripts.
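A minimal sketch of this copy step, which also shows the drift listed under Cons. As before, `VAL_TRANSCRIPTS` is an in-memory stand-in for the VideoTranscript data model, the component is a plain dict, and `link_source` is a hypothetical helper name rather than existing platform code.

```python
# Hypothetical stand-in for VideoTranscript records.
VAL_TRANSCRIPTS = {"video-123": {"en": "val_en.srt", "es": "val_es.srt"}}


def link_source(component, edx_video_id):
    """On adding a source: copy its transcript metadata onto the component."""
    component["edx_video_id"] = edx_video_id
    for language, name in VAL_TRANSCRIPTS.get(edx_video_id, {}).items():
        if language == "en":
            component["sub"] = name          # English goes to `sub`
        else:
            component["transcripts"][language] = name


component = {"edx_video_id": None, "sub": "", "transcripts": {}}
link_source(component, "video-123")

# A transcript added to VAL later (e.g. from the Video Uploads Page) is not
# picked up, because the component only reads its own copied metadata.
VAL_TRANSCRIPTS["video-123"]["fr"] = "val_fr.srt"
missing_on_component = "fr" not in component["transcripts"]
```

The copy happens once, at link time, which is what makes deletion component-specific and also why later VAL uploads do not appear on components that already use the video.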
