VEDA Escalations
The VEDA SLA is currently 24 hours. Video encoding is often non-deterministic: a single video can fail encoding once and succeed on the following run without any change to code or encoding strategy. However, if a video consistently fails to complete, this likely points to a deeper problem with VEDA itself.
Splunk
VEDA is currently logging to Splunk under index=prod-edx-veda.
This index is searchable via either Studio ID (looks like 99347674-28ed-4390-8cf0-f07b42961798) or VEDA ID (looks like UCSUCSDX2017-V005200).
A good rule of thumb is to add an 'OR' clause when you know both IDs, to get complete logging for the video as it moves through the system. Errors in a given phase can point to a software bug, a faulty detection method, or a malformed setting. Tracing the video through the following phases should indicate the origin of a problem if a video fails to complete in a timely fashion.
- Discovery
- Ingest
- Enqueue
- Encode
- Deliver
- HEAL
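The 'OR' clause rule of thumb can be sketched as a small helper that builds the search string. This is a minimal sketch: the function name `build_veda_search` is hypothetical, and quoting each ID as a bare search term is an assumption about how the IDs appear in the log events, not a documented VEDA log format.

```python
def build_veda_search(studio_id=None, veda_id=None, index="prod-edx-veda"):
    """Return a Splunk search string that matches either known video ID.

    Hypothetical helper: assumes the IDs appear verbatim in the raw events.
    """
    # Keep only the IDs that were actually supplied.
    ids = [i.strip() for i in (studio_id, veda_id) if i and i.strip()]
    if not ids:
        raise ValueError("at least one of studio_id or veda_id is required")

    # Quote each ID so it is treated as one literal term.
    terms = " OR ".join('"{}"'.format(i) for i in ids)
    if len(ids) > 1:
        # Parenthesize so the OR applies only to the IDs, not to the index.
        terms = "({})".format(terms)
    return "index={} {}".format(index, terms)


if __name__ == "__main__":
    print(build_veda_search(
        studio_id="99347674-28ed-4390-8cf0-f07b42961798",
        veda_id="UCSUCSDX2017-V005200",
    ))
```

With both example IDs, this prints index=prod-edx-veda ("99347674-28ed-4390-8cf0-f07b42961798" OR "UCSUCSDX2017-V005200"), which can be pasted into the Splunk search bar.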
Splunk Alerts
Video IDs with uncompleted encodes that are older than 25 hours are sent to veda-dev@edx.org.
Runsheets:
I. "A single video isn't completing"
Search for the Studio ID in Splunk, or log into the VEDA django admin and search for the provided Studio ID. It should correspond to a VEDA-generated ID that is human-parsable:
There is no VEDA record corresponding to the Studio generated ID:
The video upload has failed, and the course team must reupload the video
The VEDA status is "Complete", but not showing "READY" in Studio:
If this is an isolated incident, this is likely a CAP problem, and the video will show "READY" in Studio the next time the 24-hour maintenance process ("HEAL") runs, as HEAL also includes a step for VEDA/VAL data parity. VAL is creaky and sometimes exceeds its 60-second timeout window.
If the VEDA status is "In Progress" or "Active Transcode":
There is a missing encode. If the video "Process Start" field is more than 24 hours old, then the encode has been reattempted and failed. This indicates a problem with a specific encode and/or video, and generally not a systemic problem.
Search the URL table for records corresponding to the VEDA generated ID (e.g. "UCSUCSDX2017-V005200")
A complete file should show five encodes if the course is youtube enabled, or four if not.
If there are 0 encodes or only an HLS encode, then the video file may be corrupt, or the read metadata might be jumbled. (QA for HLS is less precise than for static files)
Reading the delivery node log and the encode node logs might provide some context.
MOST COMMON: If youtube is the only encode missing, then an issue exists with the youtube version of the file. If the file hasn't shown up as a "youtube duplicate", then reading the youtube logs might be appropriate to glean further knowledge. It's possible that the file is a duplicate, but the status is not catching. You can check the logs in Splunk under the prod-edx-veda index to understand where the code broke.
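The decision steps in runsheet I can be summarized as a small triage function. This is a sketch under stated assumptions, not VEDA code: the `record` dict shape, the `process_start` key, and the encode names "hls" and "youtube" are hypothetical stand-ins for whatever the VEDA django admin and URL table actually show.

```python
from datetime import datetime, timedelta, timezone

# The runsheet treats a "Process Start" older than 24 hours as already retried.
RETRY_WINDOW = timedelta(hours=24)


def triage_video(record, encodes, youtube_enabled, now=None):
    """Map the runsheet I checks to a short diagnosis string.

    record: None, or a dict with 'status' and 'process_start' (aware datetime).
    encodes: encode names found in the URL table (names are assumptions).
    """
    now = now or datetime.now(timezone.utc)
    if record is None:
        return "no VEDA record: upload failed, ask the course team to reupload"
    if record["status"] == "Complete":
        return "complete in VEDA: Studio should show READY after the next HEAL run"
    if record["status"] not in ("In Progress", "Active Transcode"):
        return "status not covered by the runsheet: inspect the record manually"

    found = set(encodes)
    expected = 5 if youtube_enabled else 4  # five encodes with youtube, else four
    if not found or found == {"hls"}:
        hint = "source file may be corrupt or its metadata misread"
    elif youtube_enabled and "youtube" not in found and len(found) == expected - 1:
        hint = "only the youtube encode is missing: check youtube logs and duplicate status"
    elif len(found) >= expected:
        hint = "all expected encodes present: case not covered by the runsheet"
    else:
        hint = "{} of {} encodes present: read the encode and delivery node logs".format(
            len(found), expected)

    if now - record["process_start"] > RETRY_WINDOW:
        hint += " (already reattempted; likely specific to this video or encode)"
    return hint
```

The function only restates the runsheet's branches; cases the runsheet does not mention are reported as such rather than guessed at.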
II. "My whole course isn't completed"
Find an ID associated with this course using the steps above and check to see if the file is uploaded and has a few successful encodes.
Most likely, the youtube setup is either incomplete (A bad Youtube Partner CMS owner/channel pairing) or the course record in VEDA is misconfigured.
- If the Youtube CMS is paired wrong, a member of the author support team can resolve the issue.
- If the VEDA Course Record is misconfigured, it can be fixed via the VEDA django admin.
NOTE: Due to our evolving relationship with google/youtube, occasionally Youtube Partner CMS accounts can become unpaired from their partner/owners. If a single new video from an older course is failing to upload to youtube, even if previous videos worked (>6 months ago), this may be the case.
Once the issue is resolved, reprocess the course via HEAL.
III. "Nothing is completing"
Most likely a service is broken and hasn't paged.
VEDA admin can provide some context.
- If there are no youtube IDs, then youtube_callback is most likely broken.
- If no new videos have come in, then ingest is likely the issue.
- If there are no encodes at all, then delivery is broken.
Ingest and Youtube callback are the most fragile, so check them first (they both live on the same node, so it's relatively easy). The logs on each machine can provide some context, as a recent code change might be breaking something.
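The symptom-to-service mapping for runsheet III can be written down as a short checklist function. This is only an illustrative sketch: the function name and its three boolean inputs are hypothetical, and an operator would fill them in by looking at the VEDA admin.

```python
def suspect_services(has_new_videos, has_youtube_ids, has_encodes):
    """Return the services to check, most fragile first, per runsheet III.

    All three flags are observations read off the VEDA admin by hand.
    """
    suspects = []
    # Ingest and youtube_callback share a node and are the most fragile,
    # so they are listed first.
    if not has_new_videos:
        suspects.append("ingest")
    if not has_youtube_ids:
        suspects.append("youtube_callback")
    if not has_encodes:
        suspects.append("delivery")
    return suspects


if __name__ == "__main__":
    # New videos are arriving, but nothing has a youtube ID or any encode.
    print(suspect_services(has_new_videos=True, has_youtube_ids=False, has_encodes=False))
```

An empty list means none of the three symptoms is present, in which case the node logs are the next place to look.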
Alerting:
NewRelic
New Relic records are available via the 'veda_production' app on the edX New Relic site. There are currently no push alerts set up.
https://rpm.newrelic.com/accounts/88178/applications/32434455
Email Alerts
An email alert is sent out via AWS SES if a process crashes. This is sent to the list at veda-dev@edx.org.