Submission Archive - Fix handling of ghost files
TL;DR - Ghost files can appear in the submission archive for students who experience upload failures. Update the submission archive to omit missing files and warn about them.
Why does this happen?
An upload attempt saves the file info in student state before the file is uploaded to S3. If the upload fails, the failed upload is not cleared from the student state.
This can create a mismatch: file pointers exist in student state that don't exist in S3. We hide these on the frontend so they are invisible to both the student and reviewers, but the data is still there.
When we generate the submission archive, we get the file URL (edx-ora2/openassessment/data.py) by combining the LMS_ROOT_URL and the path-to-file. Naturally, if the path does not exist, we end up with just the LMS_ROOT_URL!
These "ghost files" have the name of the failed file upload but contain the HTML of the edX home page.
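The failure mode above can be illustrated with a minimal sketch. It assumes the URL is built with something equivalent to `urllib.parse.urljoin`; the actual code in data.py may combine the two values differently. The `LMS_ROOT_URL` value and the `build_file_url` helper are hypothetical, for illustration only.

```python
from urllib.parse import urljoin

# Hypothetical value; the real setting comes from the LMS configuration.
LMS_ROOT_URL = "https://lms.example.com"


def build_file_url(path_to_file):
    """Combine the LMS root with a file path (illustrative helper, not ORA code)."""
    return urljoin(LMS_ROOT_URL, path_to_file)


# A file that exists resolves to its own URL.
good_url = build_file_url("/media/submissions/essay.pdf")

# A ghost file has no path, so the result is just the LMS root:
# fetching it returns the home page HTML, saved under the ghost file's name.
ghost_url = build_file_url("")
```

With an empty path the join returns the base URL unchanged, which matches the symptom: a file named like the failed upload whose contents are the edX home page.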
This causes instructors to see extra files in the submission archive, which appear as broken files due to the filetype mismatch (a PDF or PNG extension for HTML code), and can make it hard to determine the correctly uploaded file.
If a file is not found (hint: get_download_url() raises or returns empty), the submission archive should omit the file.
The Download CSV contains all files, noting the ones that failed to upload.
Log events for failed file retrievals so we can make more informed decisions on a fix moving forward.
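A rough sketch of how these three requirements could fit together is shown here. It is not the ORA implementation: `partition_files` is a hypothetical helper, and `get_download_url` is passed in as a plain callable so the example stays self-contained. The only behavior assumed from the ticket is that the lookup either raises or returns an empty value for a ghost file.

```python
import logging

logger = logging.getLogger(__name__)


def partition_files(file_keys, get_download_url):
    """Split file keys into (found, missing).

    found:   list of (key, url) pairs to include in the archive
    missing: list of keys to omit from the archive and flag in the CSV
    """
    found, missing = [], []
    for key in file_keys:
        try:
            url = get_download_url(key)
        except Exception:
            # The lookup raised: log with traceback and treat the file as missing.
            logger.warning("Download URL lookup raised for %s", key, exc_info=True)
            missing.append(key)
            continue
        if not url:
            # The lookup returned None or an empty string: also a ghost file.
            logger.warning("No download URL found for %s", key)
            missing.append(key)
            continue
        found.append((key, url))
    return found, missing
```

The archive would then be built from `found` only, while `missing` gives the CSV the list of files to mark as failed uploads. Logging both failure modes separately should make it easier to tell how often each one occurs.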
Steps to Reproduce
Reason for Variance
User Impact Summary
Here’s a related CR for some file submission failures, with some comment threads about the submission archive.