Grade Task Error Recovery

This document describes a system for recomputing grades after a system error.  Such errors can occur if external systems fail, or if a bug is introduced into the grading code.

Problem Statement

If there is an issue when updating a learner's (subsection or course) grade, either an incorrect grade is persisted or the grade isn't persisted at all (if an exception is thrown). It is important for grades to be persisted correctly, and any errors that occur must be recoverable.

Solution

The recovery strategy depends on whether the issue is loud (raises an exception and fails to record a grade) or silent (fails quietly and records an incorrect grade without an exception).  The two cases are described separately below.

Loud: Exception raised, Updated grade not persisted

Problem

Grades are updated in a background Celery task. If an exception occurs in the grade recalculation task, the updated grade is not persisted.  If a previous value is already persisted for that grade, the user will continue to see the old value until they submit another problem affecting that grade.  If that grade never gets updated again, the stale (and now incorrect) value will remain persisted indefinitely.

Solution

  1. Immediately after failure, we retry the failed task 3 times, with a delay of a few seconds.  This resolves many intermittent failures, such as:
    1. Transactional race condition with 2 processes updating grades for a user at the same time.
    2. Race condition where the background process tries to update the grade before the primary process completes its transaction.
  2. For failures that don't automatically correct themselves after 3 retries (as tried above), the following additional process is used:
    1. We automatically insert a record containing the relevant information for the failed task into a MySQL table (arguments to the Celery task, datetime attempted, the error that occurred, and a flag indicating that the problem has not been resolved).
    2. A management command is either manually run or automatically run (e.g., each day via celerybeat or a Jenkins job) that:
      1. queues a grade calculation task for each unresolved task found in the table.
      2. emails the relevant devs (tnl-beryl team) a notification that there are errors in the grading system.  (This could create JIRA tickets directly if desired, but would require more logic to ensure that duplicates don't get created.)
      3. deletes records that have been resolved for more than 30 days.
    3. For each task that was re-run:
      1. If the task succeeds, it will mark the record as resolved.
      2. If the task fails, no further action will be taken, and the record will remain in the queue to be retried later.
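
The flow above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: the names FailedTaskRecord, run_with_retries, and requeue_unresolved are hypothetical, a plain Python list stands in for the MySQL table, tasks run synchronously instead of through Celery, and the few-second delay between retries and the email notification are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

MAX_IMMEDIATE_RETRIES = 3                 # "retry the failed task 3 times"
RESOLVED_RETENTION = timedelta(days=30)   # resolved records are kept for 30 days


@dataclass
class FailedTaskRecord:
    """Hypothetical row in the failed-task table."""
    task_kwargs: Dict[str, str]            # arguments to the grading task
    attempted_at: datetime                 # when the task was attempted
    error: str                             # the error that occurred
    resolved_at: Optional[datetime] = None  # None means "not yet resolved"


def run_with_retries(task: Callable[..., None], task_kwargs: Dict[str, str],
                     table: List[FailedTaskRecord], now: datetime) -> bool:
    """Run the task, retrying immediately; record it if every attempt fails."""
    last_error: Optional[Exception] = None
    for _attempt in range(1 + MAX_IMMEDIATE_RETRIES):
        try:
            task(**task_kwargs)
            return True
        except Exception as exc:  # a real task would catch narrower errors
            last_error = exc
            # A real system would wait a few seconds here before retrying.
    table.append(FailedTaskRecord(dict(task_kwargs), now, repr(last_error)))
    return False


def requeue_unresolved(task: Callable[..., None],
                       table: List[FailedTaskRecord], now: datetime) -> int:
    """Management-command sketch: rerun unresolved tasks, prune old records.

    Returns the number of tasks that were rerun, which a caller could use
    to decide whether to send a notification.
    """
    unresolved = [record for record in table if record.resolved_at is None]
    for record in unresolved:
        try:
            task(**record.task_kwargs)
        except Exception:
            continue  # record stays unresolved and is retried on the next run
        record.resolved_at = now  # success: mark the record as resolved
    # Delete records that have been resolved for more than 30 days.
    table[:] = [
        record for record in table
        if record.resolved_at is None
        or now - record.resolved_at <= RESOLVED_RETENTION
    ]
    return len(unresolved)
```

The property this sketch is meant to show is that a record only leaves the unresolved set when a rerun is observed to succeed, so a failure cannot be dropped silently between runs.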

Rationale

The proposed solution is designed to be robust, and to require no manual intervention.

  • Storing the task in a table allows failed tasks to be found and requeued automatically.
  • It also ensures that errors do not get dropped until we explicitly set the resolution flag. 
  • Running all unresolved tasks daily ensures that we don't have to run a management command when a fix is published.

We considered relying on Celery retries alone as an alternative, but rejected that approach because it does not provide the same level of robustness: the task would be dropped each day unless a retry is explicitly triggered.  We prefer a system that continues to retry until it knows it has succeeded over a system that retries only as long as it knows it is failing.

Limitations

  • If the management command to retry failed tasks is automatically run, this system performs extra work.  All unresolved tasks are rerun daily whether or not a fix exists.  As we intend the grade system to be robust, any failures will be considered high priority (CAT-2 or higher), and hopefully resolved quickly.

  • Grading tasks are run independently, which may result in more database queries than if they were batched together.  This is an implementation issue rather than a design problem, and could be improved in the future.

  • When a task retries, it will use the latest version of the course to calculate the grade.  If the course has changed or due dates have passed in the interim, this may cause incorrect results or unrecoverable errors.  This has always been an issue, but its severity has been reduced by the introduction of content hiding features which allow course teams to make content invisible to users, but still available to the grading infrastructure.  Resolving this fully would require the ability to access archival course versions from module-store, and the ability to bypass due dates (so long as the initial submission happened before the due date).

Silent: Exception not raised, Incorrect grade persisted

Problem

If a bug released to production results in an incorrect grade being computed and saved, we must retroactively update the impacted persistent grades once the bug is discovered and fixed.

Solution

Once the fix is deployed to production, manually run the reset_grades management command, which accepts the following options:

  • --courses: Reset grades for the list of courses provided.
  • --all_courses: Reset grades for all courses.
  • --modified_start: Reset only those grades that were updated since the given date.
  • --modified_end: Reset only those grades that were updated before the given date.
  • --delete: Actually reset the requested grades; that is, delete all the rows for the requested grades.
  • --dry_run: Just output the query results of the requested grades, without actually deleting them.
  • --db_table: Choose a specific table from which to delete grades: either 'subsection' or 'course'. If this option is absent, the command will apply to both tables.
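
As a rough illustration of how these options fit together, here is a standalone argparse sketch. It is not the real command's code: the YYYY-MM-DD date format, the space-separated course list, and the mutual exclusion between --courses/--all_courses and between --delete/--dry_run are all assumptions made for this example, and build_parser and parse_date are hypothetical names.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> datetime:
    # Assumption: dates are given as YYYY-MM-DD; the source does not say.
    return datetime.strptime(value, "%Y-%m-%d")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed above (sketch only)."""
    parser = argparse.ArgumentParser(prog="reset_grades")

    # Assumption: a course list and "all courses" cannot be combined.
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument("--courses", nargs="+", metavar="COURSE_KEY",
                       help="Reset grades for the listed courses.")
    scope.add_argument("--all_courses", action="store_true",
                       help="Reset grades for all courses.")

    parser.add_argument("--modified_start", type=parse_date,
                        help="Only grades updated since this date.")
    parser.add_argument("--modified_end", type=parse_date,
                        help="Only grades updated before this date.")

    # Assumption: deleting and dry-running cannot be combined.
    action = parser.add_mutually_exclusive_group()
    action.add_argument("--delete", action="store_true",
                        help="Delete the rows for the requested grades.")
    action.add_argument("--dry_run", action="store_true",
                        help="Only report what would be deleted.")

    parser.add_argument("--db_table", choices=["subsection", "course"],
                        help="Limit the command to one table (default: both).")
    return parser
```

For example, `build_parser().parse_args(["--all_courses", "--modified_start", "2017-01-01", "--dry_run"])` yields a namespace with all_courses and dry_run set and db_table left as None, which corresponds to applying the command to both tables.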

Recovery Scenarios

Here are a few examples of bugs that may have escaped into production, with the corresponding recovery steps.

  1. Issue found in scoring grades for a particular block-type.  Bug was released on date1 and fixed on date2.
    • Find all affected courses using CourseGraph.
    • reset_grades --courses <course-list> --modified_start <date1> --modified_end <date2>
  2. Issue found in scoring grades for particular types of courses (e.g., with certain configuration settings).  Bug was released on date1 and fixed on date2.
    • Find all affected courses using CourseGraph.
    • reset_grades --courses <course-list> --modified_start <date1> --modified_end <date2>
  3. Issue found in scoring grades for any type of block or course.  Bug was released on date1 and fixed on date2.
    • reset_grades --all_courses --modified_start <date1> --modified_end <date2>
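
To make the selection in these scenarios concrete, the sketch below filters in-memory rows by course and modification window. It is only one plausible reading of the options: GradeRow and select_impacted are hypothetical names, the real command queries database tables, and treating both date bounds as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Collection, Iterable, List, Optional


@dataclass(frozen=True)
class GradeRow:
    """Hypothetical stand-in for a persisted grade row."""
    course_id: str
    user_id: str
    modified: datetime


def select_impacted(rows: Iterable[GradeRow],
                    courses: Optional[Collection[str]] = None,
                    modified_start: Optional[datetime] = None,
                    modified_end: Optional[datetime] = None) -> List[GradeRow]:
    """Return the rows a reset would touch.

    courses=None corresponds to --all_courses; a missing date bound means
    the window is open on that side. Bounds are inclusive (an assumption).
    """
    selected = []
    for row in rows:
        if courses is not None and row.course_id not in courses:
            continue
        if modified_start is not None and row.modified < modified_start:
            continue
        if modified_end is not None and row.modified > modified_end:
            continue
        selected.append(row)
    return selected
```

In scenarios 1 and 2, courses would be the list found through CourseGraph and the window would run from date1 to date2; in scenario 3, courses is left as None so that every course is considered.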

Limitations

  • The management command simply deletes the impacted grades, rather than enqueuing tasks to update them.  This means the impacted grades for a course will not be recomputed and re-persisted unless they are explicitly updated (either when users submit a problem or when instructors trigger a re-score of a problem).  
    • A future version of the management command can address this by iterating through all impacted grades and enqueuing tasks to update them.  That change would require creating a new Celery task for recalculating grades (one that takes in subsection information instead of problem block information) and load-testing the management command, since the iteration process could be much slower than a bulk-delete operation.
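
A rough sketch of what that future behavior might look like is shown below. Nothing here exists yet: enqueue_recalculations is a hypothetical helper, the (user_id, course_id, subsection_id) tuple is an assumed shape for "subsection information", and the enqueue callable stands in for the new Celery task (for instance, that task's delay method).

```python
from typing import Callable, Iterable, Tuple

# Hypothetical identifier for one impacted subsection grade.
GradeKey = Tuple[str, str, str]  # (user_id, course_id, subsection_id)


def enqueue_recalculations(impacted: Iterable[GradeKey],
                           enqueue: Callable[..., object]) -> int:
    """Enqueue one recalculation per impacted grade instead of deleting it.

    `impacted` is consumed lazily, so it can be a generator over a large
    query result. Returns the number of tasks enqueued.
    """
    total = 0
    for user_id, course_id, subsection_id in impacted:
        # One task per grade: this per-row iteration is the part that may be
        # much slower than a bulk delete and would need load-testing.
        enqueue(user_id=user_id, course_id=course_id,
                subsection_id=subsection_id)
        total += 1
    return total
```

Passing the enqueue function in as a parameter keeps the sketch independent of any task queue, and also makes it straightforward to substitute a recording function when load-testing or dry-running.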