...

We considered Celery retries as an alternative, but rejected them because they do not provide the same level of robustness. The task will be dropped each day unless a retry is explicitly triggered. We prefer a system that keeps retrying until it knows it has succeeded over one that retries only as long as it knows it is failing.
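The difference between the two retry styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FailedTaskLog` and its methods are hypothetical names, and the in-memory dictionary stands in for whatever persistent record of unresolved tasks the real system keeps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FailedTaskLog:
    """Hypothetical in-memory stand-in for a persisted record of unresolved tasks."""

    unresolved: Dict[str, Tuple[Callable[..., None], tuple]] = field(default_factory=dict)

    def run(self, task_id: str, func: Callable[..., None], *args) -> bool:
        """Run a task; record it on failure and clear it only on known success."""
        try:
            func(*args)
        except Exception:
            # The entry stays until a later run succeeds, so nothing is
            # dropped just because nobody asked for a retry.
            self.unresolved[task_id] = (func, args)
            return False
        self.unresolved.pop(task_id, None)
        return True

    def retry_unresolved(self) -> List[str]:
        """Rerun every unresolved task and return the ids that still fail."""
        for task_id, (func, args) in list(self.unresolved.items()):
            self.run(task_id, func, *args)
        return sorted(self.unresolved)
```

Calling `retry_unresolved()` on a schedule (for example, daily) gives the "retry until it knows it has succeeded" behavior: an entry leaves the log only after a run completes without raising.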

Limitations

  • If the management command to retry failed tasks is run automatically, this system performs extra work: all unresolved tasks are rerun daily whether or not a fix exists. Because we intend the grade system to be robust, any failure will be treated as high priority (CAT-2 or higher) and should be resolved quickly.

  • Grading tasks are run independently, which may result in more database queries than if they were batched together. This is an implementation issue rather than a design problem, and it could be improved in the future.

  • When a task retries, it uses the latest version of the course to calculate the grade. If the course has changed or due dates have passed in the interim, this may cause incorrect results or unrecoverable errors. This has always been an issue, but its severity has been reduced by the introduction of content hiding features, which allow course teams to make content invisible to users while keeping it available to the grading infrastructure. Resolving this fully would require the ability to access archival course versions from modulestore, and the ability to bypass due dates (so long as the initial submission happened before the due date).

Silent: Exception not raised, Incorrect grade persisted

...

  • --courses: Reset grades for the list of courses provided.
  • --all_courses: Reset grades for all courses.
  • --modified_start: Reset only those grades that were updated since the given date.
  • --modified_end: Reset only those grades that were updated before the given date.
  • --delete: Actually reset the requested grades; that is, delete all the rows for the requested grades.
  • --dry_run: Just output the query results of the requested grades, without actually deleting them.
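As a rough sketch of how these options fit together, the parser below mirrors the flags listed above using Python's standard `argparse`. It is not the actual management command: the `YYYY-MM-DD` date format, the helper names, and the rule that exactly one of `--courses`/`--all_courses` and exactly one of `--delete`/`--dry_run` must be given are all assumptions made for illustration.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> datetime:
    """Parse a date argument. The YYYY-MM-DD format is an assumption."""
    return datetime.strptime(value, "%Y-%m-%d")


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser that mirrors the documented flags."""
    parser = argparse.ArgumentParser(prog="reset_grades")

    # Assumption: the caller names specific courses or all courses, not both.
    scope = parser.add_mutually_exclusive_group(required=True)
    scope.add_argument("--courses", nargs="+", metavar="COURSE_KEY")
    scope.add_argument("--all_courses", action="store_true")

    # Optional window on each grade's last-modified timestamp.
    parser.add_argument("--modified_start", type=parse_date)
    parser.add_argument("--modified_end", type=parse_date)

    # Assumption: the caller must choose between deleting and a dry run.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("--delete", action="store_true")
    action.add_argument("--dry_run", action="store_true")
    return parser
```

Requiring an explicit `--delete` or `--dry_run` in this sketch is a design choice that makes an accidental bulk delete less likely; the real command may behave differently.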

Recovery Scenarios

Here are a few example scenarios of types of bugs that may have escaped into production.

  1. Issue found in scoring grades for a particular block-type.  Bug was released on date1 and fixed on date2.
    • Find all affected courses using CourseGraph.
    • reset_grades --courses <course-list> --modified_start <date1> --modified_end <date2>
  2. Issue found in scoring grades for particular types of courses (e.g., with certain configuration settings).  Bug was released on date1 and fixed on date2.
    • Find all affected courses using CourseGraph.
    • reset_grades --courses <course-list> --modified_start <date1> --modified_end <date2>
  3. Issue found in scoring grades for any type of block or course.  Bug was released on date1 and fixed on date2.
    • reset_grades --all_courses --modified_start <date1> --modified_end <date2>
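The selection implied by these scenarios (a set of courses plus a modification window) can be sketched as a plain filter. This is an illustration only: `GradeRow` and `select_grades` are hypothetical names, the rows are in-memory objects rather than database records, and treating the window as inclusive of `modified_start` and exclusive of `modified_end` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Collection, Iterable, List, Optional


@dataclass(frozen=True)
class GradeRow:
    """Hypothetical stand-in for one persisted grade row."""

    course_id: str
    user_id: int
    modified: datetime


def select_grades(
    rows: Iterable[GradeRow],
    courses: Optional[Collection[str]] = None,
    modified_start: Optional[datetime] = None,
    modified_end: Optional[datetime] = None,
) -> List[GradeRow]:
    """Return the rows a reset would touch; courses=None means all courses."""
    selected = []
    for row in rows:
        if courses is not None and row.course_id not in courses:
            continue
        # Assumed window: modified_start <= modified < modified_end.
        if modified_start is not None and row.modified < modified_start:
            continue
        if modified_end is not None and row.modified >= modified_end:
            continue
        selected.append(row)
    return selected
```

Scenarios 1 and 2 correspond to passing an explicit course list with both dates; scenario 3 corresponds to leaving `courses` as `None`.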

Limitations

  • This system will perform extra work: all unresolved tasks will be rerun daily whether or not a fix exists. Because we intend the grade system to be robust, any failure will be treated as high priority (CAT-2 or higher) and should be resolved quickly. Running a management command on demand would prevent this, but would require manual intervention.

  • Grading tasks are run independently, which may result in more database queries than if they were batched together. This is an implementation issue rather than a design problem, and it could be improved in the future.

  • When a task retries, it uses the latest version of the course to calculate the grade. If the course has changed or due dates have passed in the interim, this may cause incorrect results or unrecoverable errors. This has always been an issue, but its severity has been reduced by the introduction of content hiding features, which allow course teams to make content invisible to users while keeping it available to the grading infrastructure. Resolving this fully would require the ability to access archival course versions from modulestore, and the ability to bypass due dates (so long as the initial submission happened before the due date).
  • This design doesn't recover from bugs that don't raise exceptions. For instance, if we introduced a bug that resulted in an incorrect grade value, there would be no exception, and the task would not be logged. The management command simply deletes the impacted grades, rather than enqueuing tasks to update them. This means the impacted grades for a course will not be recomputed and re-persisted unless they are explicitly updated (either when users submit a problem or when instructors trigger a re-score of a problem).
    • A future version of the management command can address this by iterating through all impacted grades and enqueuing tasks to update them. That change would require creating a new Celery task for recalculating grades (one that takes subsection information instead of problem block information) and load-testing the management command, since iterating over grades could be much slower than a bulk-delete operation.
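The future behavior described in the last bullet might look roughly like the sketch below. Everything here is hypothetical: `enqueue_recalculations`, the `(course_id, user_id, subsection_id)` key shape, and the `enqueue` callable are illustrative names, and no such task exists in the design above. In a Celery deployment, `enqueue` could be the `delay` method of the new recalculation task.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical key identifying one impacted subsection grade.
GradeKey = Tuple[str, int, str]  # (course_id, user_id, subsection_id)


def enqueue_recalculations(
    grades: Iterable[GradeKey],
    enqueue: Callable[[str, int, str], None],
) -> int:
    """Enqueue one recalculation per impacted grade and return the count.

    Unlike a bulk delete, this visits every impacted grade individually,
    which is why the text above calls for load-testing.
    """
    count = 0
    for course_id, user_id, subsection_id in grades:
        # With Celery, `enqueue` might be `some_task.delay` (an assumption).
        enqueue(course_id, user_id, subsection_id)
        count += 1
    return count
```

Passing the enqueue function in as a parameter keeps the sketch runnable without a broker and makes the per-grade cost easy to measure in a test.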