Asynchronous tasks pattern

Target release
Epic	PLAT-161
Document status	DRAFT
Document owner	dmitchellR (Deactivated)
Designer
Developers
QA

Legend: each goal is followed by (H | M | L) to indicate its priority. H means must-have for the MVP. M is borderline between the MVP and a later release. L is needed for a fully functional system but not for the MVP.

Goals

Develop a base pattern & framework for asynchronous tasks to use
- An asynchronous task is one which does not block the user while it runs. The user may navigate to other pages or log out without losing their ability to retrieve results. (H)
  - The pattern must provide a common way to report failures. (H)
  - The pattern must provide a common way for the user to find the job output: any resulting files (reports, tarballs, etc.). (H)
  - The system should use the same notification mechanism to report status and results as the system uses for other asynchronous notifications. (H)
  - The user must be able to check or be notified about (pull v push) the status of the task. (M)
  - The pattern must handle massive fanout such as one job per student without falling over while still supporting restart, upgrade, failover, etc. (M)
  - The reusable pattern must provide a common way to declare the states and state transitions of jobs so that GUIs can show the user what steps will occur as well as which have occurred along with dependencies. (M, need at least pending, in progress, complete, and failed/canceled)
  - The pattern should provide a common way to handle retries or alternative processes on exceptions (L)
  - The system should provide user control for cancelling jobs: (L)
    - Cancelling some processes may leave the course in a broken state, depending on the modulestore. We'll need to decide whether such processes should be refactored.
      - Cancelling an import in split will work correctly, but in old mongo the import will have made some but not all of its changes.
- The platform should not block on queued jobs and could possibly even run the job remotely (H except remoting)
- Jobs should be able to output reports, zip files, or other assets using a common storage, naming, and auth system (H)
- The platform should allow devops or developers to add or configure other storage backends without changing any asynchronous task code (H)
- We should decide on a retention policy for historical job information and perhaps make it configurable (job submission data & the finite state machine transition records) (H)
- The storage system for job output:
  - Job output should associate w/ the job information (H)
  - Course teams should be able to delete job outputs (H)
  - Only authorized users should be able to access job output (H)
  - Course team members should be able to find job output from other team members or past jobs (perhaps configurable per job or by job type?) (L)
  - Task definition developers should be able to indicate that the output is use-once and should be deleted upon download v kept until course ended or a specific date or indefinitely (L)
  - The storage system should allow append so that small subjobs can append results to a base file created by the parent process (M)
- The user should be able to retrieve large results without streaming them through the app server (the app should use a storage system accessible to the webserver w/ appropriate auth) (M, piggyback on assets approach for storage only)
- Asynchronous task definitions (the code defining the task) should be in separate repos without requiring all of edx-platform
  - Asynchronous task definitions should be separately deployable (deployable w/o shutting down the platform) (L)
  - There should be a sound method to upgrade task definitions without breaking queued or running jobs (H)
- The system should include manual job invocation and status checking via command line (L)
- The framework should include testing and debugging documentation or aids (_)
Implement an example using this pattern which others can easily copy to adapt to their use (_)
Implement example for massively parallel subtasks (like bulk email) which iterate over some cursor
Implement this pattern for import and export (H)

Glossary

asynchronous task is a request handler which does not block the user while it runs. The user may navigate to other pages or log out without losing their ability to retrieve results later.

job output are the optional resulting file(s) from the asynchronous task. These include reports, tarballs, images, etc.

job is a specific run of an asynchronous task

job information is the record of who submitted the job when, with what parameters, and the statuses of each step in the job.

Background and strategic fit

Existing Problems:

Course import and export sometimes time out leaving courses in broken states if using old mongo.
Import, export, and the other long-running tasks block the main appserver workers.
Most existing uses of celery don't have recovery and exception handling logic.
Each use of celery uses its own state reporting mechanism.
There's no introspectable state machine for tasks (what states has this submission gone through and what ones are to come).
Result delivery is currently idiosyncratic.
The user has no "control panel" to view all tasks, all results, cancel tasks, find job outputs from prior tasks

Existing asynchronous tasks:

bulk email
certificate generation
grading (rescoring, reset attempts, delete state, calc csv)
data downloads (grades, student profiles)
ora2
discussion/forum notifier
course rerun
video upload and encoding (VAL, veda (sic?))

High level engineering spec

Jobs will use celery and/or remote services. The existing framework for instructor_task may be a good starting point. Evaluating its fit will be part of the project plan.
We will develop a pattern of use which allows independent deployability, remoting, and upgrade in place. Not all of these must exist in the first version, just be possible.
- Upgrade: use a queue per release, each with their own workers. Once the queue is empty, the system kills the workers.
We will use celery and its add-ons for state introspection and high availability.
We will define an interface for notifications & result delivery allowing multiple back-ends
- We will define an auto-notifier on process initiation and termination which jobs can disable (to prevent flooding for highly decomposed jobs)
We will define a mechanism for handling fine-grained highly decomposed jobs so that they do not require proportional memory for tracking (e.g., bulk email w/ one job per recipient)
We may define a thin interface over celery and its add-ons so that we can replace celery with something which can also handle remote services (FSM)
We will need to come up with a versioning mechanism for job workers w/ some level of backward compatibility (capable of handling jobs for the most recent previous version as well as the current version). (could be later release but design should be aware of this future requirement)
Storage: we will use django-storages, but we still need to work out how to meet all of the functional requirements above on top of it. Ensuring proper authorization is crucial (unauthorized users must not be able to find and download results through URL leakage or phishing). I don't see anything in django-storages for handling auth.
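
The queue-per-release upgrade idea above can be sketched in plain Python, without celery. This is only an illustration under stated assumptions: the class `ReleaseQueues`, its methods, and the `jobs.<release>` naming scheme are hypothetical, and in a real deployment each queue would be a celery queue served by its own workers.

```python
from collections import deque


class ReleaseQueues:
    """Hypothetical sketch: one job queue per release, retired once drained."""

    def __init__(self):
        self._queues = {}     # release -> deque of pending job payloads
        self._current = None  # release that receives new submissions

    @staticmethod
    def queue_for(release):
        # Assumed naming scheme; not an existing edx-platform convention.
        return "jobs.{}".format(release)

    def deploy(self, release):
        """Route new jobs to `release`; older queues keep draining."""
        self._queues.setdefault(release, deque())
        self._current = release

    def submit(self, payload):
        """Queue a job on the current release and return its queue name."""
        self._queues[self._current].append(payload)
        return self.queue_for(self._current)

    def work_one(self, release):
        """Simulate a worker for `release` taking one job off its queue."""
        return self._queues[release].popleft()

    def retire_drained(self):
        """Drop empty queues of old releases (conceptually, kill their workers)."""
        drained = [release for release, queue in self._queues.items()
                   if release != self._current and not queue]
        for release in drained:
            del self._queues[release]
        return drained
```

Jobs submitted before an upgrade stay on the old release's queue and are processed by workers running the old task definitions, so queued jobs never meet code they were not written for.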

Future possible extensions

Finite-state machine: We will find or develop a finite-state machine (FSM) engine for generating jobs, submitting to celery, tracking status including exceptions and restarts.
Notifications: notifications will drive off the fsm and use the notification framework which solutions is building.
Management console: (future release)
- the user will need to either just use the notifications in-box or a separate management console for checking state, getting results, reviewing exceptions, and canceling tasks.
- we may eventually want to develop a console to allow team members to find other team members' past or current jobs (results, status, cancellation)

Finite State Machine

There are two flavors: either the FSM controls the flow, or the FSM merely tracks it and provides introspectability. The state granularity should be the finest granularity of state tracking and restart that we need, but no finer. Ideally, we'll be able to make each transition a separate celery or remote service call. Some transitions may be calls to external services and not necessarily go through celery. The catch comes if we need to report states at a finer granularity than makes sense as separately queued tasks (e.g., because one step sets up in-memory structures consumed by the next step, so both must run in the same request on the same worker).

The FSM package must

allow the definition of multiple state machines (e.g., import, export, bulk_email)
- Should allow configuration & specification of celery tasks as well as of notifications
allow instantiation of multiple instances of each state machine (e.g., each import job)
allow code to retrieve the sequence and set of states for each instantiation with current state (introspect the fsm as well as get state info for each instance). This introspectability allows us to write user interfaces which show what's been done as well as what's left to do. It also allows us to debug and manage processes.
- should (not must) allow code to retrieve historical information (when it executed each transition)
- should allow administrative query of all fsm instances with query parameters such as: is in error state, has started but not completed, is waiting to start
- should allow user specific query to find all running fsms for the current user
allow the machine to be a long-running asynchronous process: that is, not require that the process run in a single request but allow the server to cycle and restart the machine (which shouldn't necessarily restart each task but figure out if they're running)
- completion of a task should provoke the fsm to move to the next state
- on error, the fsm should transition to the correct error state not to the next process step
- we may need to mark transition completion on notification from the spawned process rather than on completion of the process if the states are finer-grained than the celery tasks
  - in this case, the celery task may need to directly mark the transition rather than the server doing so (load the fsm instance and mark the transition)
  - or we may need a notification mechanism from the tasks back to the server for it to do the transition marking
- the machine should be able to restart killed transitions
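
As a rough illustration of the requirements above (multiple machine definitions, multiple instances, introspection of the plan and of each instance's history, and an error transition), here is a minimal sketch. The class names, the state names, and the assumption of a single linear success path are all hypothetical; it does no persistence, restart, or celery integration.

```python
from datetime import datetime, timezone


class MachineDefinition:
    """Declares the states and success transitions for one kind of job."""

    def __init__(self, name, initial, transitions, error_state="failed"):
        self.name = name
        self.initial = initial
        self.transitions = transitions  # state -> next state on success
        self.error_state = error_state

    def plan(self):
        """Return the ordered success path so a UI can show the steps to come."""
        path, state = [self.initial], self.initial
        # Stop at a terminal state, or if a transition would revisit a state.
        while state in self.transitions and self.transitions[state] not in path:
            state = self.transitions[state]
            path.append(state)
        return path

    def start(self, owner):
        return MachineInstance(self, owner)


class MachineInstance:
    """One run of a machine (e.g., a single import job)."""

    def __init__(self, definition, owner):
        self.definition = definition
        self.owner = owner
        self.state = definition.initial
        self.history = [(definition.initial, datetime.now(timezone.utc))]

    def _enter(self, state):
        self.state = state
        self.history.append((state, datetime.now(timezone.utc)))

    def complete_step(self):
        """Called when the current step's task finishes successfully."""
        if self.state not in self.definition.transitions:
            raise ValueError("no transition out of {!r}".format(self.state))
        self._enter(self.definition.transitions[self.state])

    def fail_step(self):
        """On error, move to the error state rather than the next step."""
        self._enter(self.definition.error_state)

    def remaining(self):
        """Steps still to come; empty once the job has left the success path."""
        plan = self.definition.plan()
        if self.state not in plan:
            return []
        return plan[plan.index(self.state) + 1:]


# Hypothetical example definition; the intermediate state names are invented.
import_machine = MachineDefinition(
    "import",
    initial="pending",
    transitions={"pending": "unpacking", "unpacking": "importing",
                 "importing": "complete"},
)
```

A GUI could render `import_machine.plan()` as the full list of steps and use an instance's `history` and `remaining()` to mark which have occurred.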

There may be other solutions than FSM (e.g., petri net, task flows)

https://wiki.openstack.org/wiki/DistributedTaskManagement
- well integrated with celery? (looks like it may be abandoned)
- base task flow can possibly handle non-celery remote calls
- simple branching flow specification
- not sure how to introspect
- persistence, restart, & notifications built into base task flow (https://wiki.openstack.org/wiki/TaskFlow)
- may bring in too many packages? (requirements loads a python 3 compatibility lib (six), stevedore, futures, jsonschema, oslo.utils, oslo.serialization)
http://docs.celeryproject.org/en/latest/userguide/canvas.html
- part of celery
- doesn't solve notifications nor introspection (? DependencyGraph appears to be a means to check the result after completion not the plan)
- no persistence w/ restart, no exception workflow but does pass exception through
https://github.com/nsi-iff/fluidity/blob/master/fluidity/machine.py
https://github.com/kmmbvnr/django-fsm
https://github.com/thomasquintana/declarative-fsm

Is an FSM overkill?

Adding an FSM will be expensive. It gives us introspectability which enables better reporting to the user about what has happened and what will happen. It also enables better operations control (checking the state and progress of jobs and restarting). It gives us clear exception handling (exception states). It also provides us a clean place to attach a notification engine (call the notification engine upon entry and exit of most states: finished processX (success), starting processY(args)). By being declarative, it makes versioning more obvious.

The downsides to an FSM are the development time and several open questions: are there sufficiently capable packages available, can they handle the level of parallelism we need for fanout, and will the overhead of instantiating and recording jobs overly tax the system? Regarding the last, if the recording granularity roughly matches the notification granularity, then the recording is only a constant-factor increase over that mechanism. However, for jobs like bulk email, we probably don't want a notification on each transition, only on entry to exception states, so the proportionality would be different.
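
To make the bulk email case concrete, the sketch below tracks a fanned-out job with a few counters instead of one record per subtask, and notifies only on failures and on completion. `FanoutProgress` and its `notify` callable are hypothetical names, and a real implementation would need shared, persistent counters rather than in-process integers.

```python
class FanoutProgress:
    """Hypothetical tracker whose memory use does not grow with subtask count."""

    def __init__(self, total, notify):
        self.total = total
        self.succeeded = 0
        self.failed = 0
        self._notify = notify  # callable taking a single message string

    def record_success(self):
        # Successes only bump a counter; no per-subtask notification.
        self.succeeded += 1
        self._maybe_finish()

    def record_failure(self, detail):
        # Only entry to an exception state produces a per-subtask notification.
        self.failed += 1
        self._notify("subtask failed: {}".format(detail))
        self._maybe_finish()

    def _maybe_finish(self):
        # Send one summary notification when every subtask has reported.
        if self.succeeded + self.failed == self.total:
            self._notify("done: {} ok, {} failed".format(
                self.succeeded, self.failed))
```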

For now the FSM is tabled.

So, what is the alternative?

We will define a standard pattern for using celery along with storage and notifications. To whatever extent possible, we will autowire notifications into celery job states (overridable to prevent spamming on fine-grained tasks). We may try to add a thin interface over the celery functions so that we can replace them later with a separate FSM, but that's not a priority. The fact that we won't be using a declarative system makes upgrades a bit more difficult, but we'll implement a set of standards and a set of expected functions for handling job migration.
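
One way to picture the overridable autowiring is a decorator around the task function. This is a sketch in plain Python rather than celery's own hooks: `notifying_task`, its `enabled` flag, and the two example tasks are hypothetical, and the choice to report failures even when routine notifications are disabled is an assumption.

```python
import functools


def notifying_task(notify, enabled=True):
    """Wrap a task so its start, success, and failure are reported via `notify`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                notify(func.__name__, "started")
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Assumption: failures are reported even when routine
                # notifications are switched off for fine-grained tasks.
                notify(func.__name__, "failed")
                raise
            if enabled:
                notify(func.__name__, "succeeded")
            return result
        return wrapper
    return decorator


events = []  # stand-in for a real notification backend


def record(task_name, state):
    events.append((task_name, state))


@notifying_task(record)
def export_course(course_id):
    return "exported " + course_id


@notifying_task(record, enabled=False)  # fine-grained task: no routine spam
def send_one_email(address):
    raise RuntimeError("bounce")
```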

As part of the framework, we could define certain methods which teams should implement like cancel, status, list_jobs, find_outputs, restart_stopped_jobs. We could even create a pseudo-FSM reporting mechanism such as get_job_states which may or may not return a graph of state nodes (description, children).
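
The expected methods could be written down as an abstract base class. The sketch below uses the method names from the paragraph above; the class names, the argument lists, and the `StateNode` shape (description, children) are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateNode:
    """One node in the optional pseudo-FSM report."""
    description: str
    children: List["StateNode"] = field(default_factory=list)


class AsyncTaskDefinition(ABC):
    """Hypothetical contract that each team's task definition would implement."""

    @abstractmethod
    def cancel(self, job_id):
        """Request cancellation of a job."""

    @abstractmethod
    def status(self, job_id):
        """Return the current status of a job."""

    @abstractmethod
    def list_jobs(self, user):
        """Return the jobs visible to `user`."""

    @abstractmethod
    def find_outputs(self, job_id):
        """Return identifiers of any outputs the job produced."""

    @abstractmethod
    def restart_stopped_jobs(self):
        """Resume jobs that stopped without finishing (e.g., after a restart)."""

    def get_job_states(self) -> Optional[StateNode]:
        """Optionally return a graph of state nodes; None means not provided."""
        return None
```

Making `get_job_states` non-abstract with a `None` default reflects the "may or may not return a graph" wording: teams can opt in to state reporting without it blocking a first implementation.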

Rough API

This is a rough draft of the API functions (which can have restful analogs). Final specs will come out of each task as we begin or prepare for implementation.

Output storage

save(name, content, auth)
url(name)
get(name, credtoken)
delete(name, credtoken)
list(credtoken)
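
A minimal in-memory sketch of these five calls follows. It is not built on django-storages, and several details are assumptions: `auth` is treated as a collection of credential tokens allowed to access the item, the base URL is a placeholder, and `PermissionError` stands in for whatever error handling the real system chooses.

```python
class InMemoryOutputStorage:
    """Hypothetical in-memory backend for the rough output storage API."""

    def __init__(self, base_url="https://storage.example.invalid/outputs/"):
        self._items = {}  # name -> (content, set of authorized tokens)
        self._base_url = base_url

    def save(self, name, content, auth):
        # Assumption: `auth` is an iterable of tokens allowed to access `name`.
        self._items[name] = (content, set(auth))

    def url(self, name):
        # The URL only locates the item; knowing it grants no access.
        return self._base_url + name

    def _authorized_content(self, name, credtoken):
        content, allowed = self._items[name]  # KeyError if the name is unknown
        if credtoken not in allowed:
            raise PermissionError(name)
        return content

    def get(self, name, credtoken):
        return self._authorized_content(name, credtoken)

    def delete(self, name, credtoken):
        self._authorized_content(name, credtoken)  # authorization check only
        del self._items[name]

    def list(self, credtoken):
        # Only names the token is authorized for are visible.
        return sorted(name for name, (_, allowed) in self._items.items()
                      if credtoken in allowed)
```

Checking the token on `get`, `delete`, and `list` rather than relying on an unguessable URL addresses the URL-leakage concern raised in the engineering spec.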

What do we need for admin (devops) access and control?

Job control

I'm leaving job definition out for now until we decide whether to use an FSM

initiate(user, *args)
restart(job)
cancel(job)
list_jobs(user | auth)
status(job): [running | cancelled | aborted | success], [status_log | exceptions | output_id]
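
The job control calls can be sketched against an in-memory registry. This reads `status(job)` as returning a state plus a details record, which is only one interpretation of the rough signature. The `abort` and `succeed` methods are hypothetical worker-side hooks that are not part of the list above, and the cancel and restart semantics shown here are placeholders, since those semantics are still an open question.

```python
import itertools


class JobRegistry:
    """Hypothetical in-memory stand-in for the rough job control API."""

    RESTARTABLE = {"cancelled", "aborted"}

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def initiate(self, user, *args):
        job = next(self._ids)
        self._jobs[job] = {
            "user": user, "args": args, "state": "running",
            "status_log": ["initiated"], "exceptions": [], "output_id": None,
        }
        return job

    def _set_state(self, job, state):
        record = self._jobs[job]
        record["state"] = state
        record["status_log"].append(state)

    def cancel(self, job):
        # Placeholder semantics: only running jobs can be cancelled.
        if self._jobs[job]["state"] == "running":
            self._set_state(job, "cancelled")

    def abort(self, job, exc):
        # Hypothetical hook a worker would call on an unrecoverable error.
        self._jobs[job]["exceptions"].append(repr(exc))
        self._set_state(job, "aborted")

    def succeed(self, job, output_id=None):
        # Hypothetical hook a worker would call on successful completion.
        self._jobs[job]["output_id"] = output_id
        self._set_state(job, "success")

    def restart(self, job):
        if self._jobs[job]["state"] not in self.RESTARTABLE:
            raise ValueError("only cancelled or aborted jobs can restart")
        self._set_state(job, "running")

    def list_jobs(self, user):
        return [job for job, rec in self._jobs.items() if rec["user"] == user]

    def status(self, job):
        rec = self._jobs[job]
        details = {key: rec[key]
                   for key in ("status_log", "exceptions", "output_id")}
        return rec["state"], details
```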

Requirements

#	Title	User Story	Notes
2	Celery template	Developers must be able to create new job descriptors which declare their states and provide celery task code for each state	Must pick or instantiate a worker (should allow eventual remoting) Should require only a minimal subset of edx-platform (or none). Should show users how to include what they need of edx-platform without pulling all of it in.
3	Minimal Notification framework	The system must notify users when jobs complete or abort	Could be triggered by celery state transitions
4	State tracking via notifications	Users must be able to see where each job is in its sequence of tasks as well as all exceptions	Ditto re triggering
5	Results storage	Users must be able to securely and efficiently retrieve their job outputs (if any)	Pluggable storage with an authorization model for access perms. First version could delete upon download but should be built with the expectation that we'll add output lifecycle at some point.
6	Upgradability	Building on celery template, the system should define how to upgrade job definitions with one version backward compatibility
7	Create hello world skeleton app	As a developer, I want a sample task definition showing how to build my application and do updates
8	Refactor import to new async pattern	Import should use the new async pattern
9	Refactor export to new async pattern	Export should use the new async pattern

User interaction and design

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question	Outcome
Do we need a storage system for uploads (e.g., staging the tar.gz for import)?
Are there performance issues we need to worry about?
How do instructor_task and the analytics pipeline inform or integrate with this design?
Where does the server service live? lms, cms, another app?
How do we configure the workers so they don't have all of edx-platform?
How to integrate launchers into dashboards?
Semantics/expectations for cancel and restart
