Target release: Release name or number
Epic:
Document status:
Document owner: dmitchellR (Deactivated)
Designer: Lead designer
Developers: Lead developer
QA: Lead tester
Legend: each goal is tagged (H | M | L) to show its priority. H means must-have for the MVP. M is borderline between MVP and later. L is needed for a fully functional system but not for the MVP.

Goals

Glossary

asynchronous task is a request handler that does not block the user while it runs. The user may navigate to other pages or log out without losing the ability to retrieve results later.

job output is the optional file or files produced by an asynchronous task. These include reports, tarballs, images, etc.

job is a specific run of an asynchronous task.

job information is the record of who submitted the job, when, and with what parameters, plus the status of each step in the job.
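The glossary terms above could map onto a record like the following. This is only a sketch to make the terms concrete: the class names, field names, and status strings are all assumptions and not a decided schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class StepStatus:
    """Status of one step in a job (hypothetical shape)."""
    name: str
    status: str = "pending"  # assumed values: pending, running, succeeded, failed


@dataclass
class JobInfo:
    """Job information: who submitted the job, when, with what parameters,
    and the status of each step."""
    job_id: str
    task_name: str          # which asynchronous task this job is a run of
    submitted_by: str
    parameters: Dict[str, Any]
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    steps: List[StepStatus] = field(default_factory=list)
    output_files: List[str] = field(default_factory=list)  # optional job output

    def set_step_status(self, name: str, status: str) -> None:
        """Update the status of the named step, or raise if it is unknown."""
        for step in self.steps:
            if step.name == name:
                step.status = status
                return
        raise KeyError(f"unknown step: {name}")


# Example: one run (job) of a hypothetical export task.
job = JobInfo(
    job_id="job-1",
    task_name="course_export",
    submitted_by="staff_user",
    parameters={"course_id": "demo"},
    steps=[StepStatus("build_tarball"), StepStatus("upload")],
)
job.set_step_status("build_tarball", "succeeded")
```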

Background and strategic fit

Existing Problems:

Existing asynchronous tasks:

High level engineering spec

Future possible extensions

Finite State Machine

There are two flavors: either the FSM controls the flow, or it merely tracks the flow and provides introspectability. The state granularity should be the finest we need for state tracking and restart, but no finer. Ideally, we'll be able to make each transition a separate celery task or remote service call. Some transitions may be calls to external services that do not go through celery. The catch comes if we need to report states at a finer granularity than makes sense as separately queued tasks (e.g., because one step sets up in-memory structures consumed by the next step, so both must run in the same request on the same worker).
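To illustrate the tracking-only flavor, here is a minimal sketch in plain Python. The state names and the transition table are hypothetical, loosely modeled on an import job; the job code does the real work and only reports transitions to the FSM, which validates and records them.

```python
class TrackingFSM:
    """Tracking-only flavor: records transitions but does not run the work."""

    # Hypothetical transition table; a real job would declare its own.
    TRANSITIONS = {
        "submitted": {"unpacking", "failed"},
        "unpacking": {"importing", "failed"},
        "importing": {"succeeded", "failed"},
        "failed": {"submitted"},  # restart goes back to the beginning
        "succeeded": set(),
    }

    def __init__(self, initial="submitted"):
        self.state = initial
        self.history = [initial]  # basis for reporting what has happened

    def next_states(self):
        """States the job may legally enter next."""
        return sorted(self.TRANSITIONS[self.state])

    def advance(self, new_state):
        """Record a transition, rejecting any not in the table."""
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state} -> {new_state}"
            )
        self.state = new_state
        self.history.append(new_state)


# The job code reports progress as it goes.
fsm = TrackingFSM()
fsm.advance("unpacking")
fsm.advance("importing")
```

In the controlling flavor, the FSM itself would dispatch the work for each state (e.g., by queuing a task) instead of waiting to be told about it.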

The FSM package must

There may be solutions other than an FSM (e.g., Petri nets, task flows).

Is an FSM overkill?

Adding an FSM will be expensive. It gives us introspectability which enables better reporting to the user about what has happened and what will happen. It also enables better operations control (checking the state and progress of jobs and restarting). It gives us clear exception handling (exception states). It also provides us a clean place to attach a notification engine (call the notification engine upon entry and exit of most states: finished processX (success), starting processY(args)). By being declarative, it makes versioning more obvious.

The downsides of an FSM are the development time and several open questions: whether sufficiently capable packages are available, whether they can handle the level of parallelism we need for fanout, and whether the overhead of instantiating and recording jobs will overly tax the system. Regarding the last question, if recording happens at roughly the same granularity as notification, then it adds only a constant-factor cost to that mechanism. However, for jobs like bulk email, we probably don't want a notification on each transition, only on entry to exception states, so the proportionality would be different.

For now the FSM is tabled.

So, what is the alternative?

We will define a standard pattern for using celery along with storage and notifications. To whatever extent possible, we will autowire notifications into celery job states (overridable to prevent spamming on fine-grained tasks). We may try to add a thin interface over the celery functions so that we can replace them later with a separate FSM, but that's not a priority. The fact that we won't be using a declarative system makes upgrades a bit more difficult, but we'll implement a set of standards and a set of expected functions for handling job migration.
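As a rough illustration of autowired but overridable notifications, the sketch below uses a plain Python decorator as a stand-in. It does not use celery's API; in the real pattern the hooks would presumably attach to celery's task state changes. The names with_notifications, record, export_course, and send_one_email are all hypothetical.

```python
import functools

# By default every state change notifies; callers may narrow this.
DEFAULT_EVENTS = frozenset({"started", "succeeded", "failed"})


def with_notifications(notify, events=DEFAULT_EVENTS):
    """Wrap a job function so its state changes call
    notify(event, job_name, detail) for each event in `events`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def emit(event, detail=None):
                if event in events:
                    notify(event, func.__name__, detail)

            emit("started")
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                emit("failed", repr(exc))
                raise
            emit("succeeded")
            return result
        return wrapper
    return decorator


sent = []  # stand-in for a real notification engine


def record(event, job_name, detail):
    sent.append((event, job_name))


@with_notifications(record)
def export_course(course_id):
    return f"{course_id}.tar.gz"


# Fine-grained task: override the default so only failures notify.
@with_notifications(record, events={"failed"})
def send_one_email(address):
    if "@" not in address:
        raise ValueError("bad address")
    return True


export_course("demo")  # records "started" and "succeeded"
```

The override on send_one_email is the anti-spam case: a bulk email job would otherwise generate notifications for every message sent.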

As part of the framework, we could define certain methods which teams should implement like cancel, status, list_jobs, find_outputs, restart_stopped_jobs. We could even create a pseudo-FSM reporting mechanism such as get_job_states which may or may not return a graph of state nodes (description, children).
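The method names above could be collected into an abstract base class along these lines. The signatures, return types, and the toy in-memory implementation are assumptions made for illustration; in particular, the cancel and restart behavior shown is a placeholder, since those semantics are still an open question.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateNode:
    """One node of the pseudo-FSM report: (description, children)."""
    description: str
    children: List["StateNode"] = field(default_factory=list)


class AsyncJobFramework(ABC):
    """Methods each team would implement for its task (signatures assumed)."""

    @abstractmethod
    def status(self, job_id: str) -> str: ...

    @abstractmethod
    def cancel(self, job_id: str) -> bool: ...

    @abstractmethod
    def list_jobs(self, user: str) -> List[str]: ...

    @abstractmethod
    def find_outputs(self, job_id: str) -> List[str]: ...

    @abstractmethod
    def restart_stopped_jobs(self) -> List[str]: ...

    def get_job_states(self, job_id: str) -> Optional[StateNode]:
        """Optional report; returning None means no graph is available."""
        return None


class InMemoryJobs(AsyncJobFramework):
    """Toy implementation that keeps job records in a dict."""

    def __init__(self):
        self._jobs = {}  # job_id -> {"user", "status", "outputs"}

    def submit(self, job_id, user):
        self._jobs[job_id] = {"user": user, "status": "running", "outputs": []}

    def status(self, job_id):
        return self._jobs[job_id]["status"]

    def cancel(self, job_id):
        job = self._jobs[job_id]
        if job["status"] != "running":
            return False
        job["status"] = "stopped"
        return True

    def list_jobs(self, user):
        return sorted(j for j, rec in self._jobs.items() if rec["user"] == user)

    def find_outputs(self, job_id):
        return list(self._jobs[job_id]["outputs"])

    def restart_stopped_jobs(self):
        restarted = []
        for job_id, rec in self._jobs.items():
            if rec["status"] == "stopped":
                rec["status"] = "running"
                restarted.append(job_id)
        return restarted

    def get_job_states(self, job_id):
        return StateNode("export", [StateNode("build tarball"), StateNode("upload")])
```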

Rough API

This is a rough draft of the API functions (which can have RESTful analogs). Final specs will come out of each task as we begin or prepare for implementation.

Output storage

What do we need for admin (devops) access and control?

Job control

I'm leaving job definition out for now until we decide whether to use an FSM

 

Requirements

# | Title | User Story | Importance | Notes
2 | Celery template | Developers must be able to create new job descriptors which declare their states and provide celery task code for each state | | Must pick or instantiate a worker (should allow eventual remoting). Should require only a minimal subset of edx-platform (or none). Should show users how to include what they need of edx-platform without pulling all of it in.
3 | Minimal Notification framework | The system must notify users when jobs complete or abort | | Could be triggered by celery state transitions
4 | State tracking via notifications | Users must be able to see where each job is in its sequence of tasks as well as all exceptions | | Ditto re triggering
5 | Results storage | Users must be able to securely and efficiently retrieve their job outputs (if any) | | Pluggable storage with an authorization model for access perms. First version could delete upon download but should be built with the expectation that we'll add output lifecycle at some point.
6 | Upgradability | Building on celery template, the system should define how to upgrade job definitions with one version backward compatibility | |
7 | Create hello world skeleton app | As a developer, I want a sample task definition showing how to build my application and do updates | |
8 | Refactor import to new async pattern | Import should use the new async pattern | |
9 | Refactor export to new async pattern | Export should use the new async pattern | |
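For the results storage requirement (pluggable storage, an authorization model for access, and delete upon download in the first version), a minimal in-memory sketch might look like this. The OutputStore class and its method names are hypothetical; a real backend would sit behind the same kind of interface, and the delete_on_download flag is one possible hook for a later output lifecycle.

```python
from typing import Dict, Set


class OutputStore:
    """In-memory stand-in for a pluggable storage backend (hypothetical API)."""

    def __init__(self, delete_on_download: bool = True):
        self._blobs: Dict[str, bytes] = {}
        self._readers: Dict[str, Set[str]] = {}  # key -> users allowed to read
        self.delete_on_download = delete_on_download

    def put(self, key: str, data: bytes, owner: str) -> None:
        """Store a job output; initially only the owner may read it."""
        self._blobs[key] = data
        self._readers[key] = {owner}

    def grant(self, key: str, user: str) -> None:
        """Allow another user to download the output."""
        self._readers[key].add(user)

    def download(self, key: str, user: str) -> bytes:
        """Return the output if the user is authorized.

        With delete_on_download set, the output is removed after the
        first successful download.
        """
        if key not in self._blobs:
            raise KeyError(key)
        if user not in self._readers[key]:
            raise PermissionError(f"{user} may not read {key}")
        data = self._blobs[key]
        if self.delete_on_download:
            del self._blobs[key]
            del self._readers[key]
        return data


store = OutputStore()
store.put("job-1/export.tar.gz", b"tarball bytes", owner="alice")
```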

User interaction and design

Include any mockups, diagrams or visual designs relating to these requirements.

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome
Do we need a storage system for uploads (e.g., staging the tar.gz for import)? | Communicate the decision reached
Are there performance issues we need to worry about? |
How do instructor_task & analytics pipeline inform or integrate with this design |
Where does the server service live? lms, cms, another app? |
How do we configure the workers so they don't have all of edx-platform? |
How to integrate launchers into dashboards? |
Semantics/expectations for cancel and restart |

Not Doing