...

  • Develop a base pattern & framework for asynchronous tasks to use
    • An asynchronous task is one that does not block the user while it runs. The user may navigate to other pages or log out without losing the ability to retrieve results. (H)
      • The pattern must provide a common way to report failures. (H)
      • The pattern must provide a common way for the user to find the job output, that is, any resulting files (reports, tarballs, etc.). (H)
      • The system should use the same notification mechanism to report status and results as the system uses for other asynchronous notifications. (H)
      • The user must be able to check or be notified about (pull vs. push) the status of the task. (M)
      • The pattern must handle massive fanout such as one job per student without falling over while still supporting restart, upgrade, failover, etc. (M)
      • The reusable pattern must provide a common way to declare the states and state transitions of jobs so that GUIs can show the user what steps will occur as well as which have occurred along with dependencies. (M, need at least pending, in progress, complete, and failed/canceled)
      • The pattern should provide a common way to handle retries or alternative processes on exceptions. (L)
      • The system should provide user control for canceling jobs: (L)
        • Canceling some processes may leave the course in a broken state, depending on the modulestore. We'll need to decide whether such processes should be refactored.
          • Canceling an import in split will work correctly, but in old mongo it will have made some but not all changes.
    • The platform should not block on queued jobs and could possibly run the job remotely (H except remoting)
    • Jobs should be able to output reports, zip files, or other assets using a common storage, naming, and auth system (H)
    • The platform should allow devops or developers to add or configure other storage backends without changing any asynchronous task code (H)
    • We should decide on a retention policy for historical job information and perhaps make it configurable (job submission data & the finite state machine transition records) (H)
    • The storage system for job output:
      • Job output should associate w/ the job information (H)
      • Course teams should be able to delete job outputs (H)
      • Only authorized users should be able to access job output (H)
      • Course team members should be able to find job output from other team members or past jobs (perhaps configurable per job or by job type?) (L)
      • Task definition developers should be able to indicate whether the output is use-once and should be deleted upon download, or should be kept until the course ends, until a specific date, or indefinitely (L)
      • The storage system should allow append so that small subjobs can append results to a base file created by the parent process (M)
    • The user should be able to retrieve large results without streaming them through the app server (the app should use a storage system accessible to the webserver w/ appropriate auth) (M, piggyback on assets approach for storage only)
    • Asynchronous task definitions (the code defining the task) should be in separate repos without requiring all of edx-platform
      • Asynchronous task definitions should be separately deployable (deployable w/o shutting down the platform) (L)
      • There should be a sound method to upgrade task definitions without breaking queued or running jobs (H)
    • The system should include manual job invocation and status checking via command line (L)
    • The framework should include testing and debugging documentation or aids (_)
  • Implement an example using this pattern which others can easily copy to adapt to their use (_)
  • Implement example for massively parallel subtasks (like bulk email) which iterate over some cursor
  • Implement this pattern for import and export (H)
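The requirement above to declare job states and transitions up front could be sketched as follows. This is only an illustration under stated assumptions, not the framework's design: the names `JobState`, `TRANSITIONS`, `Job`, and `InvalidTransition` are hypothetical, and the only states taken from the requirements are pending, in progress, complete, failed, and canceled.

```python
from enum import Enum


class JobState(Enum):
    """The minimum set of states named in the requirements."""
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    FAILED = "failed"
    CANCELED = "canceled"


# Hypothetical transition table. Declaring it as data lets a GUI show
# which steps can still occur before the job has run them.
TRANSITIONS = {
    JobState.PENDING: {JobState.IN_PROGRESS, JobState.CANCELED},
    JobState.IN_PROGRESS: {JobState.COMPLETE, JobState.FAILED, JobState.CANCELED},
    JobState.COMPLETE: set(),
    JobState.FAILED: set(),
    JobState.CANCELED: set(),
}


class InvalidTransition(Exception):
    """Raised when a job attempts a transition that was not declared."""


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = JobState.PENDING
        # The transition record; a retention policy would apply to this.
        self.history = [JobState.PENDING]

    def advance(self, new_state):
        """Move to new_state only if the transition table allows it."""
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)

    def possible_next_states(self):
        """States a GUI could display as still reachable in one step."""
        return sorted(state.value for state in TRANSITIONS[self.state])
```

A caller would create a `Job`, call `advance(JobState.IN_PROGRESS)` when a worker picks it up, and read `history` and `possible_next_states()` to render what has occurred and what may occur next. Retries and job dependencies are deliberately left out of this sketch.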

...

  • Jobs will use celery and/or remote services. The existing framework for instructor_task may be a good starting point. Evaluating its fit will be part of the project plan.
  • We will develop a pattern of use which allows independent deployability, remoting, and upgrade in place. Not all of these must exist in the first version, just be possible.
    • Upgrade: use a queue per release, each with its own workers. Once a queue is empty, the system kills its workers.
  • We will use celery and add-ons for state introspection and high availability
  • We will define an interface for notifications & result delivery allowing multiple back-ends
    • We will define an auto-notifier on process initiation and termination which jobs can disable (to prevent flooding for highly decomposed jobs)
  • We will define a mechanism for handling fine-grained highly decomposed jobs so that they do not require proportional memory for tracking (e.g., bulk email w/ one job per recipient)
  • We may define a thin interface over celery and its add-ons so that we can replace celery with something which can also handle remote services (FSM)
  • We will need to come up with a versioning mechanism for job workers w/ some level of backward compatibility (capable of handling jobs for the most recent previous version as well as the current version). (could be later release but design should be aware of this future requirement)
  • Storage: we will use django-storages, but we need to figure out how to meet all of the above functional requirements on top of the storage. Ensuring proper authorization is crucial (unauthorized users must not be able to find and download results just because of URL leakage or phishing). I don't see anything in django-storages for handling auth.
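One possible way to keep a leaked download URL from being usable by someone else is to sign it for a specific user, job, and expiry time, and to verify the signature against the authenticated requester. The sketch below is an assumption for illustration only, not a decision recorded on this page: `SECRET_KEY`, `sign_download`, and `may_download` are hypothetical names, and it uses only the Python standard library rather than any django-storages feature.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret; a real deployment would load this from
# configuration and never hard-code it.
SECRET_KEY = b"replace-me"


def sign_download(user_id, job_id, expires_at, key=SECRET_KEY):
    """Return a signature binding one user to one job's output until expires_at.

    Assumes user_id and job_id contain no ':' so the message is unambiguous.
    """
    message = f"{user_id}:{job_id}:{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def may_download(user_id, job_id, expires_at, signature, now=None, key=SECRET_KEY):
    """Check a signature for the *authenticated* requester.

    user_id must come from the session, not from the URL, so a URL leaked to
    another user fails verification.
    """
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = sign_download(user_id, job_id, expires_at, key)
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected, signature)
```

The app server would still need to decide who is allowed to receive a signed link in the first place (for example, course team membership); this sketch only covers checking the link once it has been issued.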

...

Question | Outcome
Do we need a storage system for uploads (e.g., staging the tar.gz for import)? |
Are there performance issues we need to worry about? |
How do instructor_task & analytics pipeline inform or integrate with this design? |
Where does the server service live? lms, cms, another app? |
How do we configure the workers so they don't have all of edx-platform? |
How to integrate launchers into dashboards? |
Semantics/expectations for cancel and restart |

Not Doing