2015.09.09 Asynchronous Task Processing

Asynchronous Task Processing Architecture V2

In room: Clinton, Nimisha, Ed, Mark, Zach, Miki, Joel, Renzo, Ned.

On hangout: Cale, Dave, Felipe, Jim, Peter Pinch.

Action Items:

Renzo Lucioni: specify task naming
Renzo Lucioni: clarify that workers are per-IDA

 

  • What does ecomm need right now?

    • make order fulfillment more robust

      • retries in the case of system failure (modulestore down)

    • devops wants a way to run async tasks without pushing the whole repo to another worker: a lightweight way to do async

    • devops wants versioning so that new code can be pushed without fear

      • if we make these tasks much smaller, do we have the same concerns about version changes?

      • version skew was a concern that motivated some of the new design, and the new design definitely helps with it.

    • performance: fulfillment is a bottleneck, so ecomm wants to make it async

  • Concerns with the proposal:

    • dependency conflicts: the proposal has all the tasks running in one virtualenv, so they have to agree on dependencies

    • use of celery:

      • celery doesn't have a pub-sub notion

      • but this doesn't feel like a pressing concern

      • replacing celery feels out of scope

      • miki: fulfillment seems like an area that will keep growing

        • do we need pubsub to deal with it?

        • jim: the next six months don't need it

        • dave: pubsub could be added afterward

    • mixed workload:

      • could big jobs starve little jobs?

      • ed can imagine having a queue per task

        • ballpark: we'll have a few dozen kinds of tasks?

  • Dependency conflicts

    • if the tasks are API-oriented, then the set of requirements will be tiny

      • cale: fundamental worry

        • IDAs are meant to be independent

        • Should allow teams to work independently

        • Now their tasks have to agree on dependencies

    • Clarification: this proposal was intended to cover all edX async tasks eventually

    • tasks within a team don't need isolation

      • the team can coordinate requirements

    • If devops is OK with different workers for different teams, then that lessens the conflicts to manageable levels

    • Worker pool per IDA is OK with everyone

  • Does ecomm want to be able to deploy individual tasks without deploying all of ecomm?

    • Deploying tasks independent of the front end makes sense

    • Deploying task A separate from task B isn't needed

  • Versioning

    • is it enough to make a new task when the version changes?

      • ed is worried that task names will become gross, and wants a convention

      • action item (Renzo): specify how to name tasks to deal with versions
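
One way the "new task per version" idea could look, as a rough sketch only. The `<ida>.<task>.v<N>` pattern, the `REGISTRY` dict, and the `fulfill_order_*` functions are all hypothetical illustrations; the meeting did not settle on a convention.

```python
import re

# Hypothetical naming pattern: "<ida>.<task>.v<N>", e.g. "ecommerce.fulfill_order.v2".
TASK_NAME = re.compile(
    r"^(?P<ida>[a-z_]+)\.(?P<task>[a-z_]+)\.v(?P<version>[1-9][0-9]*)$"
)

REGISTRY = {}  # task name -> callable; a stand-in for a real task registry


def make_task_name(ida, task, version):
    """Build a versioned task name and reject anything off-pattern."""
    name = f"{ida}.{task}.v{version}"
    if not TASK_NAME.match(name):
        raise ValueError(f"bad task name: {name!r}")
    return name


def parse_task_name(name):
    """Split a versioned task name back into (ida, task, version)."""
    match = TASK_NAME.match(name)
    if match is None:
        raise ValueError(f"bad task name: {name!r}")
    return match["ida"], match["task"], int(match["version"])


def register(ida, task, version):
    """Decorator that files a function under its versioned name."""
    def decorator(func):
        REGISTRY[make_task_name(ida, task, version)] = func
        return func
    return decorator


# Old and new versions live side by side, so messages queued under v1
# can still be handled after v2 has been deployed.
@register("ecommerce", "fulfill_order", 1)
def fulfill_order_v1(order_id):
    return {"order_id": order_id}


@register("ecommerce", "fulfill_order", 2)
def fulfill_order_v2(order_id, order_version):
    return {"order_id": order_id, "order_version": order_version}
```

Keeping the version as a fixed suffix means the name stays machine-parseable, which may help with the worry about names becoming gross.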

  • What data is passed to the task?

    • Pass ids of objects, not the objects themselves.

    • What if the object changes before the task runs?

      • Should this be decided universally? Or case-by-case?

    • Passing values means that you can detect if the data has changed in the meantime.

    • Passing values also makes debugging easier

    • Passing values makes idempotency harder

    • Passing a version number with a reference can make things easier.
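
A minimal sketch of the "reference plus version number" option, assuming the object carries a version counter that is bumped on every change. `Order`, `ORDERS`, `enqueue_payload`, `load_for_task`, and `StaleReference` are hypothetical names made up for illustration, and the dict stands in for a real datastore.

```python
from dataclasses import dataclass


class StaleReference(Exception):
    """The object changed between enqueue time and run time."""


@dataclass
class Order:
    id: int
    version: int  # assumed to be incremented on every change
    status: str


# In-memory stand-in for the real datastore.
ORDERS = {42: Order(id=42, version=3, status="paid")}


def enqueue_payload(order):
    """Pass only the id and the version seen at enqueue time."""
    return {"order_id": order.id, "order_version": order.version}


def load_for_task(payload):
    """Re-read the object in the task and detect changes in the meantime."""
    order = ORDERS[payload["order_id"]]
    if order.version != payload["order_version"]:
        raise StaleReference(
            f"order {order.id} is at version {order.version}, "
            f"task was queued for version {payload['order_version']}"
        )
    return order
```

What to do on `StaleReference` (fail, retry with fresh data, or proceed anyway) is the universal-versus-case-by-case question raised above; the sketch only shows the detection.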

  • Tasks should be idempotent

  • Debugging

    • multi-machine configuration makes debugging hard

    • The proposal includes running tasks in-process for development.

      • not enough: should also consider debugging in "more real" environments.

  • Operational monitoring:

    • Proactive alerts: a queue filling up needs to raise alarms

    • this is part of an overall monitoring scheme

    • currently, ecomm relies on Splunk. Is celery-flower the new thing?

    • RabbitMQ is also monitored now, via Splunk

    • everything must be monitored

  • Error recovery

    • what if a worker drains a queue, but fails everything? What will retry those tasks?

      • now we have a manual process to replay orders

      • that would stay the same

    • tasks would be responsible for retrying
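
To illustrate how task-owned retries and idempotency fit together, here is a rough sketch in plain Python. It deliberately avoids Celery's own retry support so that it stays self-contained; `TransientError`, `run_with_retries`, `make_fulfill_order`, and the `FULFILLED` set are hypothetical, and the set stands in for a persistent record of completed work.

```python
import time


class TransientError(Exception):
    """A failure worth retrying, e.g. a dependency being down."""


FULFILLED = set()  # stand-in for a persistent record of fulfilled orders


def run_with_retries(task, *args, max_retries=3, base_delay=0.0, sleep=time.sleep):
    """Run task(*args), retrying on TransientError with exponential backoff."""
    attempt = 0
    while True:
        try:
            return task(*args)
        except TransientError:
            if attempt >= max_retries:
                raise  # give up; a manual replay would take over from here
            sleep(base_delay * (2 ** attempt))
            attempt += 1


def make_fulfill_order(enroll):
    """Build an idempotent fulfillment task around an enroll(order_id) call."""
    def fulfill_order(order_id):
        if order_id in FULFILLED:
            return "already fulfilled"  # a repeat run is a no-op
        enroll(order_id)  # may raise TransientError
        FULFILLED.add(order_id)
        return "fulfilled"
    return fulfill_order
```

Because the task checks for completed work before acting, a retry or a manual replay of the same order does not fulfill it twice.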

Renzo will update the document, and implementation can begin.