Asynchronous Order Fulfillment

Motivation

In Otto, order fulfillment currently works synchronously, triggered by a Django signal that is fired upon the completion of an order.  The following are motivations for making fulfillment work asynchronously instead:

  1. Performance: when servicing requests for free products (e.g. honor mode seats), the user waits not only while Otto records the completion of the order, but also for the additional server-to-server requests involved in fulfilling the order.  From the standpoint of user-perceived latency this is suboptimal; from the perspective of the system it significantly reduces throughput, since it ties up processes waiting for responses from external services when they could otherwise be servicing incoming requests.
  2. Robustness: under the synchronous model, attempts to fulfill orders either succeed or fail, but there is no support for retrying failed attempts.  In the case of immediate fulfillment (i.e. honor mode seats) the user could be made aware of the error and encouraged to retry their selection later.  However, when order completion happens offline (e.g. callback from payment provider), failed attempts to fulfill the order will simply leave the order in a failed state.
  3. Operational Sanity: at present, users interact with Otto via the LMS; the synchronous enrollment API calls that happen during fulfillment in Otto call back to the same LMS (a bit more info here).  These circular, blocking calls expose the integrated system to a scenario in which degradation or failure on one side may lead to slowdowns or deadlocks on both sides.

To summarize, adding asynchronous fulfillment capability will give us scalable capacity to service orders, insulated from the effects of transient failures (network outages, maintenance windows, service disruptions).  It will also better equip us to withstand sudden spikes in request volume and to keep their effects away from end users and from other services within the integrated system.

Recommendation

Summary:

  • add celery to the ecommerce stack.
  • provision a durable message queue and worker processes for processing order fulfillments as celery tasks.
  • introduce a 'pending' state to the LMS's notion of course enrollments.  This is intended to facilitate a better user experience under potential side effects introduced by asynchronous processing (see below).
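
As an illustration of the retry behavior a queue-and-worker arrangement makes possible, here is a minimal sketch using only the Python standard library.  It is not celery code: the in-process queue is not durable, and the names TransientFulfillmentError and process_queue are hypothetical.  In a real deployment the broker would hold the messages, and celery's own task retry mechanism would presumably take the place of the loop below.

```python
import queue
import time


class TransientFulfillmentError(Exception):
    """Hypothetical: an external service was temporarily unavailable."""


def process_queue(tasks, fulfill, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Drain `tasks`, retrying transient failures with exponential backoff.

    `tasks` holds (order_number, attempt) pairs.  Returns two lists of
    order numbers: (fulfilled, failed).
    """
    fulfilled, failed = [], []
    while True:
        try:
            order_number, attempt = tasks.get_nowait()
        except queue.Empty:
            break
        try:
            fulfill(order_number)
        except TransientFulfillmentError:
            if attempt < max_retries:
                # Back off, then put the order at the end of the queue.
                sleep(base_delay * 2 ** attempt)
                tasks.put((order_number, attempt + 1))
            else:
                # Retries exhausted: leave the order for manual follow-up.
                failed.append(order_number)
        else:
            fulfilled.append(order_number)
    return fulfilled, failed
```

The point of the sketch is that a failed attempt no longer decides the final state of the order: the order is re-queued and only marked as failed once the retry budget is used up.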

Considerations

Development / Testing Impact

Changing the call/response interaction would have a drastic effect on existing tests and development workflows.  This impact can be largely neutralized:

  • celery tasks can be configured per-runtime to execute in a blocking fashion (documentation).  This setting would be used in non-deployment scenarios, avoiding the need for provisioning queues or worker processes on developer laptops, and should preserve the vast majority of existing behavior of the system.
  • existing fulfillment code does not need to change, so existing unit tests can still execute that code synchronously.  Fulfillment can define some new celery task wrappers that simply delegate to the existing functions, and these can be invoked when orders are completed.
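
The two points above can be sketched as follows.  This is a standard-library stand-in, not celery's API: ALWAYS_EAGER plays the role of celery's eager-execution setting (task_always_eager, formerly CELERY_ALWAYS_EAGER), and fulfill_order, fulfill_order_task, and run_worker_once are hypothetical names.  With celery, the wrapper would instead be a task-decorated function and the queue would be the broker.

```python
import queue

# Hypothetical per-runtime switch, standing in for celery's eager-mode setting.
ALWAYS_EAGER = True

# In-process stand-in for the message queue.
task_queue = queue.Queue()


def fulfill_order(order_number):
    """Stand-in for the existing, unchanged synchronous fulfillment function."""
    return "fulfilled:{}".format(order_number)


def fulfill_order_task(order_number, eager=None):
    """Thin wrapper: run inline when eager, otherwise enqueue for a worker."""
    if eager is None:
        eager = ALWAYS_EAGER
    if eager:
        # Developer and unit-test runtimes: same behavior as today.
        return fulfill_order(order_number)
    # Deployed runtimes: return immediately and let a worker do the work.
    task_queue.put(order_number)
    return None


def run_worker_once():
    """What a worker would do: take one order and call the same function."""
    return fulfill_order(task_queue.get_nowait())
```

Because the wrapper only delegates, unit tests can keep calling fulfill_order directly, and only the code that reacts to order completion needs to call fulfill_order_task.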

That said, it may be necessary to make changes to the acceptance tests we currently use to validate staging deployments.

Operational Impact

In edx.org and "fullstack" deployments there is already queue and worker infrastructure in place to service celery tasks owned by other parts of the system.  We are optimistically assuming that ecommerce requirements can be met by adding configuration to this existing infrastructure; that assumption has not yet been verified.

One particular impact to be prepared for is the increased complexity of troubleshooting / debugging.  It is an unfortunate consequence of asynchronous processing that bugs and misconfigurations are more difficult to track down, since there may be no user-facing interaction or feedback to help diagnose mysterious behavior.  To mitigate, we must ensure that logging, instrumentation, and monitoring are sufficiently comprehensive and verbose.
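
One way to get consistent log coverage is to wrap every fulfillment task in a decorator that records the start, outcome, and duration of each attempt.  The following is a minimal sketch using the standard logging module; the logger name and the logged_task decorator are hypothetical, and the log format is only a suggestion.

```python
import functools
import logging
import time

# Hypothetical logger name; the real project may organize loggers differently.
logger = logging.getLogger("ecommerce.fulfillment")


def logged_task(func):
    """Log start, success, and failure (with traceback) of a fulfillment task."""
    @functools.wraps(func)
    def wrapper(order_number, *args, **kwargs):
        started = time.monotonic()
        logger.info("fulfillment started order=%s task=%s",
                    order_number, func.__name__)
        try:
            result = func(order_number, *args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and attaches the traceback.
            logger.exception("fulfillment failed order=%s task=%s elapsed=%.3fs",
                             order_number, func.__name__,
                             time.monotonic() - started)
            raise
        logger.info("fulfillment succeeded order=%s task=%s elapsed=%.3fs",
                    order_number, func.__name__, time.monotonic() - started)
        return result
    return wrapper
```

Including the order number in every message means a single search of the worker logs can reconstruct what happened to a given order, even though the user never saw an error.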

User Impact

Response times perceived by users in their interactions with Otto should be shorter after this change, and Otto will have more capacity available to service high volumes of requests.

On the downside, however, this change will introduce the possibility that a user may visit their dashboard after successfully completing a transaction via Otto but before their order (course enrollment) has been fulfilled.  This is particularly likely with Otto checkouts involving honor-mode (free) seats, since the user will be instantly redirected to their dashboard on completion of the order, and there's a good chance the fulfillment task will not yet have successfully completed before that time.

To mitigate this problem, we should start to create enrollments with a new status of 'pending' when registration happens via the LMS, and have the enrollment API clear this status when the user's enrollment is made active during fulfillment.  This will permit us to display the enrollment immediately on the dashboard, with the link into the course disabled and accompanied by an indication that the seat is not yet available.  This will reassure the user that their request has been acknowledged by the system and is in progress.  While the user is viewing their dashboard, a script running in the browser would check the server frequently for updates to the pending status, which would generally be expected to become active within a second or two; the client could then refresh the display in place and permit the user to access the fully active enrollment.
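
The pending-to-active flow described above can be sketched as follows.  This is an illustration only: the class, field, and function names are hypothetical and do not reflect the LMS's actual enrollment model or API, and the polling loop is written in Python purely to keep the examples in one language (in practice it would be the browser script).

```python
import time
from dataclasses import dataclass
from enum import Enum


class EnrollmentStatus(Enum):
    PENDING = "pending"
    ACTIVE = "active"


@dataclass
class Enrollment:
    username: str
    course_id: str
    status: EnrollmentStatus = EnrollmentStatus.PENDING

    def activate(self):
        """Called when fulfillment makes the enrollment active."""
        self.status = EnrollmentStatus.ACTIVE

    def dashboard_entry(self):
        """Data the dashboard needs: the course link stays disabled while pending."""
        return {
            "course_id": self.course_id,
            "status": self.status.value,
            "course_link_enabled": self.status is EnrollmentStatus.ACTIVE,
        }


def poll_until_active(fetch_status, max_polls=10, interval=0.5, sleep=time.sleep):
    """Client-side loop: ask the server for the status until it is active.

    Returns True once active, or False if `max_polls` checks all saw pending.
    """
    for _ in range(max_polls):
        if fetch_status() == EnrollmentStatus.ACTIVE.value:
            return True
        sleep(interval)
    return False
```

Bounding the number of polls is a deliberate choice in this sketch: if fulfillment is delayed well beyond the expected second or two, the client stops asking and the dashboard simply continues to show the enrollment as pending.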