...

TLDR: Cap node count if possible, monitor collect size by transform.

Even with no micro-optimization, the transform side delivers the large course in ~1.5s. Things we should consider:

...

Regardless of whether we go for a permanent storage or cache-only strategy, we will still be using celery tasks to run collect asynchronously in response to publish events during the normal course of operation (collects are expensive). The permanent storage option would also use memcached for most queries. We can think of the permanent storage model as a guaranteed cache-of-last-resort – the code serving API requests can assume that the results of a collect will be present and will never have to invoke a collect synchronously. The cache-only strategy will at some point encounter situations where the results of a collect are unavailable (or partially unavailable), and it has to choose between doing the work inline or failing the request.

Another way to think about it is that in the permanent storage approach, the API's source of truth is the Django model. With a cache-only strategy, the ultimate source of truth is the modulestore.
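
To make the asynchronous path concrete, here is a minimal sketch of the publish-to-collect wiring. It assumes edx-platform's course_published signal; collect_blocks and store_collected_blocks are hypothetical stand-ins for the real collect and storage code, not actual APIs.

    from celery import shared_task
    from django.dispatch import receiver
    from opaque_keys.edx.keys import CourseKey
    from xmodule.modulestore.django import SignalHandler


    @receiver(SignalHandler.course_published)
    def listen_for_course_publish(sender, course_key, **kwargs):
        # Never run the expensive collect inline in the publish path;
        # hand it off to a celery worker instead.
        update_collected_blocks.delay(str(course_key))


    @shared_task
    def update_collected_blocks(course_key_string):
        course_key = CourseKey.from_string(course_key_string)
        block_structure = collect_blocks(course_key)           # hypothetical: run all registered collects
        store_collected_blocks(course_key, block_structure)    # hypothetical: write MySQL model and/or memcached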

Situation-by-situation comparison: strategy with permanent storage (Django model, MySQL) vs. strategy with cache-only (Django cache, memcached).

Bootstrapping

Permanent storage: Management command. The biggest issue with this is that by default it'll be single-threaded. Assuming an average collect phase of 5s and 500 courses in the catalog, we're looking at 40+ minutes to run this command.

In general, I think we need to structure our management commands so that they emit the signals for the actual celery processing jobs (whether they be mock publishes or more targeted), so that we can better take advantage of worker parallelism (see the sketch after this row).

The other important point here is that bootstrapping off of modulestore().get_courses() is insufficient, because it will miss CCX courses. We can, however, bootstrap off of course_overviews (which will be much faster to query anyhow).

Cache-only options:

  1. Management command. Like permanent storage, we can assume that a missing cache entry means that course doesn't exist, and fail API requests that ask for that course in the meanwhile.
  2. Similar to #1, but instead of a management command, have the API requests that fail trigger celery tasks to build the collect data.
    1. Someone could be a jerk and start flooding the API with bogus courses. I'm not sure if that could have enough of an impact to slow down real publish processing.
  3. Invoke collect synchronously if entry is missing.
    1. Will cause a latency spike, but the system should recover shortly, assuming we have enough gunicorn workers.
    2. If a course's collect grows so expensive that it exceeds gunicorn worker timeouts (30s default), a course might never be recoverable.
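
As referenced above, here is a sketch of a bootstrap command that fans the work out to celery workers rather than collecting inline. The CourseOverview import path is approximate, and update_collected_blocks is the hypothetical task from the earlier sketch.

    from django.core.management.base import BaseCommand
    from openedx.core.djangoapps.content.course_overviews.models import CourseOverview


    class Command(BaseCommand):
        help = "Enqueue a collect task for every known course (CCX included)."

        def handle(self, *args, **options):
            # course_overviews is far cheaper to enumerate than
            # modulestore().get_courses(), and it doesn't miss CCX courses.
            course_ids = list(CourseOverview.objects.values_list('id', flat=True))
            for course_id in course_ids:
                # Fan out to workers so 500 courses don't take 40+ minutes single-threaded.
                update_collected_blocks.delay(str(course_id))
            self.stdout.write("Enqueued {} collect tasks".format(len(course_ids)))
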
Error Recovery (publishes lost)

Permanent storage: Management command with a time argument, so it knows only to rebuild things in a certain time period. We could also have the command check the publish dates from the modulestore, so that we can avoid doing unnecessary work.

Cache-only options:

  1. Management command.
  2. Set a timeout on cache entries and rebuild synchronously if the entry is missing. This will cause certain requests to be slow, but the system would eventually recover from lost publish notifications.
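
For option 2, the get-or-rebuild pattern is straightforward with Django's cache API; a sketch, where the key scheme, TTL, and collect_blocks helper are assumptions:

    from django.core.cache import cache

    COLLECT_TTL = 7 * 24 * 60 * 60  # entries expire weekly, so lost publishes eventually self-heal


    def get_collected_blocks(course_key):
        cache_key = 'block_collect.{}'.format(course_key)
        data = cache.get(cache_key)
        if data is None:
            # Slow path: the entry expired or a publish was lost; rebuild inline.
            data = collect_blocks(course_key)   # hypothetical collect helper
            cache.set(cache_key, data, COLLECT_TTL)
        return data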

 

Data corruption

Permanent storage: Same management command as above to rebuild. Possibly Django admin to manually remove.

Cache-only options:

  1. Management command.
  2. Switching cache key prefix (config) would effectively flush all old data entries.
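
For option 2, the prefix switch is just a settings change; a sketch (backend class and location are placeholders):

    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': 'cache.example.com:11211',
            # Bumping the prefix orphans every previously written collect entry,
            # which is effectively a flush without touching the memcached cluster.
            'KEY_PREFIX': 'block_collect.v2',   # was 'block_collect.v1'
        },
    }
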
Invalidate on code change

Permanent storage: Transformers are versioned, so we could create a management command that collects just the missing information (see the sketch below). Similarly, XBlock field data should be captured separately from each other. One thing to note is that collects and the transforms that use them don't have to go out in the same release. This does make things more complicated for other installs.

Cache-only: Again, we have the choice of doing things the same way as permanent storage, or of pushing the work into the synchronous API request-reply.
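
A sketch of how version gating could drive a partial rebuild. The transformer .name()/.VERSION attributes and the stored-version lookup are assumptions, not the actual API:

    def outdated_transformers(course_key, registered_transformers):
        # Hypothetical lookup of the versions recorded at the last collect.
        stored_versions = get_stored_transformer_versions(course_key)
        return [
            transformer for transformer in registered_transformers
            if stored_versions.get(transformer.name()) != transformer.VERSION
        ]


    def refresh_course(course_key, registered_transformers):
        stale = outdated_transformers(course_key, registered_transformers)
        if stale:
            # Re-collect only the stale pieces; data for up-to-date transformers is left alone.
            collect_blocks(course_key, transformers=stale)
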
Memcached cluster switchover

Permanent storage: No action needed. Momentary spike in MySQL traffic, but it shouldn't be enough to affect overall site latency.

Cache-only: Will likely cause a site latency spike or temporarily fail many requests. Assuming that we have hundreds of active courses and relatively expensive collect phases, we could tie up many gunicorn workers. However, things would likely recover shortly. Since memcached is the only place things are being stored, a cluster switchover is equivalent to Bootstrapping above, and we can pick one of those strategies.

Debugging

Permanent storage: Django Admin could give basic information about what was collected.

Cache-only: Would need to rely on debug functionality built into the API to access the collected data. Harder to reproduce for local testing.

Recommendations

My inclination is to go with permanent storage, because it reduces user-facing worst-case behavior and gives useful debug information. The scale of storage is also small enough that it shouldn't be burdensome.

In terms of management commands to rebuild things, I'd like to create a generic publish-signal command, and make it the responsibility of the individual listening tasks to determine whether or not they need to do work, and how much (e.g. collecting missing pieces). I've gone back and forth on this, but I'm afraid of having too many code paths, or forcing people upgrading from Cypress to Elm to run five different bootstrapping scripts (course overviews, course structures, block transforms, etc.).
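
As a sketch of that shape (reusing the hypothetical helpers from the sketches above): the generic command only re-fires the publish signal, and each listener short-circuits when its data is already current.

    from django.dispatch import receiver
    from xmodule.modulestore.django import SignalHandler


    @receiver(SignalHandler.course_published)
    def rebuild_collected_blocks_if_needed(sender, course_key, **kwargs):
        # The generic bootstrap/recovery command simply re-fires course_published for
        # every course; it is this listener's job to decide how much work is needed.
        if not outdated_transformers(course_key, registered_transformers()):  # hypothetical helpers
            return  # data is current: the mock publish is a no-op
        update_collected_blocks.delay(str(course_key))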