...
TLDR: Cap node count if possible, monitor collect size by transform.
Even with no micro-optimization, the transform side delivers the large course in ~1.5s. Things we should consider:
...
Regardless of whether we go with permanent storage or a cache-only strategy, we will still use celery tasks to run collects asynchronously in response to publish events during normal operation (collects are expensive). The permanent storage option would also use memcached for most queries. We can think of the permanent storage model as a guaranteed cache-of-last-resort: the code serving API requests can assume that the results of a collect will be present, and will never have to invoke a collect synchronously. The cache-only strategy will at some point encounter situations where the results of a collect are unavailable (or partially unavailable), and it has to choose between doing the work inline or failing the request.
Another way to think about it is that in the permanent storage approach, the API's source of truth is the Django model. With a cache-only strategy, the ultimate source of truth is the modulestore.
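The difference between the two read paths can be sketched in Python. This is only an illustration of the lookup logic described above; `cache`, `permanent_store`, and `run_collect` are hypothetical stand-ins (plain dicts and a dummy function), not actual edx-platform APIs:

```python
# Illustrative sketch: how an API request resolves collected data under
# each strategy. Dicts stand in for memcached and the MySQL-backed
# Django model; run_collect() stands in for the expensive collect phase.

cache = {}            # stand-in for memcached
permanent_store = {}  # stand-in for the Django model (MySQL)

def run_collect(course_key):
    # Stand-in for the expensive traversal of the modulestore.
    return {"course": course_key, "blocks": ["block1", "block2"]}

def get_collected_permanent(course_key):
    """Permanent-storage strategy: check cache, then fall back to MySQL.

    The API can assume a publish-time celery task already wrote the
    collect results, so it never has to collect synchronously.
    """
    if course_key in cache:
        return cache[course_key]
    data = permanent_store[course_key]  # guaranteed cache-of-last-resort
    cache[course_key] = data
    return data

def get_collected_cache_only(course_key, collect_inline=True):
    """Cache-only strategy: on a miss, collect inline or fail the request."""
    if course_key in cache:
        return cache[course_key]
    if not collect_inline:
        raise LookupError(f"collect results unavailable for {course_key}")
    data = run_collect(course_key)  # ties up the web worker
    cache[course_key] = data
    return data
```

The key asymmetry is in the miss path: the permanent-storage version only ever reads, while the cache-only version must either do expensive work inline or raise.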
Situation | Strategy with Permanent Storage (Django model, MySQL) | Strategy with Cache-only (Django cache, memcached)
---|---|---
Bootstrapping | Management command. The biggest issue with this is that by default it'll be single-threaded. Assuming an average collect phase of 5s and 500 courses in the catalog, we're looking at 40+ mins to run this command. In general, I think we need to structure our management commands so that they emit the signals for the actual celery processing jobs (whether they be mock publishes or more targeted), so that we can better take advantage of worker parallelism. The other important point here is that bootstrapping off of … | Options: …
Error Recovery (publishes lost) | Management command with a time argument, so it knows only to rebuild things in a certain time period. We could also have the command check the publish dates from the modulestore, so that we can avoid doing unnecessary work. | Options: …
Data corruption | Same management command as above to rebuild. Possibly Django admin to manually remove. | Options: …
Invalidate on code change | Transformers are versioned, so we could create a management command that collects just the missing information. Similarly, XBlock field data should be captured separately from each other. Note that collects and the transforms that use them don't have to go out in the same release, though this does make things more complicated for other installs. | Again, we have the choice of doing things the same way as permanent storage, or of pushing the work into the synchronous API request-reply.
Memcached cluster switchover | No action needed. Momentary spike in MySQL traffic, but it shouldn't be enough to affect overall site latency. | Will likely cause a site latency spike or temporarily fail many requests: assuming we have hundreds of active courses and relatively expensive collect phases, we could tie up many gunicorn workers, though things would likely recover shortly. Since memcached is the only place things are being stored, a cluster switchover is equivalent to Bootstrapping, and we can pick one of those strategies.
Debugging | Django Admin could give basic information about what was collected. | Would need to rely on debug functionality built into the API to access the collected data. Harder to reproduce for local testing.
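The Bootstrapping row's serial estimate, and the signal-emitting command structure it argues for, can be sketched as follows. All names here (`task_queue`, `emit_course_published`, `bootstrap`) are hypothetical stand-ins, not real edx-platform or celery APIs:

```python
# Back-of-envelope check of the serial bootstrapping estimate, plus a
# sketch of a command that fans work out to celery workers instead of
# collecting inline.

COLLECT_SECONDS = 5   # assumed average collect phase per course
CATALOG_SIZE = 500    # assumed number of courses in the catalog

# Single-threaded: 500 courses * 5s per collect = 2500s, i.e. ~42 minutes.
serial_minutes = CATALOG_SIZE * COLLECT_SECONDS / 60

task_queue = []  # stand-in for the celery broker

def emit_course_published(course_key):
    """Enqueue the same signal a real publish would emit."""
    task_queue.append(course_key)

def bootstrap(course_keys):
    # The command itself does no collect work; it only fans out publish
    # signals, so wall-clock time is bounded by worker count rather than
    # by catalog size.
    for key in course_keys:
        emit_course_published(key)
```

With N workers draining the queue, the same workload finishes in roughly `serial_minutes / N`, which is the parallelism argument made in the table above.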
Recommendations
My inclination is to go with permanent storage, because it reduces worst-case user-facing behavior and gives useful debug information. The scale of storage is also small enough that it shouldn't be burdensome.
In terms of management commands to rebuild things, I'd like to create a generic publish-signal command, and make it the responsibility of the individual listening tasks to determine whether or not they need to do work, and how much (e.g. collecting missing pieces). I've gone back and forth on this, but I'm afraid of having too many code paths, or forcing people upgrading from Cypress to Elm to run five different bootstrapping scripts (course overviews, course structures, block transforms, etc.).
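A minimal sketch of that responsibility split, assuming versioned transformers and modulestore publish dates as the staleness signals. Every name here (`stored`, `needs_collect`, `on_course_published`, `modulestore_published_at`) is hypothetical, not an existing edx-platform API:

```python
# Sketch of a listening task that decides for itself whether it needs to
# do work when the generic publish signal fires.

from datetime import datetime, timezone

TRANSFORMER_VERSION = 2  # bumped whenever a transformer's collect changes

stored = {}  # course_key -> {"version": int, "collected_at": datetime}

def modulestore_published_at(course_key):
    # Stand-in for asking the modulestore when the course last published.
    return datetime(2015, 6, 1, tzinfo=timezone.utc)

def needs_collect(course_key):
    record = stored.get(course_key)
    if record is None:
        return True                               # never collected
    if record["version"] < TRANSFORMER_VERSION:
        return True                               # code changed
    # Stale if the course published after our last collect.
    return record["collected_at"] < modulestore_published_at(course_key)

def on_course_published(course_key):
    """Listener for the generic publish signal."""
    if not needs_collect(course_key):
        return "skipped"
    # ... run the (expensive) collect here ...
    stored[course_key] = {
        "version": TRANSFORMER_VERSION,
        "collected_at": datetime.now(timezone.utc),
    }
    return "collected"
```

Because every listener applies its own `needs_collect` check, one generic publish-signal command can safely drive course overviews, course structures, and block transforms alike: re-running it against up-to-date data is cheap, which avoids maintaining five separate bootstrapping scripts.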