Course Blocks API Storage/Cache Requirements

Ticket: MA-1301

Transformers have collect and transform phases. The idea is to do all the heavy lifting during collect (including all modulestore access), and have a fast, lightweight, chained transform phase. The data we currently gather during collect for a course is as follows:

| Data Type | Description | Example | Small Course (uncompressed) | Small Course (compressed) | Large Course (uncompressed) | Large Course (compressed) |
|---|---|---|---|---|---|---|
| Structure | Mapping of Usage Keys to children. | | 26K | 9K | 560K | 200K |
| Transformer | Data that is not specific to a particular block. Any course-level configuration that might affect transform behavior would go here. | The list of user partitions defined for a course. | <1K | <1K | 3K | 1K |
| Per-Block Transformer | Data that the Transformer creates for a given block. All data is intended to be denormalized here, so that there's never a need to crawl up to ancestor blocks to make a decision – i.e. to shift work to happen in the collect phase and leave the performance critical transform phase as simple as possible. | Mapping of block keys to lists of groups that can see that block. | 91K / 42K | 66K / 14K | 4MB / 1.1MB | 2.6MB / 385K |
| XBlock Field | XBlock fields that Transformers request. Each transformer can ask for an arbitrary number of fields, which are then consolidated and collected by a separate mechanism. It's possible that this could be modeled as a Transformer collect as well. | graded, display_name, etc. | 8K | 3K | 275K | 96K |
| Total | | | 126K / 77K | 79K / 27K | 4.8MB / 1.9MB | 2.9MB / 682K |

For reference, I measured the edX Demo course as an example of "small" (141 block nodes), and MITx's 8.Mech as "large" (4195 block nodes). The two columns under each course represent the uncompressed pickle size and compressed pickle size. The per-block transformer data has two values for each box, indicating the space required to store the blocks individually vs. all together. Storing things together gives a much smaller overall size, but means that you can't render a subset of blocks without grabbing the entire thing (a problem we've seen with split mongo). More on this later.
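To make the phase split concrete, here's a minimal sketch of the shape of a Transformer (illustrative only; the method signatures, the group-access example, and the data passing are simplified stand-ins, not the actual Transformer API):

```python
class VisibilityTransformer(object):
    """Illustrative sketch of the two phases; not the actual Transformer API."""
    VERSION = 1  # bumped whenever the format of the collected data changes

    @classmethod
    def collect(cls, course_blocks, modulestore_fields):
        # Heavy phase: runs asynchronously on publish; all modulestore access
        # happens here. Output is denormalized per block so that transform never
        # needs to crawl up to ancestor blocks.
        collected = {}
        for block_key in course_blocks:
            collected[block_key] = modulestore_fields[block_key].get('group_access', [])
        return collected  # cached, keyed by (course, transformer name, VERSION)

    def transform(self, user_groups, course_blocks, collected):
        # Light phase: runs on every request, chained with other Transformers,
        # and touches only the pre-collected data.
        return [
            block_key for block_key in course_blocks
            if not collected[block_key] or user_groups & set(collected[block_key])
        ]
```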

Short term recommendation

Compress data before caching, and store per-block transformer data together instead of doing a multi-set by block. This should bring our large course data usage and per-request data fetch from 4.8MB to 682K.
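As a sketch of what that recommendation looks like in code (the cache key format and function names are hypothetical; the zlib compression and single-entry storage are the point):

```python
import pickle
import zlib

from django.core.cache import cache

# Hypothetical cache key format; one entry holds all per-block data for a course.
PER_BLOCK_KEY = 'block_cache.per_block.{course_key}'


def set_per_block_data(course_key, per_block_data):
    """Cache all per-block transformer data for a course as one compressed blob.

    per_block_data: dict of {block usage key: collected data for that block}.
    One compressed entry (instead of a multi-set of one entry per block) is what
    takes the large course from ~4.8MB of fetching down to ~682K.
    """
    blob = zlib.compress(pickle.dumps(per_block_data, pickle.HIGHEST_PROTOCOL))
    cache.set(PER_BLOCK_KEY.format(course_key=course_key), blob)


def get_per_block_data(course_key):
    blob = cache.get(PER_BLOCK_KEY.format(course_key=course_key))
    return pickle.loads(zlib.decompress(blob)) if blob is not None else None
```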

Usage Questions

1. How much space is used by the collect phase per course and block? Total expected usage given typical course sizes? (Don't forget CCX here.)

TLDR: 1MB / course, including CCX courses. Single digit GBs overall on prod.

Storage requirements are going to be dominated by per-block transformer data and very likely XBlock field data. Structure data is currently larger than XBlock field data, but structure a) is more bounded; and b) can be cut down significantly with more efficient representation. On the other hand, each new Transformer we build can add data to the other types. With the API as it currently stands, the data points we have are 27K to 682K, with most courses likely in the low hundreds of KB. That being said, we can easily imagine extra data inflating this to where 1MB courses would not be unusual.

  • ~100 bytes of compressed data per block, assuming block data is stored together in one chunk. Almost 7X that if stored separately.
  • Assume that as we add more data, 1MB courses will not be uncommon.
  • A CCX course counts as a course for these purposes.
  • We have no need to keep old versions around, so permanent storage need would be (1MB X number of courses) and cache needs would be (1MB X number of actively used courses). Even with some padding, we're looking at single digit GBs for the next year (unless CCX somehow becomes wildly popular, but even then, 10K CCX courses = ~10GB of storage, so...)

2. How often will updates be pushed?

New Relic shows peaks of 3-10 publishes / min, with many flat periods. That being said, publishing an update to a course that a CCX is built on will cause a cascade of publishes to be fired (one for every derived CCX), so we could see sudden bursts depending on how popular that feature is.

3. How many fetches and how much collected data is required to render the API call for the course or various subsets?

Currently, our "large" course requires 4.8MB of data fetching, but this goes down to < 700K with minimal changes. We can do this in one multi-get, possibly two if we have to split the block data up. We have > 4X of headroom before we start running into size limits on our memcached cluster for per-block transformers, and even that should be easy to split. Right now, we'd still fetch the whole thing to render a subset, but we do have a path towards enabling more efficient rendering in the future.
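A sketch of what the per-request fetch could look like using the standard Django cache multi-get (the key layout here is hypothetical):

```python
from django.core.cache import cache

# Hypothetical key layout: one entry per data type, per course.
KEY_TEMPLATES = (
    'block_cache.{course_key}.structure',
    'block_cache.{course_key}.transformer_data',
    'block_cache.{course_key}.per_block_data',   # single combined blob (see above)
    'block_cache.{course_key}.xblock_fields',
)


def fetch_collected_data(course_key):
    """Fetch everything a transform pass needs in a single memcached round trip."""
    keys = [template.format(course_key=course_key) for template in KEY_TEMPLATES]
    cached = cache.get_many(keys)  # one multi-get; returns only the keys it found
    missing = [key for key in keys if key not in cached]
    return cached, missing
```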

4. Are there areas where we need to enforce limits at an API level to guarantee performance?

TLDR: Cap node count if possible, monitor collect size by transform.

With no micro-optimization, the transform side delivers the large course in ~1.5s. Things we should consider:

  • Capping the total number of nodes on import. 5K would be great, but with some adjustments, the design may hold up through 10K nodes.
  • Limiting the amount of data that can be stored per Transformer and per Transformer+block. Right now we're averaging only a couple hundred bytes of per-block information, with a max of around 1.4K (mostly video student_view_data info). It's probably most reasonable to give Transformers a ceiling on how much data they can store in their collect phase, but it's unclear what to do when things go over. Log a warning? The collect runs asynchronously, so we can't somehow stop the publish. For now, I think this is a metric we should just monitor (see the sketch after this list).
  • We spend most of our transform time in graph traversal code. It's possible that this could be sped up, simplified, or cached.
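A sketch of that monitoring idea (the ceiling value and function name are made up for illustration):

```python
import logging
import pickle

log = logging.getLogger(__name__)

# Hypothetical ceiling; the right number would come from monitoring real courses.
PER_TRANSFORMER_COLLECT_CEILING = 256 * 1024  # bytes, per Transformer per course


def record_collect_size(course_key, transformer_name, collected_data):
    """Measure a Transformer's collect output and warn when it exceeds the ceiling.

    This runs inside the async collect task, after the publish has already
    happened, so logging/metrics are the only real options when we go over.
    """
    size = len(pickle.dumps(collected_data, pickle.HIGHEST_PROTOCOL))
    if size > PER_TRANSFORMER_COLLECT_CEILING:
        log.warning(
            'Collect for %s on %s is %d bytes (ceiling %d)',
            transformer_name, course_key, size, PER_TRANSFORMER_COLLECT_CEILING,
        )
    return size
```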

Comparing Permanent Storage vs. Cache-only

Regardless of whether we go for a permanent storage or cache-only strategy, we will still be using celery tasks to run collect asynchronously in response to publish events during the normal course of operation (collects are expensive). The permanent storage option would also use memcached for most queries. We can think of the permanent storage model as a guaranteed cache-of-last-resort – the code serving API requests can assume that the results of a collect will be present and will never invoke a collect synchronously. The cache-only strategy will at some point encounter situations where the results of a collect are unavailable (or partially unavailable), and it has to choose between doing the work inline or failing the request.

Another way to think about it is that in the permanent storage approach, the API's source of truth is the Django model. With a cache-only strategy, the ultimate source of truth is the modulestore.
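To make that difference concrete, a sketch of the two read paths (the key format, exception, and loader callable are hypothetical):

```python
from django.core.cache import cache


class CollectedDataUnavailable(Exception):
    """Cache-only strategy: raised when a collect result is missing from the cache."""


def get_collected_data(course_key, load_from_storage=None):
    """Sketch of the two read paths described above.

    load_from_storage: in the permanent-storage model, a callable that loads the
    collect result from the Django model (MySQL), which is the source of truth.
    In the cache-only model this is None, and a miss means we either fail the
    request or run the expensive collect inline.
    """
    key = 'block_cache.collected.{}'.format(course_key)
    data = cache.get(key)
    if data is not None:
        return data
    if load_from_storage is not None:
        # Permanent storage: memcached is just a guaranteed cache-of-last-resort.
        data = load_from_storage(course_key)
        cache.set(key, data)
        return data
    # Cache-only: surface the miss; the caller decides whether to fail or collect inline.
    raise CollectedDataUnavailable(course_key)
```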

Each situation below compares the strategy with permanent storage (Django model, MySQL) against the strategy with cache-only (Django cache, memcached).

Bootstrapping

Permanent storage: Management command. The biggest issue with this is that by default it'll be single threaded. Assuming an average collect phase of 5s and 500 courses in the catalog, we're looking at 40+ mins to run this command.

In general, I think we need to structure our management commands so that they emit the signals for the actual celery processing jobs (whether they be mock publishes or more targeted), so that we can better take advantage of worker parallelism.

The other important point here is that bootstrapping off of modulestore().get_courses() is insufficient, because it will miss CCX courses. We can, however, bootstrap off of course_overviews (which will be much faster to query anyhow).

Cache-only options:

  1. Management command. Like permanent storage, we can assume that a missing cache entry means that course doesn't exist, and fail API requests that ask for that course in the meantime.
  2. Similar to #1, but instead of a management command, have failed API requests trigger celery tasks to build the collect data.
    1. Someone could be a jerk and start flooding the API with bogus courses. I'm not sure if that could have enough of an impact to slow down real publish processing.
  3. Invoke collect synchronously if the entry is missing.
    1. Will cause a latency spike, but the system should recover shortly, assuming we have enough gunicorn workers.
    2. If a course's collect grows so expensive that it exceeds gunicorn worker timeouts (30s default), a course might never be recoverable.

Error Recovery (publishes lost)

Permanent storage: Management command with a time argument, so it knows only to rebuild things in a certain time period. We could also have the command check the publish dates as well, so that we can avoid doing unnecessary work.

Cache-only options:

  1. Management command.
  2. Set a timeout on cache entries and rebuild synchronously if the entry is missing. This will cause certain requests to be slow, but the system would eventually recover from lost publish notifications.

Data corruption

Permanent storage: Same management command as above to rebuild. Possibly Django admin to manually remove.

Cache-only options:

  1. Management command.
  2. Switching the cache key prefix (config) would effectively flush all old data entries.

Invalidate on code change

Permanent storage: Transformers are versioned, so we could create a management command that collects just the missing information. Similarly, XBlock field data should be captured separately from each other. One thing to note is that collects and the transforms that use them don't have to go out in the same release. This does make things more complicated for other installs.

Cache-only: Again, we have the choice of doing things the same way as permanent storage, or to push the work into the synchronous API request-reply.

Memcached cluster switchover

Permanent storage: No action needed. Momentary spike in MySQL traffic, but it shouldn't be enough to affect overall site latency.

Cache-only: Since memcached is the only place things are being stored, a cluster switchover is equivalent to Bootstrapping, and we can pick one of those strategies.

Debugging

Permanent storage: Django Admin could give basic information about what was collected.

Cache-only: Would need to rely on debug functionality built into the API to access the collected data. Harder to reproduce for local testing.

Recommendations

My inclination is to go with permanent storage, because it reduces user-facing worst case behavior and gives useful debug information. The scale of storage is also small enough that it shouldn't be burdensome.

In terms of management commands to rebuild things, I'd like to create a generic publish-signal command, and make it the responsibility of the individual listening tasks to determine whether or not they need to do work, and how much (e.g. collecting missing pieces). I've gone back and forth on this, but I'm afraid of having too many code paths, or forcing people upgrading from Cypress to Elm to run five different bootstrapping scripts (course overviews, course structures, block transforms, etc.).
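Roughly, I'm imagining something like this (a sketch only; the command and the exact CourseOverview/SignalHandler usage are assumptions that would need checking against the actual code):

```python
from django.core.management.base import BaseCommand

from openedx.core.djangoapps.content.course_overviews.models import CourseOverview
from xmodule.modulestore.django import SignalHandler


class Command(BaseCommand):
    """Hypothetical command: re-emit course_published for every known course.

    Each listening celery task is then responsible for deciding whether it has
    any work to do (e.g. only rebuilding missing or out-of-date collects).
    Iterating CourseOverview (rather than modulestore().get_courses()) keeps
    CCX courses in the picture and is much faster to query.
    """
    help = 'Re-fire course_published signals so async listeners can rebuild their data.'

    def handle(self, *args, **options):
        for course_key in CourseOverview.objects.values_list('id', flat=True):
            SignalHandler.course_published.send(sender=self.__class__, course_key=course_key)
```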

Next Steps: More Efficient Caching

Goals

  1. Store data as compactly as possible.
  2. Don't query data for nodes we don't need (e.g. all the course's structure data to render a vertical).
  3. Don't query data for transforms we don't need.
  4. Data should be stored such that we can rebuild individual Transformer collects without rebuilding everything.

Proposal

  1. A Block Cache Unit (BCU) has each of the following, stored as a separate key:
    1. structure
    2. one key for each Transformer's collect, for all the nodes in the structure (inside would be both transformer-level data as well as per-block data).
    3. one key for each requested XBlock field
  2. Two tiers:
    1. Course wide BCU in permanent storage.
    2. Memcached sub-tree BCUs for any particular requested part of the course, quickly derived from the course level BCU while serving requests.
  3. Packing data:
    1. Because the collected data always has an associated BCU, we don't actually need to store the location keys in anything but the structure. Everything else can assume an order based on sorting the keys found in the structure, and use an array to represent values. This has a huge impact on our ability to separate out collect data, since for many transformers, the location key is actually much larger than the data they want to collect.
      1. Another possibility is using something like xxHash to represent the locations for really sparse data, but I don't think that's necessary at this time.
    2. We can exploit the repetitive or sparse nature of many fields. For instance, if we store 4K entries of "graded":false in the middle of a lot of other attribute data, it can take up a fair amount of space. However, when flattened out into a list with just this attribute's values, the large course (which has 900+ graded items) compresses down to 72 bytes. Doing this on the XBlock Field data for the large course brought the compressed size of that section down from 96K to 23K. Extrapolating out from that (and assuming that we eliminate some redundant structure data), we could have a 4X storage improvement on the part of our system that's most likely to grow.
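A sketch of the packing idea from 3.1 and 3.2 (helper names are made up; the sorted-key ordering and per-field flattening are the point):

```python
import pickle
import zlib


def pack_field(structure_block_keys, field_value_by_block):
    """Pack one XBlock field (or one Transformer's per-block data) for a BCU.

    structure_block_keys: the usage keys already stored in the BCU's structure entry.
    field_value_by_block: {usage key: value of this field for that block}.

    Both pack and unpack sort the structure's keys the same way, so the stored
    payload is just a list of values in that order; the (comparatively large)
    location keys never appear outside the structure entry.
    """
    ordered = sorted(structure_block_keys, key=str)
    values = [field_value_by_block.get(key) for key in ordered]
    return zlib.compress(pickle.dumps(values, pickle.HIGHEST_PROTOCOL))


def unpack_field(structure_block_keys, packed):
    values = pickle.loads(zlib.decompress(packed))
    return dict(zip(sorted(structure_block_keys, key=str), values))
```

Because a field like graded is mostly one long run of identical values, isolating it into its own list is what lets it compress so well; that's where the 96K-to-23K drop on the large course's XBlock Field data comes from.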