Backfill Strategy
We are backfilling grades data for all courses on edX.org. The backfill task runs per course, on one batch of users at a time. Our goal is to choose the batch size that makes the overall data processing most efficient.
Constraints
These are the factors we believe may affect our choice of batch size.
Constant overhead
Each task does a certain amount of work to generate shared data for the course, and then computes the grades for each individual user. Generating the shared data is a constant overhead on each batch; generating that data for a batch of five users takes approximately the same amount of time as generating it for a batch of 1000 users. If the batch size is too small, this overhead may become a significant burden on processing.
The overhead consists primarily of collecting the block structures for the course. Nimisha is also investigating pulling user-specific SQL data down once per batch, rather than once per user, and batching the way the data is saved in a similar fashion. This will reduce the number of queries per batch to a constant, but the cost of each query may still increase with the number of users per batch.
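As a rough illustration of why small batches are costly, the sketch below models a batch as a fixed overhead plus a per-user cost. The function name and the timings (30 seconds of overhead, 0.5 seconds per user) are hypothetical placeholders chosen for the example, not measurements.

```python
def overhead_fraction(batch_size, overhead_s, per_user_s):
    """Fraction of one batch's run time spent on shared (constant) work.

    Assumes a simple linear model: time = overhead_s + per_user_s * batch_size.
    """
    total_s = overhead_s + per_user_s * batch_size
    return overhead_s / total_s


# Hypothetical timings: 30 s of shared setup, 0.5 s per learner.
for size in (5, 100, 1000):
    print(size, round(overhead_fraction(size, 30.0, 0.5), 3))
```

Under these made-up numbers, a batch of five users spends most of its time on overhead, while a batch of 1000 spends only a small share. The real proportions depend on the timings we measure.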
Course complexity
Computing grades for a given student will take time proportional to the number of blocks and subsections in the course. Thus larger courses will take longer to process than small courses.
Course enrollment
The time to process an entire course will increase linearly with the number of learners enrolled in it, but enrollment should not measurably affect the processing time for a single batch. Caveat: The last batch for each course will usually not be full, so that batch will run more quickly, but will spend proportionally more time on constant overhead. There are a number of courses with only a few enrolled learners, each of which will have only a single unfilled batch. We do not believe this will materially affect our choice of batch size. However, if our batch size is too large, most batches will be unfilled, and that will inhibit our ability to easily predict overall running time. Thus we want most batches to be filled.
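One way to check the "most batches should be filled" condition for a candidate batch size is to count full and partial batches across a list of course enrollments. This is a minimal sketch; the function names and the sample enrollment figures are hypothetical.

```python
def batch_layout(enrollment, batch_size):
    """Return (number of batches, size of the last batch) for one course."""
    if enrollment == 0:
        return 0, 0
    n_batches = (enrollment + batch_size - 1) // batch_size  # ceiling division
    last_size = enrollment - (n_batches - 1) * batch_size
    return n_batches, last_size


def filled_fraction(enrollments, batch_size):
    """Fraction of all batches, across all courses, that are completely full."""
    total = full = 0
    for enrollment in enrollments:
        n_batches, last_size = batch_layout(enrollment, batch_size)
        total += n_batches
        # Only the last batch of a course can be partially filled.
        full += n_batches - (1 if 0 < last_size < batch_size else 0)
    return full / total if total else 0.0


# Hypothetical enrollments: one tiny course and one moderate course.
sample = [3, 2500]
print(filled_fraction(sample, 100))
print(filled_fraction(sample, 1000))
```

Running this against the real enrollment distribution would show how quickly the share of filled batches drops as the batch size grows.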
Table size
As we process courses, the size of the indexes on the grades table will grow, slowing down later inserts. This table will be significantly smaller than, for instance, the CSM table, so we do not believe the slowdown will be significant enough to affect our plans.
Error handling
Given the amount of data involved, and the age of some of that data, we expect failures to occur. We plan to address these errors by correcting the issue (either in the data or in the grade-computing task) and then retrying the failed tasks. A larger batch size will impose a larger burden on retrying failed tasks. If a single task takes an hour, each failure adds an hour of processing time. If each task takes one minute, then a single error will only cost a minute of processing time. Note that this could be mitigated if we had a smarter error recovery strategy.
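The retry burden can be sketched with the same linear cost model (fixed overhead plus per-user time). This example assumes, pessimistically, that each bad record lands in a different batch and that a failed batch is rerun in full exactly once after the fix. The function name and all timings are hypothetical.

```python
def rerun_cost_seconds(bad_records, batch_size, overhead_s, per_user_s):
    """Extra processing time from rerunning every batch that hit a bad record.

    Assumes each bad record falls in a distinct batch and each failed
    batch is rerun in full exactly once.
    """
    batch_time_s = overhead_s + per_user_s * batch_size
    return bad_records * batch_time_s


# Hypothetical: 10 bad records, 30 s overhead, 0.5 s per learner.
print(rerun_cost_seconds(10, 100, 30.0, 0.5))
print(rerun_cost_seconds(10, 1000, 30.0, 0.5))
```

With the same number of bad records, the larger batch size pays several times more in reruns, which is the trade-off against the overhead savings described earlier.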
Worker count
More Celery workers will be able to process the data more quickly. At some point, though, we will be limited by database throughput, or by the impact we have on site performance for learners. We would rather take longer to process the tasks than degrade site performance.
Problems completed
If most users haven't completed many problems, the processing may be faster. There is a small risk that this will make stage data unreliable, as people often enroll in stage courses to test features without actually answering questions within the course.
Data collection
In order to validate our assumptions above, we plan to run several test batches.
We will select one small, one medium, and one large course (measured in terms of number of subsections and blocks) with a moderate number of enrolled learners (10-20 thousand). For each course we will compute the grades with batch sizes of 100 and 1000 learners.
In addition, to validate the hypothesis that total course enrollment does not affect batch processing time, we will run the same batches against a comparably sized (medium) course with a large number of learners (>100,000). This will probably be the demo course.
We also plan to run one of the tasks twice in succession to determine whether we need to delete the table between test runs on a given course.
Based on what we see, we may decide to do a couple more test runs with batch sizes other than 100 and 1000, or with more high or low enrollment courses.
Using the data
The data we collect will help us determine:
- How much benefit we get from increasing batch size (that is, whether the run time of a batch in a given course increases sub-linearly with the number of users).
- How much course size affects processing time.
- Whether different batch sizes make sense for different course sizes.
- Whether total course enrollment affects batch processing time.
We expect that we will come out of this with either a single batch size to use for all courses, or a separate size for small, medium, and large complexity courses. We expect that we won't need to vary the batch size based on enrollments, but if we do, that may introduce an extra set of batch sizes.
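To make the batch-size question concrete, the two planned batch sizes (100 and 1000) are enough to estimate a constant overhead and a per-user cost for a course, if we assume run time is linear in batch size. The sketch below shows that calculation; the function name and the timings passed to it are hypothetical, not collected data.

```python
def fit_linear_cost(size_a, time_a, size_b, time_b):
    """Solve time = overhead + per_user * size from two (size, time) points."""
    per_user_s = (time_b - time_a) / (size_b - size_a)
    overhead_s = time_a - per_user_s * size_a
    return overhead_s, per_user_s


# Hypothetical measurements: 80 s for 100 learners, 530 s for 1000 learners.
overhead_s, per_user_s = fit_linear_cost(100, 80.0, 1000, 530.0)
print(overhead_s, per_user_s)
```

Two points always fit a line exactly, so this cannot tell us whether the linear assumption actually holds. Checking that would need runs at one or more additional batch sizes.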