ClickHouse as an LRS Data Store

Overview

As part of the project to replace Insights, we are looking at several different pieces of technology. A central technology is the data store, which has (or lacks) functionality that drives many other adoption decisions.

Ideally a data store for an Open edX LRS would meet the following criteria:

  • Free and open source

  • Lightweight enough to be reasonably run in Tutor

  • Scales down (can be run on commodity hardware for small installs)

  • Scales up (can handle billions of rows affordably)

  • Can be deployed in a fault tolerant way

  • Has a cloud offering for medium-large installs that don’t want to run their own

There are a few potential options that may check all of those boxes, but one that came recommended by a few people in the community was ClickHouse. This issue was created to get hands-on experience with it and an understanding of whether it can work for us.

Summary

The good…

ClickHouse is a very capable database that performed extremely well at a billion rows under Docker on my M1 Mac. It was easy to run and configure as a Tutor plugin. Using the default configuration it took up little CPU and ~500 MB of RAM, even fully loaded. That’s a fair amount of memory, but not out of line with other services we run in Tutor.

I was able to load the billion rows from a single-threaded Python script in just under 18 hrs. Under load, queries the table had been optimized for (e.g. “Count of enrollment events for a course”) completed in 0.09 - 0.13 seconds while loading the last 100 million rows. Un-optimized queries (“Count of enrollments for this learner”) completed in 10 - 12 secs. Details on this test are below.

I very much believe that ClickHouse could support even our largest install bases if configured correctly.

Native tools worked well and were easy to set up, as was PyCharm / DataGrip integration.

There is native MySQL connectivity to import data, which could be helpful for synchronizing permissions or other data from our various MySQL sources.

There is native support for loading data from S3 and other stores.
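
As a rough illustration of those two integration points, the sketch below reads from MySQL and S3 using ClickHouse’s mysql() and s3() table functions, driven from Python. The driver choice (clickhouse-driver), hosts, credentials, bucket, and column structure are assumptions for illustration, not the setup used in this test.

    # Sketch: reading MySQL and S3 data directly inside ClickHouse queries.
    # Hosts, credentials, bucket, and column structures are placeholders.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # mysql() reads rows straight out of a MySQL table, e.g. to sync users
    # or permissions from an LMS database.
    users = client.execute(
        """
        SELECT id, username
        FROM mysql('mysql:3306', 'edxapp', 'auth_user', 'read_only_user', 'secret')
        LIMIT 10
        """
    )

    # s3() reads files from object storage; combined with INSERT ... SELECT
    # it can be used to bulk load xAPI statements.
    row_count = client.execute(
        """
        SELECT count()
        FROM s3('https://example-bucket.s3.amazonaws.com/xapi/*.jsonl',
                'JSONEachRow',
                'event_id String, emission_time String, event_str String')
        """
    )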

We are considering Apache Superset as a display layer for this data, and integrating with it was trivial since ClickHouse is natively supported.

The bad…

ClickHouse is under fairly heavy development and seems to be just now reaching maturity as a product in this space. I was able to use windowing functions for displaying time series data in Superset, but they were introduced fairly recently and the documentation around them was conflicting. There were a handful of features/functions like this (JSON column availability stands out).
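
For reference, the shape of a windowing query for time series (e.g. a running total of daily enrollments) looks roughly like the sketch below. The table and column names follow the table description later in this document and the verb value is the standard xAPI “registered” IRI, but this is illustrative rather than the exact query used in Superset.

    # Sketch: a window function producing a running total of daily enrollments.
    # Table/column names follow the table described under Testing Methodology;
    # not the exact query used in Superset.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    rows = client.execute(
        """
        SELECT
            day,
            enrollments,
            sum(enrollments) OVER (ORDER BY day) AS running_total
        FROM
        (
            SELECT toDate(emission_time) AS day, count() AS enrollments
            FROM xapi.xpi_events_all
            WHERE verb = 'http://adlnet.gov/expapi/verbs/registered'
            GROUP BY day
        )
        ORDER BY day
        """
    )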

Some SQL syntax can be confusing: some things are case-sensitive while others aren’t, transactions aren’t supported, and there are other deviations from the ANSI spec (https://clickhouse.com/docs/en/sql-reference/ansi). None of these things prevented me from doing anything I needed to do, but they are important tradeoffs made to get the level of performance we see.

There are a lot of configuration options, some of them very low level. Figuring out reference implementations and reasonable Tutor defaults will require a fair amount of work.

These are not unusual problems in this space, and I didn’t find any of them to be deal breakers.

The ugly…

The only truly unusual thing that I did in my test was the use of a Buffer table to cache data for bulk insert into the final materialized table. This is the recommended way to maintain performance on write-heavy tables and does introduce some complexity.

Namely, data can be delayed in reaching the final table (though queries can be directed at the buffer table, which will return results from both the buffer and the materialized table). Buffers are stored in RAM, and so can be prone to data loss in the event of a crash, though we should be planning to recover from such situations anyway.
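
For context, a Buffer table is just another ClickHouse table whose engine holds incoming rows in memory and flushes them to a destination table once time/row/byte thresholds are hit. A minimal sketch (the threshold values here are illustrative, not the ones used in this test):

    # Sketch: a Buffer table in front of the main events table.
    # Buffer() arguments are: destination database and table, num_layers,
    # then min/max time (seconds), min/max rows, and min/max bytes before a
    # flush. Reads against the buffer table also include rows already flushed
    # to the destination table. Threshold values below are illustrative.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    client.execute(
        """
        CREATE TABLE IF NOT EXISTS xapi.xpi_events_all_buf
        AS xapi.xpi_events_all
        ENGINE = Buffer(xapi, xpi_events_all, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
        """
    )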

Otherwise I didn’t find anything particularly troublesome in ClickHouse and think it can be a very solid candidate for our use case.

Testing and Goals

I ran several increasingly large rounds of testing, using https://github.com/bmtcril/xapi-db-load to generate fake xAPI events and batch load them into ClickHouse in a single thread. The goal wasn’t to test how quickly events could be loaded, but rather to check performance with a moderately large amount of data under pretty strict hardware constraints. In addition to the goals stated in the original GitHub issue, I was hoping to answer the following questions:

  • Can it run loaded with a moderate amount of data in a Tutor environment for development and testing?

    • Yes, I was able to run Tutor nightly dev with my ClickHouse plugin and Superset (run outside of Tutor) on an M1 MacBook Pro under Docker Desktop with 4 CPUs and 10GB of memory assigned to Docker. That’s a lot, but ClickHouse was far from the biggest resource consumer.

  • How difficult is it to code for in Python?

    • It’s a fairly normal database API. I ran into some issues with parameterized queries on the bulk inserts, but I didn’t put much time into trying to figure them out; everything else worked fine. (A sketch of the batch insert pattern appears under Testing Methodology below.)

  • How complicated is it to stand up and configure a single node instance and connect locally?

    • Using the Docker image, the server was running in 5 minutes and I was able to connect to a local shell within 10 using the binary client.

    • Initially I ran into some Mac-specific issues that caused the server to spam warnings while inserting data, which required me to override some settings; however, a new version was pushed that fixed the issue by the time I ran under Tutor.

    • The client and local tools are quite good and have some really unique features for things like ETL and bulk loading.

  • How complicated is it to set up as a Tutor plugin?

  • How hard is it to optimize for Insights-replacement use cases?

    • I took 2 passes at optimization, one with a 100M row data set and one with 1B rows. In both cases I focused on table design (breaking different keys out of the JSON), the primary key, and the ORDER BY. The changes were fairly straightforward and had a huge impact, dropping times on non-optimized queries from over 30s to 10-12s. Further work can be done here once a diverse set of use cases has been ironed out, but I believe it can be tuned at this level for even medium-large installations.

    • For very large installs there is undoubtedly a lot of tuning required and options like clustering or cloud hosting are available, but that was outside the scope of this test.

    • Notably, the primary key and ORDER BY of a table cannot be changed after it is created. A new table needs to be created and the old data loaded into it, so it is important to get this right the first time! (A simplified sketch of the layout used appears in the Testing Methodology section below.)

    • Open question: do we want “one big table”, or does it make more sense to group statements by a secondary key (problem events, video events, navigation events, etc.)? The latter makes performance tuning easier at the cost of complicating data transformation and loading configuration.

  • How hard is it to integrate with Ralph?

    • I did not do a test implementation, but based on the Elasticsearch and MongoDB Ralph backends I didn’t see anything concerning.

  • How hard is it to integrate with Superset?

    • It was trivially easy to access data; however, Superset’s security model is more of a challenge (see ).

  • How painful is it to operate on JSON fields?

    • About the same as with Redshift / Postgres / Snowflake, which is to say pretty painful but doable (see the sketch just after this list).
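
To give a feel for the JSON question above, pulling values out of the stored statements relies on functions like JSONExtractString. A sketch (the raw-JSON column name and the paths shown are assumptions for illustration):

    # Sketch: extracting values from the stored xAPI JSON.
    # The event_str column name and the JSON paths are assumptions.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    rows = client.execute(
        """
        SELECT
            JSONExtractString(event_str, 'actor', 'account', 'name') AS actor_name,
            JSONExtractString(event_str, 'verb', 'id')                AS verb_id,
            JSONExtractString(event_str, 'object', 'id')              AS object_id
        FROM xapi.xpi_events_all
        LIMIT 10
        """
    )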

Testing Methodology

I’m going to focus on the 1B row test here as previous rounds had some material issues.

ClickHouse Server

  • Version 22.10.2.11 from Docker image clickhouse/clickhouse-server:latest (9a075a7adb7c)

  • Run under Tutor using my plugin with default configuration

  • Only the ClickHouse plugin was running, so it had full access to Docker’s configured 4 CPUs and 10GB of RAM

  • I ran with two non-default options on my client:

    • date_time_input_format="best_effort": allows RFC dates to be ingested

    • old_parts_lifetime=10: the number of seconds that old part files hang around. I reduced it to save disk space because I knew the big run would come close to filling my disk.

xapi.xpi_events_all Table

  • One table was used to collect all events, with a buffer table created to cache inserts for efficiency as recommended by ClickHouse

  • Some top level event data (actor id, problem id, video id, etc) were extracted, but not indexed on. This didn’t seem to make a huge performance difference.

  • course_run_id, verb, and emission_time were used for primary key and/or table sorting

  • The full event xAPI JSON was stored as well

  • Table DDLs are here:
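
Those DDLs are not reproduced here, but a simplified sketch of the kind of layout described above (extracted top-level columns plus the raw JSON, sorted by course_run_id, verb, and emission_time) might look like the following; the exact column list and types are assumptions, not the DDL used in the test:

    # Sketch: a simplified MergeTree layout for the events table.
    # Column names and types are illustrative; the real DDLs are linked above.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    client.execute("CREATE DATABASE IF NOT EXISTS xapi")

    client.execute(
        """
        CREATE TABLE IF NOT EXISTS xapi.xpi_events_all
        (
            event_id            UUID,
            verb                String,
            actor_id            String,
            org                 String,
            course_run_id       String,
            problem_id          String,
            video_id            String,
            nav_starting_point  String,
            nav_ending_point    String,
            emission_time       DateTime64(6),
            event_str           String  -- the full xAPI statement as JSON text
        )
        ENGINE = MergeTree()
        -- The ORDER BY / PRIMARY KEY cannot be changed later, so this choice matters.
        ORDER BY (course_run_id, verb, emission_time)
        PRIMARY KEY (course_run_id, verb)
        """
    )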

Events

  • Courses, actors, problems, sequences, and videos were randomly generated in such a way that they could be reused, with the same courses / actors / etc. used multiple times, creating a fairly even distribution of events across courses, users, etc.

  • Events were distributed over a span of the last 5 years (2017/11 - 2022/11)

  • Orgs were selected randomly from a fixed list of 6

  • Events were created randomly according to the distribution seen on edx.org over a 2 month period, but included all event types available in event-routing-backends at the time. The distribution used can be seen here.

  • Event JSON was built by string manipulation, injecting the randomized data into templates extracted from example test fixtures in event-routing-backends.

  • Events were generated and inserted in batches of 10k (see the sketch below)
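
The batch insert pattern itself is straightforward; a hedged sketch with the clickhouse-driver package follows. The column list is simplified and get_random_event() is a hypothetical stand-in for the template-based generator in xapi-db-load, not its real API.

    # Sketch: inserting events in batches of 10k through the buffer table.
    # get_random_event() is a hypothetical stand-in for the real generator.
    import datetime
    import json
    import uuid

    from clickhouse_driver import Client

    client = Client(host="localhost", settings={"date_time_input_format": "best_effort"})

    BATCH_SIZE = 10_000


    def get_random_event():
        # Hypothetical simplified event; the real script renders full xAPI JSON
        # from event-routing-backends fixtures.
        event_id = str(uuid.uuid4())
        verb = "http://adlnet.gov/expapi/verbs/registered"
        course = "http://localhost:18000/course/course-v1:openedX+DemoX+demo"
        emitted = datetime.datetime.now(datetime.timezone.utc)
        event_str = json.dumps({"id": event_id, "verb": {"id": verb}})
        return (event_id, verb, course, emitted, event_str)


    def insert_batch(events):
        # clickhouse-driver sends the list of tuples as one bulk insert;
        # the column order must match the tuples.
        client.execute(
            "INSERT INTO xapi.xpi_events_all_buf "
            "(event_id, verb, course_run_id, emission_time, event_str) VALUES",
            events,
        )


    insert_batch([get_random_event() for _ in range(BATCH_SIZE)])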

Insert run

  • Ran the load generation script with --num_batches 100000 and --drop_tables_first true

  • Runtime was 17 hrs 57 mins; final size on disk was 289.35 GB

  • Timings were collected every 1M rows inserted; given the scale, inserts did not slow down terribly

    • Early rounds of 1M took 1-4 secs

    • Final rounds took 3-5 secs

  • Some queries were also run between every 1M rows

    • Optimized queries that used course_run_id or verb consistently completed in 0.1 secs or less up to 900M rows (the two query shapes are sketched after this list)

    • Unoptimized queries that used org or actor had execution times grow, but as long as verb was also used they leveled off around 10-12 secs up to 900M rows

    • Unoptimized queries that used none of the indexed columns timed out at 30s, which is to be expected at this size
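
To make “optimized” vs “unoptimized” concrete: the fast queries filter on a prefix of the sort key (course_run_id, verb), while the slow ones filter on columns outside it. A rough sketch of the two shapes (not the exact test queries), followed by the raw output from the end of the 1B row run:

    # Sketch: the shape of an "optimized" vs an "unoptimized" query.
    # Column names follow the table description above; values are placeholders.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Optimized: filters on the (course_run_id, verb, ...) sort key, so only a
    # small slice of the table is read (~0.1s at 1B rows in this test).
    client.execute(
        """
        SELECT count()
        FROM xapi.xpi_events_all_buf
        WHERE course_run_id = %(course)s
          AND verb = 'http://adlnet.gov/expapi/verbs/registered'
        """,
        {"course": "http://localhost:18000/course/course-v1:openedX+DemoX+demo"},
    )

    # Unoptimized: actor_id is not part of the sort key, so most of the table
    # is scanned (~10-12s at 1B rows when verb is also present).
    client.execute(
        """
        SELECT count()
        FROM xapi.xpi_events_all_buf
        WHERE actor_id = %(actor)s
          AND verb = 'http://adlnet.gov/expapi/verbs/registered'
        """,
        {"actor": "00000000-0000-0000-0000-000000000000"},
    )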

Done! Added 1000000000 rows! [('UTC', datetime.datetime(2022, 11, 23, 12, 29, 25))]
Buffer table row count: [(1000000000,)]
Hard table row count: [(999590000,)]

Count of enrollment events for course http://localhost:18000/course/course-v1:chipX+DemoX+7e55ce84-b1fc-429e-ba81-e6da0de42a8d
{'read_rows': '229376', 'read_bytes': '33605593', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '229376', 'result_rows': '1', 'result_bytes': '256'}
[(70,)]
Completed in: 0.090815
=================================
Count of total enrollment events for org openedX
{'read_rows': '976654304', 'read_bytes': '22857207677', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '976654304', 'result_rows': '1', 'result_bytes': '256'}
[(2018628,)]
Completed in: 3.953353
=================================
Count of enrollments for this learner
{'read_rows': '976794304', 'read_bytes': '63556376368', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '976794304', 'result_rows': '1', 'result_bytes': '256'}
[(2,)]
Completed in: 12.779382
=================================
Count of enrollments for this course - count of unenrollments, last 30 days
{'read_rows': '425984', 'read_bytes': '64114078', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '425984', 'result_rows': '1', 'result_bytes': '1048832'}
[(2, 0, 2)]
Completed in: 0.136299
=================================
Count of enrollments for this course - count of unenrollments, all time
{'read_rows': '212992', 'read_bytes': '31205071', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '212992', 'result_rows': '1', 'result_bytes': '1048832'}
[(70, 0, 70)]
Completed in: 0.023204
=================================
Count of enrollments for all courses - count of unenrollments, last 5 minutes
{'read_rows': '1954973056', 'read_bytes': '104370532775', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '1954973056', 'result_rows': '1', 'result_bytes': '1048832'}
[(0, 0, 0)]
Completed in: 10.906595
=================================

Distributions

Counts of all events per org

org       cnt
openedX   169.63 million
tacoX     169.26 million
chipX     166.65 million
burritoX  165.96 million
salsaX    165.89 million
guacX     162.60 million

Counts of all events from the 20 largest courses

course_run_id  cnt
http://localhost:18000/course/course-v1:openedX+DemoX+1de81ce4-5c94-4f00-96e4-8f01b07f61af  999.00
http://localhost:18000/course/course-v1:chipX+DemoX+4d88b8d9-cc18-45a8-a0ad-0e6c7743577c  999.00
http://localhost:18000/course/course-v1:openedX+DemoX+d7c56e89-ded1-4ba7-a76f-8d0c3feeacba  999.00
http://localhost:18000/course/course-v1:tacoX+DemoX+00561f5c-edec-466e-a85e-fb19a804c8c2  999.00
http://localhost:18000/course/course-v1:salsaX+DemoX+f77613de-614d-451f-b03e-65eb225d4aa3  999.00
http://localhost:18000/course/course-v1:burritoX+DemoX+8cd4887a-3e7a-45ca-b3b9-fe83b670f10a  999.00
http://localhost:18000/course/course-v1:guacX+DemoX+a40b6284-26ef-4dea-a5dd-41eb23fe7306  999.00
http://localhost:18000/course/course-v1:tacoX+DemoX+39ce0128-32fc-4e1b-850c-f76c4b249292  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+9948c6b1-9670-4718-a6d5-3cea3940e386  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+774a8d12-da7b-45d1-9d8a-d8f8b3007c3d  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+86fc8bfc-283c-4ae3-8f2f-edf2470a91b3  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+89fce415-e0ec-49c6-a329-d0be04b7677f  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+bac76677-e3ae-4145-97c8-4dbfdd8d1914  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+a98e3487-cad5-45e2-a937-83d24f220b05  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+0c4d6f2e-c2b7-4279-9f48-130c7c948502  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+a78d871b-a9d0-4c17-86b6-d66ce88adbb7  998.00
http://localhost:18000/course/course-v1:chipX+DemoX+fa565041-65f0-4517-a546-ab7c7c4071fb  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+32d03cda-0a09-498d-8a8e-74770cbef2dc  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+1be9caa0-a579-454d-9962-a1e0fbb9a888  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+cc9d53c9-93eb-41eb-85fb-44709ef36c6a  998.00

Counts of each verb across the whole data set

(lines up with our expected distribution)

 

Count of total unique actors

This was capped by the script, as desired.

uniqExact(actor_id): 1000000

Count of distinct course runs 

uniqExact(course_run_id): 100271

Count of distinct problem ids

This is a bug in the loading script! There are 182.39 million rows with problem ids, but only 205 unique problem ids, so they’re being reused too much.

uniqExact(problem_id): 205

Count of distinct video ids

This timed out; video events are a huge percentage of overall events, so there is a disproportionately large number of video ids generated, something like 3,600 per course. This is a bug in the script that should be fixed going forward.

 

Count of unique navigation starting and ending points

uniqExact(nav_starting_point): 98
uniqExact(nav_ending_point): 100

Count of all events per month

Events were generated from 2017/11/22 - 2022/11/23, which is why the first and last entries are smaller.

year  month  formatReadableQuantity(count())
2017  11     3.77 million
2017  12     16.98 million
2018  1      16.99 million
2018  2      15.35 million
2018  3      17.00 million
2018  4      16.44 million
2018  5      16.99 million
2018  6      16.43 million
2018  7      16.99 million
2018  8      16.98 million
2018  9      16.43 million
2018  10     16.99 million
2018  11     16.44 million
2018  12     16.99 million
2019  1      16.99 million
2019  2      15.34 million
2019  3      16.99 million
2019  4      16.43 million
2019  5      16.98 million
2019  6      16.44 million
2019  7      16.98 million
2019  8      16.99 million
2019  9      16.44 million
2019  10     16.98 million
2019  11     16.44 million
2019  12     16.98 million
2020  1      16.99 million
2020  2      15.89 million
2020  3      16.98 million
2020  4      16.44 million
2020  5      16.98 million
2020  6      16.44 million
2020  7      16.98 million
2020  8      16.98 million
2020  9      16.44 million
2020  10     16.98 million
2020  11     16.44 million
2020  12     16.99 million
2021  1      16.98 million
2021  2      15.35 million
2021  3      16.98 million
2021  4      16.43 million
2021  5      16.98 million
2021  6      16.44 million
2021  7      16.99 million
2021  8      16.98 million
2021  9      16.44 million
2021  10     16.98 million
2021  11     16.43 million
2021  12     16.98 million
2022  1      16.99 million
2022  2      15.34 million
2022  3      16.99 million
2022  4      16.44 million
2022  5      16.98 million
2022  6      16.44 million
2022  7      16.99 million
2022  8      16.99 million
2022  9      16.44 million
2022  10     16.99 million
2022  11     12.12 million