ClickHouse as an LRS Data Store

Overview

As part of the project to replace Insights, we are looking at several different pieces of technology. A central technology is the data store, which has (or lacks) functionality that drives many other adoption decisions.

Ideally a data store for an Open edX LRS would meet the following criteria:

  • Free and open source

  • Lightweight enough to be reasonably run in Tutor

  • Scales down (can be run on commodity hardware for small installs)

  • Scales up (can handle billions of rows affordably)

  • Can be deployed in a fault tolerant way

  • Has a cloud offering for medium-large installs that don’t want to run their own

There are a few potential options that may check all of those boxes, but one that came recommended by a few people in the community was ClickHouse. This issue was created to get hands-on experience with it and an understanding of whether it can work for us.

Summary

The good…

ClickHouse is a very capable database that performed extremely well at a billion rows under Docker on my M1 Mac. It was easy to run and configure as a Tutor plugin. Using the default configuration it took up little CPU and ~500 MB of RAM, even fully loaded. That’s a fair amount of memory, but not out of line with other services we run in Tutor.

I was able to load the billion rows from a single-threaded Python script in just under 18 hrs. Under load, queries the table had been optimized for (e.g. “Count of enrollment events for a course”) completed in 0.09 - 0.13 seconds while loading the last 100 million rows. Un-optimized queries (“Count of enrollments for this learner”) completed in 10 - 12 secs. Details on this test are below.

I very much believe that ClickHouse could support even our largest install bases if configured correctly.

Native tools worked well and were easy to set up, as was PyCharm / DataGrip integration.

There is native MySQL connectivity to import data, which could be helpful for synchronizing permissions or other data from our various MySQL sources.

There is native support for loading data from S3 and other stores.
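
As a rough illustration of those two integration points, the sketch below reads from MySQL and S3 using ClickHouse’s mysql() and s3() table functions, driven from Python. The driver choice (clickhouse-driver), hosts, credentials, bucket, and column structure are assumptions for illustration, not the setup used in this test.

    # Sketch: reading MySQL and S3 data directly inside ClickHouse queries.
    # Hosts, credentials, bucket, and column structures are placeholders.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # mysql() reads rows straight out of a MySQL table, e.g. to sync users
    # or permissions from an LMS database.
    users = client.execute(
        """
        SELECT id, username
        FROM mysql('mysql:3306', 'edxapp', 'auth_user', 'read_only_user', 'secret')
        LIMIT 10
        """
    )

    # s3() reads files from object storage; combined with INSERT ... SELECT
    # it can be used to bulk load xAPI statements.
    row_count = client.execute(
        """
        SELECT count()
        FROM s3('https://example-bucket.s3.amazonaws.com/xapi/*.jsonl',
                'JSONEachRow',
                'event_id String, emission_time String, event_str String')
        """
    )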

We are considering Apache Superset as a display layer for this data, and integrating with it was trivial since ClickHouse is natively supported.

The bad…

ClickHouse is under fairly heavy development and seems to be just now reaching maturity as a product in this space. I was able to use windowing functions for displaying time series data in Superset, but they were introduced fairly recently and the documentation around them was conflicting. There were a handful of features/functions like this (JSON column availability stands out).
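
For reference, the shape of a windowing query for time series (e.g. a running total of daily enrollments) looks roughly like the sketch below. The table and column names follow the table description later in this document and the verb value is the standard xAPI “registered” IRI, but this is illustrative rather than the exact query used in Superset.

    # Sketch: a window function producing a running total of daily enrollments.
    # Table/column names follow the table described under Testing Methodology;
    # not the exact query used in Superset.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    rows = client.execute(
        """
        SELECT
            day,
            enrollments,
            sum(enrollments) OVER (ORDER BY day) AS running_total
        FROM
        (
            SELECT toDate(emission_time) AS day, count() AS enrollments
            FROM xapi.xpi_events_all
            WHERE verb = 'http://adlnet.gov/expapi/verbs/registered'
            GROUP BY day
        )
        ORDER BY day
        """
    )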

Some SQL syntax can be confusing: some things are case-sensitive while others aren’t, transactions aren’t supported, and there are other deviations from the ANSI spec (https://clickhouse.com/docs/en/sql-reference/ansi). None of these things prevented me from doing anything I needed to do, but they are important tradeoffs made to get the level of performance we see.

There are a lot of configuration options, some of them very low level. Figuring out reference implementations and reasonable Tutor defaults will require a fair amount of work.

These are not unusual problems in this space, and I didn’t find any of them to be deal breakers.

The ugly…

The only truly unusual thing that I did in my test was the use of a Buffer table to cache data for bulk insert into the final materialized table. This is the recommended way to maintain performance on write-heavy tables and does introduce some complexity.

Namely, data can be delayed in reaching the final table (though queries can be directed at the buffer table, which will return results from both the buffer and the materialized table). Buffers are stored in RAM, and so can be prone to data loss in the event of a crash, though we should be planning to recover from such situations anyway.
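
For context, a Buffer table is just another ClickHouse table whose engine holds incoming rows in memory and flushes them to a destination table once time/row/byte thresholds are hit. A minimal sketch (the threshold values here are illustrative, not the ones used in this test):

    # Sketch: a Buffer table in front of the main events table.
    # Buffer() arguments are: destination database and table, num_layers,
    # then min/max time (seconds), min/max rows, and min/max bytes before a
    # flush. Reads against the buffer table also include rows already flushed
    # to the destination table. Threshold values below are illustrative.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    client.execute(
        """
        CREATE TABLE IF NOT EXISTS xapi.xpi_events_all_buf
        AS xapi.xpi_events_all
        ENGINE = Buffer(xapi, xpi_events_all, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
        """
    )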

Otherwise I didn’t find anything particularly troublesome in ClickHouse and think it can be a very solid candidate for our use case.

Testing and Goals

I ran several increasingly large rounds of testing, using https://github.com/bmtcril/xapi-db-load to generate fake xAPI events and batch load them into ClickHouse in a single thread. The goal wasn’t to test how quickly events could be loaded, but rather to check performance with a moderately large amount of data under pretty strict hardware constraints. In addition to the goals stated in the original GitHub issue, I was hoping to answer the following questions:

  • Can it run loaded with a moderate amount of data in a Tutor environment for development and testing?

    • Yes, I was able to run Tutor nightly dev with my ClickHouse plugin and Superset (run outside of Tutor) on an M1 MacBook Pro under Docker Desktop with 4 CPUs and 10GB of memory assigned to Docker. That’s a lot, but ClickHouse was far from the biggest resource consumer.

  • How difficult is it to code for in Python?

    • It’s a fairly normal database API. I ran into some issues with parameterized queries on the bulk inserts, but I didn’t put much time into trying to figure them out; everything else worked fine. (A sketch of the batch insert pattern appears under Testing Methodology below.)

  • How complicated is it to stand up and configure a single node instance and connect locally?

    • Using the Docker image, the server was running in 5 minutes and I was able to connect to a local shell within 10 using the binary client.

    • Initially I ran into some Mac-specific issues that caused the server to spam warnings while inserting data, which required me to override some settings; however, a new version was pushed that fixed the issue by the time I ran under Tutor.

    • The client and local tools are quite good and have some really unique features for things like ETL and bulk loading.

  • How complicated is it to set up as a Tutor plugin?

  • How hard is it to optimize for Insights-replacement use cases?

    • I took 2 passes at optimization, one with a 100M row data set and one with 1B rows. In both cases I focused on table design (breaking different keys out of the JSON), the primary key, and the ORDER BY. The changes were fairly straightforward and had a huge impact, dropping times on non-optimized queries from over 30s to 10-12s. Further work can be done here once a diverse set of use cases has been ironed out, but I believe it can be tuned at this level for even medium-large installations.

    • For very large installs there is undoubtedly a lot of tuning required and options like clustering or cloud hosting are available, but that was outside the scope of this test.

    • Notably, the primary key and ORDER BY of a table cannot be changed after it is created. A new table needs to be created and the old data loaded into it, so it is important to get this right the first time! (A simplified sketch of the layout used appears in the Testing Methodology section below.)

    • Open question: do we want “one big table”, or does it make more sense to group statements by a secondary key (problem events, video events, navigation events, etc.)? The latter makes performance tuning easier at the cost of complicating data transformation and loading configuration.

  • How hard is it to integrate with Ralph?

    • I did not do a test implementation, but based on the Elasticsearch and MongoDB Ralph backends I didn’t see anything concerning.

  • How hard is it to integrate with Superset?

    • It was trivially easy to access data; however, Superset’s security model is more of a challenge (see ).

  • How painful is it to operate on JSON fields?

    • About the same as with Redshift / Postgres / Snowflake, which is to say pretty painful but doable (see the sketch just after this list).
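
To give a feel for the JSON question above, pulling values out of the stored statements relies on functions like JSONExtractString. A sketch (the raw-JSON column name and the paths shown are assumptions for illustration):

    # Sketch: extracting values from the stored xAPI JSON.
    # The event_str column name and the JSON paths are assumptions.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    rows = client.execute(
        """
        SELECT
            JSONExtractString(event_str, 'actor', 'account', 'name') AS actor_name,
            JSONExtractString(event_str, 'verb', 'id')                AS verb_id,
            JSONExtractString(event_str, 'object', 'id')              AS object_id
        FROM xapi.xpi_events_all
        LIMIT 10
        """
    )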

Testing Methodology

I’m going to focus on the 1B row test here as previous rounds had some material issues.

ClickHouse Server

  • Version 22.10.2.11 from Docker image clickhouse/clickhouse-server:latest (9a075a7adb7c)

  • Run under Tutor using my plugin with default configuration

  • Only the ClickHouse plugin was running, so it had full access to Docker’s configured 4 CPUs and 10GB of RAM

  • I ran with two non-default options on my client:

    • date_time_input_format="best_effort": allows RFC dates to be ingested

    • old_parts_lifetime=10: the number of seconds that old part files hang around. I reduced it to save disk space because I knew the big run would come close to filling my disk.

xapi.xpi_events_all Table

  • One table was used to collect all events, with a buffer table created to cache inserts for efficiency as recommended by ClickHouse

  • Some top level event data (actor id, problem id, video id, etc) were extracted, but not indexed on. This didn’t seem to make a huge performance difference.

  • course_run_id, verb, and emission_time were used for primary key and/or table sorting

  • The full event xAPI JSON was stored as well

  • Table DDLs are here:
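
Those DDLs are not reproduced here, but a simplified sketch of the kind of layout described above (extracted top-level columns plus the raw JSON, sorted by course_run_id, verb, and emission_time) might look like the following; the exact column list and types are assumptions, not the DDL used in the test:

    # Sketch: a simplified MergeTree layout for the events table.
    # Column names and types are illustrative; the real DDLs are linked above.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    client.execute("CREATE DATABASE IF NOT EXISTS xapi")

    client.execute(
        """
        CREATE TABLE IF NOT EXISTS xapi.xpi_events_all
        (
            event_id            UUID,
            verb                String,
            actor_id            String,
            org                 String,
            course_run_id       String,
            problem_id          String,
            video_id            String,
            nav_starting_point  String,
            nav_ending_point    String,
            emission_time       DateTime64(6),
            event_str           String  -- the full xAPI statement as JSON text
        )
        ENGINE = MergeTree()
        -- The ORDER BY / PRIMARY KEY cannot be changed later, so this choice matters.
        ORDER BY (course_run_id, verb, emission_time)
        PRIMARY KEY (course_run_id, verb)
        """
    )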

Events

  • Courses, actors, problems, sequences, and videos were randomly generated in such a way that they could be reused, with the same courses / actors / etc. used multiple times, creating a fairly even distribution of events across courses, users, etc.

  • Events were distributed over a span of the last 5 years (2017/11 - 2022/11)

  • Orgs were selected randomly from a fixed list of 6

  • Events were created randomly according to the distribution seen on edx.org over a 2 month period, but included all event types available in event-routing-backends at the time. The distribution used can be seen here.

  • Event JSON was built by string manipulation, injecting the randomized data into templates extracted from example test fixtures in event-routing-backends.

  • Events were generated and inserted in batches of 10k (see the sketch below)
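
The batch insert pattern itself is straightforward; a hedged sketch with the clickhouse-driver package follows. The column list is simplified and get_random_event() is a hypothetical stand-in for the template-based generator in xapi-db-load, not its real API.

    # Sketch: inserting events in batches of 10k through the buffer table.
    # get_random_event() is a hypothetical stand-in for the real generator.
    import datetime
    import json
    import uuid

    from clickhouse_driver import Client

    client = Client(host="localhost", settings={"date_time_input_format": "best_effort"})

    BATCH_SIZE = 10_000


    def get_random_event():
        # Hypothetical simplified event; the real script renders full xAPI JSON
        # from event-routing-backends fixtures.
        event_id = str(uuid.uuid4())
        verb = "http://adlnet.gov/expapi/verbs/registered"
        course = "http://localhost:18000/course/course-v1:openedX+DemoX+demo"
        emitted = datetime.datetime.now(datetime.timezone.utc)
        event_str = json.dumps({"id": event_id, "verb": {"id": verb}})
        return (event_id, verb, course, emitted, event_str)


    def insert_batch(events):
        # clickhouse-driver sends the list of tuples as one bulk insert;
        # the column order must match the tuples.
        client.execute(
            "INSERT INTO xapi.xpi_events_all_buf "
            "(event_id, verb, course_run_id, emission_time, event_str) VALUES",
            events,
        )


    insert_batch([get_random_event() for _ in range(BATCH_SIZE)])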

Insert run

  • Ran the load generation script with --num_batches 100000 and --drop_tables_first true

  • Runtime was 17 hrs 57 mins; final size on disk was 289.35 GB

  • Timings were collected every 1M rows inserted; given the scale, inserts did not slow down terribly

    • Early rounds of 1M took 1-4 secs

    • Final rounds took 3-5 secs

  • Some queries were also run between every 1M rows

    • Optimized queries that used course_run_id or verb consistently completed in 0.1 secs or less up to 900M rows (the two query shapes are sketched after this list)

    • Unoptimized queries that used org or actor had execution times grow, but as long as verb was also used they leveled off around 10-12 secs up to 900M rows

    • Unoptimized queries that used none of the indexed columns timed out at 30s, which is to be expected at this size
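
To make “optimized” vs “unoptimized” concrete: the fast queries filter on a prefix of the sort key (course_run_id, verb), while the slow ones filter on columns outside it. A rough sketch of the two shapes (not the exact test queries), followed by the raw output from the end of the 1B row run:

    # Sketch: the shape of an "optimized" vs an "unoptimized" query.
    # Column names follow the table description above; values are placeholders.
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Optimized: filters on the (course_run_id, verb, ...) sort key, so only a
    # small slice of the table is read (~0.1s at 1B rows in this test).
    client.execute(
        """
        SELECT count()
        FROM xapi.xpi_events_all_buf
        WHERE course_run_id = %(course)s
          AND verb = 'http://adlnet.gov/expapi/verbs/registered'
        """,
        {"course": "http://localhost:18000/course/course-v1:openedX+DemoX+demo"},
    )

    # Unoptimized: actor_id is not part of the sort key, so most of the table
    # is scanned (~10-12s at 1B rows when verb is also present).
    client.execute(
        """
        SELECT count()
        FROM xapi.xpi_events_all_buf
        WHERE actor_id = %(actor)s
          AND verb = 'http://adlnet.gov/expapi/verbs/registered'
        """,
        {"actor": "00000000-0000-0000-0000-000000000000"},
    )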

Done! Added 1000000000 rows! [('UTC', datetime.datetime(2022, 11, 23, 12, 29, 25))]
Buffer table row count: [(1000000000,)]
Hard table row count: [(999590000,)]

Count of enrollment events for course http://localhost:18000/course/course-v1:chipX+DemoX+7e55ce84-b1fc-429e-ba81-e6da0de42a8d
{'read_rows': '229376', 'read_bytes': '33605593', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '229376', 'result_rows': '1', 'result_bytes': '256'}
[(70,)]
Completed in: 0.090815
=================================
Count of total enrollment events for org openedX
{'read_rows': '976654304', 'read_bytes': '22857207677', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '976654304', 'result_rows': '1', 'result_bytes': '256'}
[(2018628,)]
Completed in: 3.953353
=================================
Count of enrollments for this learner
{'read_rows': '976794304', 'read_bytes': '63556376368', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '976794304', 'result_rows': '1', 'result_bytes': '256'}
[(2,)]
Completed in: 12.779382
=================================
Count of enrollments for this course - count of unenrollments, last 30 days
{'read_rows': '425984', 'read_bytes': '64114078', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '425984', 'result_rows': '1', 'result_bytes': '1048832'}
[(2, 0, 2)]
Completed in: 0.136299
=================================
Count of enrollments for this course - count of unenrollments, all time
{'read_rows': '212992', 'read_bytes': '31205071', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '212992', 'result_rows': '1', 'result_bytes': '1048832'}
[(70, 0, 70)]
Completed in: 0.023204
=================================
Count of enrollments for all courses - count of unenrollments, last 5 minutes
{'read_rows': '1954973056', 'read_bytes': '104370532775', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '1954973056', 'result_rows': '1', 'result_bytes': '1048832'}
[(0, 0, 0)]
Completed in: 10.906595
=================================

Distributions

Counts of all events per org

org       cnt
openedX   169.63 million
tacoX     169.26 million
chipX     166.65 million
burritoX  165.96 million
salsaX    165.89 million
guacX     162.60 million

Counts of all events from the 20 largest courses

course_run_id  cnt
http://localhost:18000/course/course-v1:openedX+DemoX+1de81ce4-5c94-4f00-96e4-8f01b07f61af  999.00
http://localhost:18000/course/course-v1:chipX+DemoX+4d88b8d9-cc18-45a8-a0ad-0e6c7743577c  999.00
http://localhost:18000/course/course-v1:openedX+DemoX+d7c56e89-ded1-4ba7-a76f-8d0c3feeacba  999.00
http://localhost:18000/course/course-v1:tacoX+DemoX+00561f5c-edec-466e-a85e-fb19a804c8c2  999.00
http://localhost:18000/course/course-v1:salsaX+DemoX+f77613de-614d-451f-b03e-65eb225d4aa3  999.00
http://localhost:18000/course/course-v1:burritoX+DemoX+8cd4887a-3e7a-45ca-b3b9-fe83b670f10a  999.00
http://localhost:18000/course/course-v1:guacX+DemoX+a40b6284-26ef-4dea-a5dd-41eb23fe7306  999.00
http://localhost:18000/course/course-v1:tacoX+DemoX+39ce0128-32fc-4e1b-850c-f76c4b249292  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+9948c6b1-9670-4718-a6d5-3cea3940e386  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+774a8d12-da7b-45d1-9d8a-d8f8b3007c3d  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+86fc8bfc-283c-4ae3-8f2f-edf2470a91b3  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+89fce415-e0ec-49c6-a329-d0be04b7677f  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+bac76677-e3ae-4145-97c8-4dbfdd8d1914  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+a98e3487-cad5-45e2-a937-83d24f220b05  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+0c4d6f2e-c2b7-4279-9f48-130c7c948502  998.00
http://localhost:18000/course/course-v1:tacoX+DemoX+a78d871b-a9d0-4c17-86b6-d66ce88adbb7  998.00
http://localhost:18000/course/course-v1:chipX+DemoX+fa565041-65f0-4517-a546-ab7c7c4071fb  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+32d03cda-0a09-498d-8a8e-74770cbef2dc  998.00
http://localhost:18000/course/course-v1:salsaX+DemoX+1be9caa0-a579-454d-9962-a1e0fbb9a888  998.00
http://localhost:18000/course/course-v1:guacX+DemoX+cc9d53c9-93eb-41eb-85fb-44709ef36c6a  998.00

Counts of each verb across the whole data set

(lines up with our expected distribution)

 

Count of total unique actors

This was capped by the script, as desired.

uniqExact(actor_id): 1000000

Count of distinct course runs 

uniqExact(course_run_id): 100271

Count of distinct problem ids

This is a bug in the loading script! There are 182.39 million rows with problem ids, but only 205 unique problem ids, so they’re being reused too much.

uniqExact(problem_id): 205

Count of distinct video ids

This timed out; video events are a huge percentage of overall events, so there is a disproportionately large number of video ids generated, something like 3,600 per course. This is a bug in the script that should be fixed going forward.

 

Count of unique navigation starting and ending points

uniqExact(nav_starting_point): 98
uniqExact(nav_ending_point): 100

Count of all events per month

Events were generated from 2017/11/22 - 2022/11/23, which is why the first and last entries are smaller.

year  month  formatReadableQuantity(count())
2017  11     3.77 million
2017  12     16.98 million
2018  1      16.99 million
2018  2      15.35 million
2018  3      17.00 million
2018  4      16.44 million
2018  5      16.99 million
2018  6      16.43 million
2018  7      16.99 million
2018  8      16.98 million
2018  9      16.43 million
2018  10     16.99 million
2018  11     16.44 million
2018  12     16.99 million
2019  1      16.99 million
2019  2      15.34 million
2019  3      16.99 million
2019  4      16.43 million
2019  5      16.98 million
2019  6      16.44 million
2019  7      16.98 million
2019  8      16.99 million
2019  9      16.44 million
2019  10     16.98 million
2019  11     16.44 million
2019  12     16.98 million
2020  1      16.99 million
2020  2      15.89 million
2020  3      16.98 million
2020  4      16.44 million
2020  5      16.98 million
2020  6      16.44 million
2020  7      16.98 million
2020  8      16.98 million
2020  9      16.44 million
2020  10     16.98 million
2020  11     16.44 million
2020  12     16.99 million
2021  1      16.98 million
2021  2      15.35 million
2021  3      16.98 million
2021  4      16.43 million
2021  5      16.98 million
2021  6      16.44 million
2021  7      16.99 million
2021  8      16.98 million
2021  9      16.44 million
2021  10     16.98 million
2021  11     16.43 million
2021  12     16.98 million
2022  1      16.99 million
2022  2      15.34 million
2022  3      16.99 million
2022  4      16.44 million
2022  5      16.98 million
2022  6      16.44 million
2022  7      16.99 million
2022  8      16.99 million
2022  9      16.44 million
2022  10     16.99 million
2022  11     12.12 million