Open edX Analytics Ecosystem

This is a “working document” page to capture, comment on, and build out ideas for an Open edX analytics ecosystem that can serve the range of deployed Open edX instances, from very small to medium-large scale.

Overview

placeholder

Original Post

This section contains @Brian Mesick’s comment from the Open edX Slack #wg-data channel: https://openedx.slack.com/archives/C023AGJJWLV/p1667417744450929

[Brian Mesick] Per our discussion in the WG today, here’s my current tech stack proposal for a “supported out of the box” Open edX analytics ecosystem that will work for very small to medium-large instances. This should be considered a rough draft of my current thoughts as we work through the open questions and new suggestions come in.

As always, feedback and questions are welcome! I’d like to continue doing a few more spikes to find the limitations of these systems and gain confidence that they’ll all play nice together and with our platform. I hope to turn this into a forum post and series of OEPs / ADRs and then tickets once the first order investigations bear fruit.

I’ve frequently heard these needs from the community and I’m using them as benchmarks for technology suggestions:

  1. Formalized and well-versioned data structures

  2. Near real-time delivery and reporting

  3. High fault tolerance / robustness of delivery / disaster recovery

  4. Ease of setup / low maintenance / can be bundled with Tutor (implies open source, which we favor anyway)

  5. Scales down / low cost to run

  6. Scales up / can support medium-high volume instances (but explicitly we are not solving for extremely high volume instances)

  7. Secure in a multi-tenant environment / supports course-specific and org-specific permissions on reports

  8. Professional looking, customizable, and extensible reporting visualizations

  9. Long term we want to support adaptive learning and other data driven educational technologies

I break these up into the following technical areas:

  • Data structure

  • Delivery

  • Storage

  • Display

An area I hope we can avoid including is separate tooling for transformation and aggregation, which is where the bulk of the complexity and cost of Insights is concentrated. With a modern backing database and display engine I think this is achievable.

On to the suggestions!

Data structure - xAPI via event-routing-backends

Pros: We have an excellent start on some of these with event-routing-backends, which formalize some of our native events into xAPI and Caliper standards. There have been several specific asks for xAPI (none so far for Caliper that I’ve heard). It currently allows near real-time delivery via synchronous API calls or Celery task to an LRS. We are already working on expanding the xAPI adoption and event transformations, and I think that’s the right direction to continue.

Cons: Performance concerns about Celery (the next section addresses this); xAPI is complicated to work with; custom events for an Open edX instance would currently require forking event-routing-backends; and tracking log versioning is still being defined and may be difficult for large installs to adapt to as we flesh out the xAPI needs.

Alternatives: Continue using xAPI but do the tracking log -> xAPI transformation at the LRS layer (or in a separate, more configurable, package), which would allow event flexibility and customization.
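
To make the xAPI direction more concrete, here is a rough, hand-written sketch (in Python) of the kind of mapping event-routing-backends performs: a simplified problem_check tracking-log event turned into a statement with the actor/verb/object shape defined by the xAPI spec. The incoming field names and the specific verb/activity IRIs are illustrative assumptions; the real transformers in event-routing-backends make their own choices here.

    # Illustrative sketch only: a hand-built mapping from a simplified
    # problem_check tracking-log event to an xAPI statement. The real
    # transformation lives in event-routing-backends, and its verb and
    # activity choices may differ from what is shown here.
    from datetime import datetime, timezone
    from uuid import uuid4

    def problem_check_to_xapi(event: dict, lms_root: str) -> dict:
        """Map a (simplified, hypothetical) problem_check event to xAPI."""
        return {
            "id": str(uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": {
                # An anonymized actor id keeps PII out of the LRS.
                "objectType": "Agent",
                "account": {"homePage": lms_root, "name": event["anonymous_user_id"]},
            },
            "verb": {
                "id": "http://adlnet.gov/expapi/verbs/answered",
                "display": {"en": "answered"},
            },
            "object": {
                "objectType": "Activity",
                "id": f"{lms_root}/xblock/{event['problem_id']}",
                "definition": {"type": "http://adlnet.gov/expapi/activities/question"},
            },
            "result": {
                "success": event["success"] == "correct",
                "score": {"raw": event["grade"], "max": event["max_grade"]},
            },
            "context": {
                "contextActivities": {
                    "parent": [{"id": f"{lms_root}/courses/{event['course_id']}"}]
                }
            },
        }

Even a small sketch like this shows why the versioning concern matters: every downstream report depends on these field and IRI choices staying stable.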

Delivery - event-routing-backends via event bus (including new redis streams implementation)

Pros: edx.org has put a lot of work into making a usable event bus framework that can help solve many timeliness and robustness issues. In my opinion their Kafka solution is too much of an infrastructure lift for most operators, so it would be necessary to implement a lighter-weight event bus. Redis streams seem like a good candidate for this work, as they would mean no new technology in Tutor for development or production (though the increase in redis usage would likely require operators to make adjustments). The ability to scale up, together with delivery guarantees, would be a huge boon to making real-ish-time events reliable enough for analytics and operational use cases to rely on.

Cons: Creating a second message bus implementation is a very big lift; this is likely something tCRIL would need to fund an external team to deliver. Kafka and redis streams are significantly different in some important ways, so there would need to be some careful negotiation of assumptions in producer and consumer code. There would also be additional new layers added to the (already complicated) event processing code in order to use the Django Signals based event bus as defined in OEP-41 and OEP-52.

Alternatives: Load test Celery and see if we can make it work in a robust enough configuration to meet our needs. Use a synchronous log handler to hand native event logs to the LRS and do the transformations there (see the next item).
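
To illustrate the delivery guarantees that make redis streams attractive here, below is a minimal sketch using the redis-py package directly. The stream and group names are invented for illustration, and a real implementation would sit behind the OEP-52 event-bus abstraction rather than calling redis like this.

    # Minimal sketch of the redis streams pattern, using redis-py directly.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)
    STREAM, GROUP = "openedx.analytics.xapi", "lrs-forwarders"  # invented names

    def send_to_lrs(statement: dict) -> None:
        """Placeholder for the real LRS call (e.g. an HTTP POST to Ralph)."""
        print("forwarding", statement.get("id"))

    # Producer side: append each transformed statement to the stream.
    def publish(statement: dict) -> None:
        r.xadd(STREAM, {"statement": json.dumps(statement)})

    # Consumer side: a worker in a consumer group reads, forwards to the
    # LRS, and acknowledges. Entries that are never acknowledged stay
    # pending and can be re-claimed after a crash, which is the kind of
    # robustness that is hard to get from fire-and-forget Celery tasks.
    def consume(worker_name: str) -> None:
        try:
            r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
        except redis.ResponseError:
            pass  # the group already exists
        while True:
            entries = r.xreadgroup(GROUP, worker_name, {STREAM: ">"}, count=100, block=5000)
            for _stream, messages in entries:
                for message_id, fields in messages:
                    statement = json.loads(fields[b"statement"])
                    send_to_lrs(statement)
                    r.xack(STREAM, GROUP, message_id)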

Storage - Ralph (with a new Clickhouse storage backend)

Pros: OpenFUN has an Open edX oriented LRS in Ralph, which is both xAPI and tracking log event compatible, and currently uses MongoDB and ElasticSearch as backends to store the event data. Ralph includes the vital capability to replay old event log data into xAPI statements to backfill a new analytics database or recover from a delivery failure. It could enable some of the “alternative” solutions above, and is probably the only LRS that could do so. It’s being actively developed by community members! Clickhouse is open source, JSON friendly, fairly lightweight, has row level permissions, and seems like it would be relatively simple to add as a Tutor plugin. It is highly performant on exactly the kinds of workloads we expect and can scale up to some really impressive sizes.

Cons: Ralph is still under heavy development and according to the docs doesn’t yet fully implement the ADL LRS spec (but does what I think we would need for this project). ElasticSearch seems to be difficult to index for these kinds of workloads, and as a whole Open edX is moving away from MongoDB, so my recommendation is that we (as a community) build Clickhouse support into Ralph, if OpenFUN is willing to take those changes. I’m not sure how large a project that would be, but the ES and Mongo implementations seem to be under 300 lines of code.

Alternatives: Clickhouse is the tech I’m least knowledgeable about on this list, and there are numerous other analytics databases out there right now. The only other one I’m aware of that checks all of the boxes for us is Citus Postgres, which seems to have a cloud offering only in Azure; Clickhouse is only in beta on AWS, but seems to be moving toward other providers as well. I have not seen any other LRS that comes close to the capabilities we see in Ralph. Ralph also provides xAPI transformations from tracking log events, which is an overall alternative to using event-routing-backends to do that work. Those translations are separate from what we have in event-routing-backends, and it would be nice to find a solution that would help us consolidate that work while still allowing operators the flexibility to add or modify xAPI translations.
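
Since Clickhouse is the least-proven piece for me, here is a rough sketch of what landing xAPI statements in it could look like from Python, using the clickhouse-driver package. The table layout, columns, and sort key are guesses for illustration only, not an existing Ralph schema.

    # Illustrative sketch only: storing xAPI statements in Clickhouse via
    # the clickhouse-driver package. The schema below is an assumption.
    import json
    import uuid
    from datetime import datetime
    from clickhouse_driver import Client

    client = Client(host="localhost")

    client.execute("""
        CREATE TABLE IF NOT EXISTS xapi_events (
            event_id       UUID,
            emission_time  DateTime64(6),
            actor_id       String,
            verb           String,
            object_id      String,
            course_id      String,
            statement      String       -- full statement kept as raw JSON
        )
        ENGINE = MergeTree
        ORDER BY (course_id, verb, emission_time)
    """)

    def insert_statement(stmt: dict, course_id: str) -> None:
        client.execute(
            "INSERT INTO xapi_events "
            "(event_id, emission_time, actor_id, verb, object_id, course_id, statement) VALUES",
            [(
                uuid.UUID(stmt["id"]),
                datetime.fromisoformat(stmt["timestamp"]),
                stmt["actor"]["account"]["name"],
                stmt["verb"]["id"],
                stmt["object"]["id"],
                course_id,
                json.dumps(stmt),
            )],
        )

    # Aggregations can then happen at query time, which is what would let
    # us skip a separate transformation/aggregation pipeline:
    rows = client.execute(
        "SELECT course_id, count() FROM xapi_events "
        "WHERE verb = 'http://adlnet.gov/expapi/verbs/answered' GROUP BY course_id"
    )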

Display - Apache Superset against Clickhouse (and maybe MySQL?)

Pros: Superset is open source and flexible, with connectors to a huge set of data sources. It has some very nice display options and is built to handle data sizes that should exceed our medium-large targets. Superset can use LMS as an auth provider. It can provide alerts and emailed reports (if configured for it). It allows us to lock down report generation to just admin users. It is possible to just iframe reports directly into another app if we care to or need to. It is possible to tie an LMS MySQL read replica to Superset to get permissions and course data in directly, though I’d prefer we find a more elegant / inexpensive / secure solution.

Cons: Not just a Superset con, but for any solution we will need the capability to move some currently-non-xAPI data into the data lake (or create xAPI profiles for this data), such as course data and user permissions. We would probably need a way to work with the Superset API to create permission groups that would be granular enough to meet our needs; @jillvogel is currently looking into this as part of an investigation spike. It’s a big app that may be somewhat more complicated to set up as a Tutor plugin or standalone.

Alternatives: I’m not currently aware of any other open source alternatives that check our boxes.
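
As a concrete illustration of the “Superset can use LMS as an auth provider” point, here is a hypothetical superset_config.py fragment using Superset’s built-in Flask-AppBuilder OAuth support. The LMS endpoint paths, client credentials, and scopes are assumptions for the sketch, not a tested configuration.

    # Hypothetical superset_config.py fragment: log Superset users in via
    # the LMS as an OAuth2 provider. The URLs, client id/secret, and
    # scopes below are placeholders, not a verified setup.
    from flask_appbuilder.security.manager import AUTH_OAUTH

    AUTH_TYPE = AUTH_OAUTH
    AUTH_USER_REGISTRATION = True          # auto-create Superset users on first login
    AUTH_USER_REGISTRATION_ROLE = "Gamma"  # default to a read-mostly Superset role

    OAUTH_PROVIDERS = [
        {
            "name": "openedx",
            "icon": "fa-graduation-cap",
            "token_key": "access_token",
            "remote_app": {
                "client_id": "superset-client-id",        # created on the LMS side
                "client_secret": "superset-client-secret",
                "api_base_url": "https://lms.example.com/oauth2/",
                "access_token_url": "https://lms.example.com/oauth2/access_token/",
                "authorize_url": "https://lms.example.com/oauth2/authorize/",
                "client_kwargs": {"scope": "user_id profile email"},
            },
        }
    ]

A custom security manager would still be needed to map LMS user info onto Superset roles and row-level permissions, which is part of what the investigation spike mentioned above is meant to answer.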

Current open questions:

  1. How much more work do we need to put into our xAPI transformations to get them to a fully accepted state?

    1. First set of work is here: https://github.com/openedx/event-routing-backends/milestone/1

    2. More investigations are here:

      1. https://github.com/openedx/event-routing-backends/issues/217

      2. https://github.com/openedx/event-routing-backends/issues/219

      3. https://github.com/openedx/data-wg/issues/15

  2. How do we get permissions and course data into the data lake?

    1. I’m hoping to get more information on the storage layer before writing a spike for this.

  3. Can Superset handle our permissions model in a graceful and secure way?

    1. Spike: https://github.com/openedx/data-wg/issues/21

  4. How realistic is the redis streams project? I think it has a lot of value both for this group and for Open edX as a whole, but it is also the most difficult and time-consuming project on this list!

    1. Spike: https://github.com/openedx/data-wg/issues/22

  5. Is Clickhouse all that I hope? If so, will OpenFUN accept it as a backend?

    1. Spike: https://github.com/openedx/data-wg/issues/24

  6. How does performance look on these reports with full permissions and several billion rows?

    1. I’m hoping to get more information on the storage layer before writing a spike for this.