State of Data Working Group (March 2023)

Date: 17 March 2023

Introduction

The Data Working Group focuses on advancing the data and analytics capabilities of the Open edX platform. Our primary goals are to establish and promote data and analytics best practices across the ecosystem, and to ensure that the Open edX platform provides and supports analytics capabilities, specifically for small- to medium-sized Open edX deployments.

Consider this report a compilation of what we’ve done over the past year, what we plan to do in the next six months, and a wider vision of the group’s future beyond that.

Accomplishments

Over the past year, our major accomplishments were:

  • Initiating and specifying three major milestone projects: Open Analytics Reference System (OARS), Increasing Message Bus Adoption, and Tracking Log Event Cleanup. These projects arose from the experiences and discussions of Data WG members, but they really took shape once Brian Mesick joined tCRIL to head the Platform Data effort and Jenna Makowski directed her Product Management lens at the pain points and use cases expressed by the community.

  • xAPI schemas and tracking events conversion

  • xAPI learning traces in the data lake

Further, we’ve made significant progress on the following initiatives:

  • Discovery and Specification for OARS V1
    We have decided to replace Insights with the Open Analytics Reference System, a light, flexible data pipeline based primarily on third-party open source solutions for routing, storing, and analyzing Open edX event data. The OARS architecture will be cost-effective for small- to medium-sized Open edX sites, will scale appropriately for their expected data use, and will comprise loosely coupled components that operators can exchange if their deployments require it. We have investigated and decided on the technologies for the reference implementation, which include ClickHouse, Ralph, and Superset.

  • Reference implementation for OARS V1
    Alongside the discovery and specification, we have also begun the reference implementation for OARS in the Tutor environment.

  • Redis Streams as a Message Bus
    We have a funded contribution project in progress with OpenCraft to complete a reference implementation of Redis Streams as a second concrete implementation of the Open edX message bus. This should let all Open edX operators gain the benefits of asynchronous messaging, and it advances our plans for a less coupled architecture.

  • Google Analytics 4 upgrade
    We also have a funded contribution project in progress with Raccoon Gang to upgrade our Google Analytics support and expand GA tracking into several microfrontends that did not have GA support added when they were moved out of edx-platform.
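
The asynchronous messaging mentioned under Redis Streams as a Message Bus can be pictured with a small in-memory sketch. This is not Redis and not the Open edX message bus API: the class InMemoryStream, the group names, and the event payloads are all hypothetical, chosen only to show how one producer and several independent consumer groups stay decoupled.

```python
from collections import defaultdict


class InMemoryStream:
    """Toy stand-in for an append-only stream with consumer groups (hypothetical)."""

    def __init__(self):
        self.entries = []                 # ordered list of (entry_id, payload)
        self.cursors = defaultdict(int)   # group -> index of next undelivered entry
        self.pending = defaultdict(dict)  # group -> {entry_id: payload} awaiting ack

    def add(self, payload):
        """Producer side: append an event and return its ID."""
        entry_id = len(self.entries) + 1
        self.entries.append((entry_id, payload))
        return entry_id

    def read_group(self, group, count=10):
        """Consumer side: deliver up to `count` entries this group has not seen."""
        start = self.cursors[group]
        batch = self.entries[start:start + count]
        self.cursors[group] = start + len(batch)
        for entry_id, payload in batch:
            self.pending[group][entry_id] = payload
        return batch

    def ack(self, group, entry_id):
        """Mark an entry as processed by this group; True if it was pending."""
        return self.pending[group].pop(entry_id, None) is not None


stream = InMemoryStream()

# The producer publishes and moves on; it never calls a consumer directly.
stream.add({"type": "enrollment.created", "course": "course-v1:Demo+X+2023"})
stream.add({"type": "certificate.created", "course": "course-v1:Demo+X+2023"})

# Each group reads at its own pace and keeps its own position in the stream.
analytics_batch = stream.read_group("analytics")
notifications_batch = stream.read_group("notifications", count=1)

for entry_id, _payload in analytics_batch:
    stream.ack("analytics", entry_id)
```

In this sketch the "analytics" group has consumed and acknowledged both events, while the "notifications" group has read only the first and not yet acknowledged it. Neither group affects the other or the producer, which is the decoupling a message bus is meant to provide.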

Concrete Plans - Next 6 Months

Over the next six months, the Data Working Group expects to start seeing the fruits of our planning. We hope to make a functional OARS V1 available for testing and feedback, and to make detailed plans for OARS V2, which will center on returning processed analytical data to instructors directly in the CMS, reacting to early feedback, and growing the dataset and available reports.

We expect the funded contributions for Redis as a Message Bus and Google Analytics 4 Upgrade to be completed in this time frame, unblocking investigations into using the message bus as a delivery mechanism for analytics events. We will also begin our efforts to make those events more reliable with the Tracking Log Event Cleanup work.

Our high-level initiatives are available on the Open edX Roadmap’s Data tab, and epic-level task tracking happens in the Data Working Group project board.

Future Vision for the Group

Moving beyond the next six months, the Data Working Group is looking towards forming a cohesive xAPI profile for Open edX, growing our community’s data capabilities, and collaborating with other working groups on the foundational pieces of adaptive learning. With upcoming projects across the platform for a tagging and taxonomy system, modular learning, and standards-based learning traces in a data lake, we will have a solid foundation to push the boundaries of how learner experiences can be tailored to improve educational outcomes!

Deep Dive: Open Analytics Reference System

The following links have details about the high-level architectural decisions that have informed OARS:

Code-level details on the technical progress of OARS can be found in the following repositories:

  • tutor-contrib-oars - OARS-specific configuration for tying all of the above systems together.

    • Adds and configures event-routing-backends in the LMS to transform tracking events into xAPI statements and send them to Ralph

    • Adds OAuth configuration so that Superset can authenticate users against the LMS and authorize which course data they have access to

    • Adds default datasets, dashboards, and charts to Superset

    • Configures ClickHouse database, tables, views, and users
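
To make the tracking-event-to-xAPI step concrete, here is a minimal sketch of what such a transformation might look like. It is not the event-routing-backends implementation: the function to_xapi, the VERBS table, the lms.example.com URLs, and the sample event values are assumptions made for illustration. Only the general statement shape (actor, verb, object, timestamp) follows the xAPI specification.

```python
import json

# Hypothetical mapping from tracking event names to xAPI verbs.
# The real mapping lives in event-routing-backends and may differ.
VERBS = {
    "edx.course.enrollment.activated": (
        "http://adlnet.gov/expapi/verbs/registered",
        "registered",
    ),
}


def to_xapi(event, lms_root="https://lms.example.com"):
    """Return an xAPI-style statement for a tracking event, or None if unmapped."""
    verb = VERBS.get(event["name"])
    if verb is None:
        return None  # events without a mapping are simply skipped in this sketch
    verb_id, verb_label = verb
    return {
        # A real pipeline would likely use an anonymized ID rather than the username.
        "actor": {
            "objectType": "Agent",
            "account": {"homePage": lms_root, "name": event["username"]},
        },
        "verb": {"id": verb_id, "display": {"en": verb_label}},
        "object": {
            "objectType": "Activity",
            "id": f"{lms_root}/course/{event['context']['course_id']}",
        },
        "timestamp": event["time"],
    }


# Sample tracking event with made-up values.
tracking_event = {
    "name": "edx.course.enrollment.activated",
    "username": "learner1",
    "time": "2023-03-17T12:00:00+00:00",
    "context": {"course_id": "course-v1:Demo+X+2023"},
}

statement = to_xapi(tracking_event)
print(json.dumps(statement, indent=2))
```

A statement of this shape is what a learning record store such as Ralph would receive, and from there it can be stored in ClickHouse and charted in Superset.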