Data WG 2022-02-10 Meeting notes

Open edX Core Metrics

Date

Feb 10, 2022

Participants

@Edward Zarecor
@Andy Shultz
@Tobias Macey
@Sofiane Bebert
@Simon Chen
@Maria Fernanda Magallanes Z

Goals

Discussion topics

Time	Item	Presenter	Notes
	Abstraction Layer for Open edX Core Metrics		Review the current list of core metrics. We floated the idea of creating an abstraction layer for those core metrics. How should we go about implementing it? Figures compat module: AppSembler are planning to follow the Django release process. Events are a possible vanguard effort that faces the same problems we have with well-specified data models. Schema enforcement is possible with externalized definitions, which allows testing of contracts. The events team is using Avro schemas. Is there a way to demo Figures? Open question: using the devsite? Is there any sharable piece of the 2U DBT work? Andy has been thinking about this, but 2U relies on a lot of upstream data cleansing to produce the final set of metrics, so it probably makes the most sense to have a separate community implementation. With an event stream you could fill the tables that the edX analytics API uses. Tracking logs: they are a mess. Our approach has been to fix them in the warehouse. Dave advocates for an approach that fixes the data upstream.
	Plugin for data API is being explored as an option by Tobias at MITx	@Tobias Macey	The initial use case is exporting course content. Another need is enriching block ids with human-facing names and following breadcrumbs through course navigation. Andy asks about the list of users for the course: this data is “toxic waste”, full of PII that you would not want to get by mistake. Expected to use Celery facilities for asynchronous tasks. Does this need to cover CCX courses? Tobias does not think that it does. Dave worries this may not work by default. Would this plugin need to reach across boundaries and use ORM models from other apps? V1, probably yes. Hope that this can help define stable APIs. Is the long-term vision batch forever, or is streaming changes part of their future plans? See incremental updates over Kafka or Pulsar as a valuable future state. Focused on batch for now? Are there any design docs for the data API? MIT is currently in the early discovery phase, and the data that will be pulled hasn’t been fully defined. The API is focused on raw data for now, not core metrics. MIT code will be in a public repo, and MIT will publish details about design and progress in the Data Working Group channel. Is the plugin model baked enough to consider it a best practice for decomposing the monolith? CI protections are still weak. This was one of the goals of the events and signals work. Overall, stable APIs are emergent and not robust yet.
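
To illustrate the "enriching block ids with human-facing names" need noted above, here is a minimal sketch of building a breadcrumb trail for a block. It is not the MIT plugin's design, which the notes say is still in discovery. The block ids, display names, and the `BLOCKS` mapping are hypothetical, and real Open edX block ids are longer usage keys.

```python
# Hypothetical course structure: block id -> (display name, parent block id).
# The ids and names are illustrative only, not real Open edX usage keys.
BLOCKS = {
    "course-1": ("Demo Course", None),
    "chapter-1": ("Week 1", "course-1"),
    "sequential-1": ("Lesson 1", "chapter-1"),
    "vertical-1": ("Unit 1", "sequential-1"),
    "problem-1": ("Quiz Question 1", "vertical-1"),
}


def breadcrumbs(block_id, blocks):
    """Return display names from the course root down to block_id."""
    trail = []
    seen = set()
    current = block_id
    while current is not None:
        # Guard against malformed data in which a block is its own ancestor.
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        name, parent = blocks[current]
        trail.append(name)
        current = parent
    # The walk runs leaf-to-root, so reverse it for root-to-leaf order.
    trail.reverse()
    return trail


print(" > ".join(breadcrumbs("problem-1", BLOCKS)))
```

An export that carries this trail alongside each block id would let downstream reports show readable names without a second lookup against the platform.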

Action items

Ed will prod the group for content early in the week in weeks that have a scheduled meeting

Decisions

General consensus that schema enforcement across the platform would be valuable
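
As a rough illustration of schema enforcement with externalized definitions and contract testing, the sketch below validates an event against a schema written in the style of an Avro record. It is an assumption-laden example: the validator is hand-rolled with the standard library rather than an Avro library, it handles only three primitive types, and the event name and fields are hypothetical, not taken from the events team's schemas.

```python
import json

# Hypothetical externalized schema, written in the style of an Avro record.
# In practice this would live in its own file, shared by producer and consumer.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "CourseEnrollmentCreated",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "course_key", "type": "string"},
    {"name": "mode", "type": "string"},
    {"name": "is_active", "type": "boolean"}
  ]
}
"""

# Python types accepted for each primitive type name used in the schema above.
PRIMITIVES = {"long": int, "string": str, "boolean": bool}


def matches(value, type_name):
    """Check a value against one primitive type name."""
    if type_name == "long":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, PRIMITIVES[type_name])


def validate(event, schema):
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = []
    declared = {field["name"]: field["type"] for field in schema["fields"]}
    for name, type_name in declared.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not matches(event[name], type_name):
            errors.append(f"wrong type for {name}: expected {type_name}")
    for name in event:
        if name not in declared:
            errors.append(f"unexpected field: {name}")
    return errors


SCHEMA = json.loads(SCHEMA_JSON)
```

A contract test would then assert that `validate(sample_event, SCHEMA)` returns an empty list for each sample event a producer emits, so a change to either side fails in CI instead of in the warehouse.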