Repository Health Data & Dashboards

To keep the hundreds of software repositories used in Open edX in good repair, we’ve developed a framework for defining, collecting, and presenting health metrics. The key elements of this framework are:

  • Checks - A check is a single health metric for a software repository that can be determined programmatically (either from the repository itself or from a related data source like the GitHub API). Examples include “how recently was a commit merged to the main development branch?” and “does it claim to support Python 3.12 yet?”

  • Jobs - A job calculates the value of multiple checks for one or more software repositories and records the results in a persistent storage location. Although the job may take over an hour to run, the stored results can power dashboards which can answer most relevant questions in seconds. Jobs are typically run daily, which is considered a reasonable balance between up-to-date data and computation cost.

  • Storage - Once collected by a job, check results are written to persistent storage for future reference. The most comprehensive data storage is currently YAML files in a git repository; this is easy to inspect and preserves a history of every check result ever calculated and persisted to that data repository. Derived storage formats such as CSV and SQLite files are also generated and saved as a more convenient basis for powering dashboards.

  • Dashboards - A dashboard is one way of viewing the health data persisted by the jobs. It can be a Google Sheet, a console report, an interactive web report, etc. Each dashboard is typically created with a particular project or audience in mind, and is customized to present just the information most relevant for that purpose. A dashboard can either be updated directly by a job or read from the results stored by a job.
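To make the relationship between these four elements concrete, here is a minimal sketch of the whole flow using only the Python standard library. It is not the framework's actual code: the repository snapshots, check names, row layout, and the stale_repos query are all hypothetical, and a real job would gather its inputs from git checkouts or the GitHub API rather than from hard-coded data.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical repository snapshots (assumption: invented for illustration).
REPOS = [
    {"name": "example-org/repo-a", "last_merge": date(2024, 5, 1),
     "classifiers": ["Programming Language :: Python :: 3.12"]},
    {"name": "example-org/repo-b", "last_merge": date(2023, 1, 15),
     "classifiers": ["Programming Language :: Python :: 3.8"]},
]


# Checks: each one computes a single metric for a single repository.
def check_days_since_last_merge(repo, today):
    return (today - repo["last_merge"]).days


def check_claims_python_312(repo, today):
    return "Programming Language :: Python :: 3.12" in repo["classifiers"]


CHECKS = {
    "days_since_last_merge": check_days_since_last_merge,
    "claims_python_312": check_claims_python_312,
}


# Job: run every check against every repository, one row per result.
def run_job(repos, checks, today):
    return [
        {"repo": repo["name"], "check_name": name,
         "value": str(check(repo, today))}
        for repo in repos
        for name, check in checks.items()
    ]


# Storage: derived formats that are convenient for dashboards.
def store_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["repo", "check_name", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def store_sqlite(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (repo TEXT, check_name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (:repo, :check_name, :value)", rows
    )
    return conn


# Dashboard: a fast query over stored results instead of re-running checks.
def stale_repos(conn, max_days):
    cursor = conn.execute(
        "SELECT repo FROM results "
        "WHERE check_name = 'days_since_last_merge' "
        "AND CAST(value AS INTEGER) > ? ORDER BY repo",
        (max_days,),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    rows = run_job(REPOS, CHECKS, today=date(2024, 6, 1))
    print(store_csv(rows))
    print(stale_repos(store_sqlite(rows), max_days=90))
```

The point of the split is visible even at this scale: the slow part (run_job) happens once, and any number of dashboard-style queries can then be answered from the stored rows.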

Dashboards

The following dashboards have been created so far:

Where It Is Implemented

The implementation of the repository health framework itself is distributed across several repositories:

  • openedx/pytest-repo-health - This is the primary tool we use to define checks. It is an extension of the pytest unit test runner for Python which allows us to implement checks as concise Python functions. It also contains pytest fixtures for easily collecting information from git and GitHub, and would be a logical home for other such Open edX-agnostic utility functions.

  • openedx/edx-repo-health - This is where the checks and dashboards relevant to Open edX are defined. pytest-repo-health was intended to be useful for a variety of software ecosystems (Open edX, Django, pytest, etc.), so the set of checks to be run is taken as an argument. (But this split may have been a premature optimization, as this is currently the only place where checks have been defined.)

  • openedx/.github - This is the home of the openedx GitHub organization’s reusable GitHub Actions workflows, including one implementing a health data collection job. This job is parameterized to support different sets of checks, repositories to run them against, and storage repositories.

  • edx/repo-health-data - This is 2U’s original health data storage repository, and is private to 2U employees and contractors because it includes information about private repositories. It also contains 2U’s instance of the job workflow from openedx/.github.

  • openedx/repo-health-data (to be created) - This will be the storage repository for all check results against Open edX repositories which are considered suitable for public distribution. It will also hold the instance of the job workflow from openedx/.github used to collect that data.
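As an illustration of what "checks as concise Python functions" can look like, the sketch below implements the "does it claim to support Python 3.12 yet?" example as a plain function. The parameter names repo_path and all_results, the result key, and the list of packaging files are assumptions made for this example; consult openedx/pytest-repo-health for the fixtures and conventions the plugin actually provides.

```python
import tempfile
from pathlib import Path

CLASSIFIER = "Programming Language :: Python :: 3.12"
# Assumption: these are the files worth scanning for a trove classifier.
PACKAGING_FILES = ("setup.py", "setup.cfg", "pyproject.toml")


def check_claims_python_312(repo_path, all_results):
    """Record whether any packaging file declares Python 3.12 support.

    repo_path is a path to a checked-out repository and all_results is a
    dict that collects results; both names are hypothetical stand-ins for
    whatever the real plugin passes in.
    """
    found = False
    for filename in PACKAGING_FILES:
        candidate = Path(repo_path) / filename
        if candidate.is_file() and CLASSIFIER in candidate.read_text(
            encoding="utf-8"
        ):
            found = True
            break
    all_results["claims_python_312"] = found


def demo():
    """Run the check against a throwaway directory standing in for a repo."""
    with tempfile.TemporaryDirectory() as repo_path:
        (Path(repo_path) / "setup.cfg").write_text(
            "[metadata]\nclassifiers =\n    " + CLASSIFIER + "\n",
            encoding="utf-8",
        )
        results = {}
        check_claims_python_312(repo_path, results)
        return results


if __name__ == "__main__":
    print(demo())
```

Because the check only reads from a path and writes to a dictionary, the same function could in principle be run by a job against any repository, with the collected dictionary then serialized to the storage repository.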