Repository Health Data & Dashboards
To keep the hundreds of software repositories used in Open edX in good repair, we’ve developed a framework for defining, collecting, and presenting health metrics. The key elements of this framework are:
Checks - A check is a single health metric for a software repository that can be determined programmatically (either from the repository itself or from a related data source like the GitHub API). Examples include “how recently was a commit merged to the main development branch?” and “does it claim to support Python 3.12 yet?”
Jobs - A job calculates the value of multiple checks for one or more software repositories and records the results in a persistent storage location. Although a job may take over an hour to run, the stored results can power dashboards that answer most relevant questions in seconds. Jobs are typically run daily, which is considered a reasonable balance between data freshness and computation cost.
Storage - Once collected by a job, check results are written to persistent storage for future reference. The most comprehensive data storage is currently YAML files in a git repository; this is easy to inspect and preserves a history of every check result ever calculated and persisted to that data repository. Derived storage formats such as CSV and SQLite files are also generated and saved as a more convenient basis for powering dashboards.
Dashboards - A dashboard is one way of viewing the health data persisted by the jobs. This can be a Google Sheet, a console report, an interactive web report, etc. Each dashboard is typically created with a particular project or audience in mind, and is customized to present just the information most relevant for that purpose. A dashboard can either be directly updated by a job or operate off of the results stored by a job.
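The relationship between checks, jobs, and storage can be illustrated with a minimal sketch. It is not the actual implementation: the check functions, the `results` table layout, and the file names are all hypothetical, and only the derived CSV and SQLite formats are shown because they need nothing beyond the Python standard library.

```python
import csv
import sqlite3
from pathlib import Path


# Hypothetical checks: each takes a repository path and returns one value.
def has_readme(repo: Path) -> bool:
    return any(repo.glob("README*"))


def has_makefile(repo: Path) -> bool:
    return (repo / "Makefile").is_file()


CHECKS = {"has_readme": has_readme, "has_makefile": has_makefile}


def run_job(repos, checks=CHECKS):
    """Evaluate every check against every repository; return flat result rows."""
    return [
        (repo.name, name, str(check(repo)))
        for repo in repos
        for name, check in checks.items()
    ]


def store(rows, db_path, csv_path):
    """Persist rows to SQLite and CSV (assumed layout: repo, check_name, value)."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS results "
                "(repo TEXT, check_name TEXT, value TEXT)"
            )
            conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    finally:
        conn.close()
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["repo", "check_name", "value"])
        writer.writerows(rows)
```

A dashboard built on such output would only need to read the stored rows, which is why it can respond in seconds even when the job that produced them ran for much longer.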
Dashboards
The following dashboards have been created so far:
Streamlit - Presents a curated subset of check results intended for teams responsible for maintaining one or more Open edX repositories. It suggests a priority order for tackling identified tech debt, along with the top reason(s) for addressing each item. It also provides a rich data grid of the full health data as an aid for updating the dashboard configuration. It is an interactive web report, implemented using https://streamlit.io/, that runs on a developer’s laptop from a SQLite data file and a YAML configuration file. The latest version is under review at https://github.com/openedx/edx-repo-health/pull/466.
Console - A predecessor of the Streamlit dashboard which shares the same configuration but displays its output on the command line console instead of in a web page. Uses https://github.com/Textualize/rich for nice formatting.
Google Sheet - A grid of all the health data thrown into a spreadsheet for exploration. This was the first dashboard, chosen due to ease of implementation. It’s fairly convenient for architects and maintenance-focused squads to perform ad hoc queries of the data, but contains too much non-curated information to provide useful guidance for most engineers or managers. Only a 2U-internal version exists so far, although the code to generate it is open source.
Python Dependencies - This report, which is in the final stages of development, summarizes the health of the (often 3rd-party) Python packages installed by Open edX services. The initial use case is to aid in planning and executing an upgrade to a newer Python version, but the report is also expected to be useful in identifying dependencies that pose a risk to security or to future framework/language upgrades. Progress to completion can be tracked at https://github.com/edx/edx-arch-experiments/issues/231.
Metabase - An early experiment in providing a better UI than the Google Sheet for exploring the health data. It is arguably better than the spreadsheet for ad hoc queries and exploration, but custom queries and views are impractical to maintain and share without a shared server and further development effort. The code for this is 2U-private but trivial, and can be shared if others are interested in trying it. See https://www.metabase.com/ for more information about the tool.
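To make the idea of a curated, prioritized dashboard more concrete, here is a rough sketch of how a console-style report might turn stored results plus a configuration into an ordered list of items with reasons. Everything in it is an assumption made for illustration: the `results` table layout, the check names, and the configuration keys are hypothetical, and the real dashboards read their configuration from a YAML file rather than a Python literal.

```python
import sqlite3

# Hypothetical configuration: which check values count as tech debt,
# how urgent each one is, and the reason shown to the reader.
CONFIG = [
    {"check": "supports_python_312", "bad_value": "False", "priority": 1,
     "reason": "Needed for the next Python upgrade"},
    {"check": "has_readme", "bad_value": "False", "priority": 2,
     "reason": "Repository lacks basic documentation"},
]


def prioritized_items(conn, config=CONFIG):
    """Return (priority, repo, reason) tuples, most urgent first."""
    items = []
    for rule in config:
        cursor = conn.execute(
            "SELECT repo FROM results WHERE check_name = ? AND value = ?",
            (rule["check"], rule["bad_value"]),
        )
        items.extend((rule["priority"], row[0], rule["reason"]) for row in cursor)
    return sorted(items)


def render(items):
    """Format the prioritized items as plain console lines."""
    return [f"P{priority}  {repo}: {reason}" for priority, repo, reason in items]


if __name__ == "__main__":
    # Tiny in-memory data set standing in for the SQLite file a job would produce.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE results (repo TEXT, check_name TEXT, value TEXT)")
    demo.execute("INSERT INTO results VALUES ('repo-a', 'has_readme', 'False')")
    print("\n".join(render(prioritized_items(demo))))
    demo.close()
```

Keeping the rules in configuration rather than in code is what would let the same data file serve different audiences with different priorities.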
Where It Is Implemented
The implementation of the repository health framework itself is distributed across several repositories:
https://github.com/openedx/pytest-repo-health - This is the primary tool we use to define checks. It is an extension of the pytest unit test runner for Python which allows us to implement checks as concise Python functions. It also contains pytest fixtures for easily collecting information from git and GitHub, and would be a logical home for other such Open edX-agnostic utility functions.
https://github.com/openedx/edx-repo-health - This is where the checks and dashboards relevant to Open edX are defined. pytest-repo-health was intended to be useful for a variety of software ecosystems (Open edX, Django, pytest, etc.), so the set of checks to be run is taken as an argument. (But this split may have been a premature optimization, as this is currently the only place where checks have been defined.)
https://github.com/openedx/.github - This is the home of the openedx GitHub organization’s reusable GitHub Actions workflows, including one implementing a health data collection job. This job is parameterized to support different sets of checks, repositories to run them against, and storage repositories.
edx/repo-health-data - This is 2U’s original health data storage repository, and is private to 2U employees and contractors because it includes information about private repositories. It also contains 2U’s instance of the job workflow from openedx/.github.
openedx/repo-health-data (to be created) - This will be the storage repository for all check results against Open edX repositories which are considered suitable for public distribution. It will also hold the instance of the job workflow from openedx/.github used to collect that data.
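As an illustration of what “checks as concise Python functions” might look like, the sketch below answers the “does it claim to support Python 3.12 yet?” question by scanning packaging files for trove classifiers. It is written as a plain function so it runs without pytest; the parameter names `repo_path` and `all_results` and the result keys are assumptions for this example and should not be read as the confirmed pytest-repo-health fixture API.

```python
import re
from pathlib import Path

# Matches classifiers such as "Programming Language :: Python :: 3.12".
CLASSIFIER = re.compile(r"Programming Language :: Python :: (\d+\.\d+)")


def check_python_versions(repo_path, all_results):
    """Record which Python versions the repository's packaging metadata claims."""
    claimed = set()
    for filename in ("setup.py", "setup.cfg", "pyproject.toml"):
        candidate = Path(repo_path) / filename
        if candidate.is_file():
            claimed.update(CLASSIFIER.findall(candidate.read_text(encoding="utf-8")))
    # Sort numerically so that 3.8 comes before 3.12.
    all_results["python_versions"] = sorted(
        claimed, key=lambda version: tuple(int(part) for part in version.split("."))
    )
    all_results["supports_python_312"] = "3.12" in claimed
```

Calling `check_python_versions(".", results)` with an empty dictionary fills `results` with the two keys above; a job would then persist those values alongside the output of the other checks.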