edx-platform unit tests migration from Jenkins to GitHub Actions
Context
We recently started improving our CI infrastructure, prompted by the issues we have had maintaining our Jenkins-based CI setup. Around the same time, Travis introduced changes, without any announcement, that suddenly limited our number of parallel workers and the build minutes available to us, so we decided to move away from Travis and use GitHub Actions workflows for our CI needs. After a successful initial phase we saw a significant improvement in build times and in our ability to maintain and scale our infrastructure, and we started moving some of our Jenkins-based jobs, such as the jobs that upgrade our Python dependencies, to GitHub workflows.
edx-platform was the last place where we were still using Jenkins for CI. The obstacle to moving edx-platform to another CI solution was the duration of test execution. We had been using xdist to split our unit tests across multiple worker machines and run them in parallel so that they finished in a reasonable time. Even so, the unit tests were the slowest of all our CI checks, since we had recently moved our quality checks to GitHub Actions: Jenkins-based unit test builds were still taking 40 minutes on average to complete. Part of the problem we identified was the time our scripts took to provision enough resources on AWS (provisioning ~12 EC2 instances took around 10 minutes). Given our recent successful integrations with GitHub Actions, we started looking into ways to split our unit test modules and at least match the execution time we had on Jenkins.
While doing these POCs, we discovered that a lot of paver tasks had added complexity to our unit test setup. One part of this effort was therefore to simplify the process and get rid of the unnecessary complexity.
GitHub Actions with Pytest and K8S
This was one of the first approaches we explored, because of its promise of scalability and the strong community support around its setup.
Motivation for sharding manually:
Some of our initial efforts were spent on using xdist or a similar tool to automatically collect our tests and split them evenly across multiple runners so they could run in parallel.
We looked into the possibility of using xdist to SSH into worker instances and execute the tests there. However, GitHub Actions supports self-hosted runners, queues jobs for us, and reports check runs through its newer Checks API, which seemed like a far better strategy than setting up custom GitHub webhooks to receive GitHub events and trigger test runs. We had been using a similar strategy with our Jenkins-based test infrastructure and had major issues with it: occasionally one of our unit test shards would not report its test results back, and we could not deploy that specific commit because of our GoCD rules.
We also tried pytest-split to collect the tests and divide them evenly across a GitHub workflow matrix, so that GitHub could run the resulting shards in parallel using its matrix strategy. One of its features, which records test execution times and is supposed to improve execution time on subsequent runs, did not work for us. We were left with some shards taking significantly longer than others, and we ended up waiting ~20 minutes for the entire build to finish.
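To illustrate what duration-aware splitting is meant to achieve, here is a minimal sketch of a greedy, longest-first assignment of tests to shards. It is not pytest-split's actual algorithm; the function name `split_by_duration` and the sample timings are hypothetical.

```python
import heapq


def split_by_duration(durations, shard_count):
    """Assign each test to the currently least-loaded shard, longest tests first."""
    # Heap entries are (total_seconds, shard_index), so the lightest shard pops first.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Sort by descending duration, then by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for test, seconds in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(test)
        heapq.heappush(heap, (total + seconds, index))
    return shards


# Hypothetical recorded timings, in seconds.
sample = {
    "test_a": 300.0,
    "test_b": 200.0,
    "test_c": 120.0,
    "test_d": 100.0,
    "test_e": 80.0,
}
shards = split_by_duration(sample, 2)
totals = [sum(sample[test] for test in shard) for shard in shards]
```

A split like this is only as good as the recorded timings it starts from, which is why missing or stale duration data leads to uneven shards.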
Why did we choose this sharding strategy:
We came up with the list of shards after spending some time recording test execution times and identifying the slowest parts of our codebase. For example, lms/djangoapps/courseware is currently one of the slowest shards and takes ~12 minutes to finish, so we have allocated a single shard to the tests for this app. We could have gone further and divided its tests up to get better performance and make better use of the other shards, but we would soon have run into issues and would not have been able to maintain this in the long term, since any newly added tests would have to be assigned to shards manually. Adding new Django apps to edx-platform, on the other hand, is generally discouraged and is not done frequently. Currently some shards finish earlier than others, but we follow these rules when setting up shards:
We’ve listed Django apps in test shards in the same order in which they appear in our codebase, except for one app under openedx (openedx/core/djangoapps/cors_csrf/) that had to be placed out of order because one of its tests was failing otherwise
We don’t divide up a Django app across shards or within the same shard
We don’t mix apps from one top-level module into other shards (e.g. no lms app can be added to a cms shard), to keep the sharding relatively simple and easy to understand
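The rules above can be expressed as a small validation step. The following is a sketch only: the shard names, the layout of the `SHARDS` mapping, and the `validate_shards` helper are hypothetical and do not reflect the actual shard configuration in edx-platform.

```python
from collections import Counter

# Hypothetical shard layout; shard names and app groupings are illustrative only.
SHARDS = {
    "lms-1": ["lms/djangoapps/courseware"],
    "lms-2": ["lms/djangoapps/grades", "lms/djangoapps/instructor"],
    "openedx-1": ["openedx/core/djangoapps/cors_csrf"],
    "cms-1": ["cms/djangoapps/contentstore"],
}


def validate_shards(shards):
    """Return a list of rule violations; an empty list means the layout is valid."""
    problems = []
    # Rule: an app is never divided across shards or repeated within one shard.
    counts = Counter(app for apps in shards.values() for app in apps)
    for app, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"{app} is listed {count} times")
    # Rule: every app in a shard belongs to the same top-level module.
    for name, apps in shards.items():
        modules = {app.split("/", 1)[0] for app in apps}
        if len(modules) > 1:
            problems.append(f"{name} mixes top-level modules: {sorted(modules)}")
    return problems


problems = validate_shards(SHARDS)
```

The ordering rule is deliberately not checked here, since the document notes that one app had to be placed out of order.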
How do I know all tests are being run with this strategy:
We’ve added a new GitHub workflow-based check that makes sure the number of tests run with this sharding strategy is exactly the same as the total number of tests in our entire test suite.
It runs pytest --collect-only to collect the test counts and compares them, failing the check if they do not match. This catches any newly added tests or apps that are not covered by the shards.
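A check along these lines could look roughly like the sketch below. The helper names are hypothetical, and the parsing assumes that pytest's quiet collection output ends with a summary line such as "120 tests collected in 1.20s"; the real workflow may obtain its counts differently.

```python
import re
import subprocess


def parse_collected_count(output):
    """Extract the number of collected tests from pytest's summary line."""
    # Assumed summary formats: "120 tests collected ..." or "1 test collected ...".
    # If tests are deselected, the assumed form is "3/5 tests collected" and 3 is returned.
    match = re.search(r"(\d+)(?:/\d+)? tests? collected", output)
    if match is None:
        raise ValueError("could not find a collection summary in pytest output")
    return int(match.group(1))


def collect_count(paths):
    """Run `pytest --collect-only -q` on the given paths and return the test count."""
    result = subprocess.run(
        ["pytest", "--collect-only", "-q", *paths],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_collected_count(result.stdout)


def check_shards_cover_suite(shard_counts, total_count):
    """Exit with an error when the shard counts do not add up to the suite's count."""
    sharded_total = sum(shard_counts.values())
    if sharded_total != total_count:
        raise SystemExit(
            f"shards collect {sharded_total} tests but the suite has {total_count}"
        )
    return sharded_total
```

In use, `collect_count` would be called once per shard with that shard's paths and once for the whole suite, and the results passed to `check_shards_cover_suite`.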
Where should I place my new Django app:
Keeping the sharding strategy discussed above in mind, please follow these steps when deciding where to place your app:
Place your app in a shard that takes relatively less time to finish
Try to place your app in the same order in which it appears in the codebase
If a shard is taking ~11 minutes to finish, try to place your app in another shard with the same top-level module (i.e. an app under lms should be placed in one of the lms shards)
If you notice a shard is becoming a bottleneck and taking ~13 minutes to finish, please raise an issue in:
#cc-edx-platform in the openedx Slack
#tech-arbi-bom if you have access to 2U’s Slack
We’ll be adding tooling to automatically detect and report if any of the shards is exhausted and taking significantly longer to finish
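As a rough illustration of the placement guidelines, the sketch below picks the fastest shard within an app's top-level module and flags shards at or above the ~13 minute mark. The `<module>-<n>` shard naming, the helper names, and the timings are all hypothetical.

```python
BOTTLENECK_MINUTES = 13  # from the guideline above: raise an issue at ~13 minutes


def pick_shard(app_path, shard_minutes):
    """Pick the fastest shard that shares the app's top-level module."""
    module = app_path.split("/", 1)[0]
    # Assumes shards are named "<top-level module>-<number>", e.g. "lms-2".
    candidates = {
        name: minutes
        for name, minutes in shard_minutes.items()
        if name.startswith(module + "-")
    }
    if not candidates:
        raise ValueError(f"no shard found for top-level module {module!r}")
    return min(candidates, key=lambda name: (candidates[name], name))


def bottlenecks(shard_minutes):
    """Return the shards whose duration has reached the bottleneck threshold."""
    return sorted(
        name for name, minutes in shard_minutes.items()
        if minutes >= BOTTLENECK_MINUTES
    )


# Hypothetical shard durations, in minutes.
timings = {"lms-1": 12.0, "lms-2": 7.5, "lms-3": 9.0, "cms-1": 5.0}
chosen = pick_shard("lms/djangoapps/my_new_app", timings)
```

The position of the app within the chosen shard still has to follow the codebase ordering rule, which this sketch does not handle.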
Why did we choose Kubernetes based runners:
We decided on Kubernetes-based self-hosted runners after an initial POC exploring some of the GitHub Actions runners developed by the community. We chose actions-runner-controller because of the following advantages:
It offers horizontal autoscaling of runners (based on Kubernetes’s horizontal pod autoscaling) to quickly create new runners from our runner Docker image
It offers a GitHub webhook server out of the box for quicker autoscaling based on GitHub events (e.g. PR events, check runs, etc.)
It offers a robust and scalable infrastructure setup by building on top of Kubernetes’s features (e.g. automatic retries on failed instances, quick recovery, and the ability to specify the resources our runners require, which gives us the flexibility to scale the infrastructure in the future and use the available resources optimally)
Autoscaling with EKS is quick: spinning up a new runner pod takes around 10 seconds
Kubernetes’s cluster autoscaling scales our cluster up quickly and takes instances down when we are not using those nodes, which saves us cost
Since we can run our custom images as runners, we can build runner images that already contain the dependencies required to run our tests. Runner images are cached by EKS, so new containers spin up relatively quickly
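For readers unfamiliar with horizontal autoscaling, the scaling rule that Kubernetes documents for its Horizontal Pod Autoscaler can be sketched as below. This shows the general Kubernetes formula only; which metric actions-runner-controller feeds into its own autoscaler is not covered in this document, so the "percentage of busy runners" used in the example is an assumption.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Kubernetes HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# Assumed example: 4 runners that are 90% busy, with a target of 60% busy.
scale_up = desired_replicas(4, 90, 60)
# Assumed example: 6 runners that are 20% busy, with the same target.
scale_down = desired_replicas(6, 20, 60)
```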
New runner docker image
We’ve built a new lightweight Docker image (located here: https://github.com/edx/edx-platform/blob/master/scripts/ci-runner.Dockerfile ) that runs our test suite and works as a self-hosted runner.
As discussed above we install all necessary dependencies when building the image to reduce the time it takes to set up dependencies while running tests.
One significant advantage of our custom runner over GitHub’s runners shows up when we install edx-platform’s Python requirements. On GitHub’s runners this takes ~3 minutes, adding 3 minutes to each shard’s test execution, whereas our custom image takes ~25 seconds to do the same. We can also rebuild this image periodically to keep the installed requirements up to date. We have a new GitHub Actions workflow that we can run to build this image and publish it to Docker Hub.
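The effect of the faster install can be worked out from the figures above. The shard count of 12 is taken from the matrix setup described elsewhere in this document and is an assumption as far as this calculation is concerned.

```python
GITHUB_RUNNER_INSTALL_SECONDS = 3 * 60  # ~3 minutes on GitHub's runners
CUSTOM_IMAGE_INSTALL_SECONDS = 25       # ~25 seconds on the custom image
SHARD_COUNT = 12                        # assumption: 12 shards per build

# Shards run in parallel, so this is also the approximate wall-clock saving per build.
saved_per_shard_seconds = GITHUB_RUNNER_INSTALL_SECONDS - CUSTOM_IMAGE_INSTALL_SECONDS
# Total runner time saved across all shards of one build, in minutes.
saved_runner_minutes = saved_per_shard_seconds * SHARD_COUNT / 60

print(f"saved per shard: {saved_per_shard_seconds} s")
print(f"saved per build: {saved_runner_minutes:.0f} runner-minutes")
```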
Alternatives:
GitHub Actions with Paver and EC2
In this approach, we used our existing AMIs to spin up EC2 instances and set those instances up as self-hosted runners for our GitHub Actions. We used the GitHub Actions matrix strategy to run all our unit tests in parallel on 12 shards (a strategy that we had already used when setting up self-hosted runners on Kubernetes). It took almost 20 minutes for our test suite to finish. Our concerns with maintaining such a setup were:
Paver tasks are far too complicated and too specific to our existing unit test setup
We didn’t have a quick and reliable way to provision EC2 instances on the fly
Our existing AMIs for running unit tests are far too bloated to be set up as lightweight action runners
GitHub Actions with Pytest on GitHub’s runners
This was one of the major options we considered during this process: use GitHub’s infrastructure, apply our existing findings, and spin up enough workers in parallel to achieve a better execution time than the Jenkins-based setup. We were able to run our entire suite in ~20 minutes with 12 workers, but we had trouble getting optimized results, for the following reasons:
Not enough parallel workers to meet our demand
We have 60 parallel workers available under our GitHub plan, and we were already using GitHub Actions runners for a lot of tasks across our repositories (CI, building Docker images, running upgrade jobs, etc.). More effort has been underway since then to help other teams move away from Travis. Moreover, we had recently switched our quality checks over to GitHub Actions, and they have been using ~6 workers. At this pace we would frequently run into situations where PRs end up in long queues
Workers are general-purpose and have limited resources to exploit (limited CPU capacity per worker)
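A back-of-the-envelope calculation shows why the worker limit leads to queues. It assumes that the ~6 quality-check workers are needed per PR and ignores every other job competing for the same pool, so the real headroom would be smaller.

```python
PLAN_WORKERS = 60       # parallel workers available on the GitHub plan
UNIT_TEST_WORKERS = 12  # one worker per unit test shard
QUALITY_WORKERS = 6     # ~6 workers for quality checks (assumed to be per PR)

workers_per_pr = UNIT_TEST_WORKERS + QUALITY_WORKERS
# PRs whose checks can run fully in parallel before new jobs start to queue.
concurrent_prs = PLAN_WORKERS // workers_per_pr

print(f"workers per PR: {workers_per_pr}")
print(f"PRs that fit at once: {concurrent_prs}")
```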
Future Plans
We’ll keep iterating on this infrastructure and sharding setup: updating the infrastructure so that many more PRs can run their checks in parallel, and updating the shards to use the available resources optimally, which should further improve CI performance.
GitHub plans to let us specify resources for runners (i.e. CPU count and memory) on their enterprise plan. We might want to switch over to the runners provided by GitHub if we see that we can scale up and achieve better performance with them.