Flaky Test Process


Overview

Dependable test suites are a requirement for decreasing time to value through more frequent edx-platform deployments. This means flaky tests are no longer an acceptable nuisance. Instead, our tests need to be 100% trustworthy in our Continuous Integration (CI) and (near-)Continuous Deployment (CD) systems.

What is a flaky test? A flaky test is one that sometimes passes and sometimes fails. Most flaky tests are flaky because of how the test was written, and not due to an actual bug. However, that is not a certainty.

The flaky test process is to delete the flaky test, following the steps defined below.

Consequences of this process

This process has the following consequences:

  • Test suites become dependable. We can rely on them for deployments. You can rely on them with your PRs.

  • We save the people time and compute resources that were wasted on flaky tests.

  • We no longer pretend that flaky tests are a safety net against bugs.

  • We accept a potential increase in risk in exchange for improved time to value.

  • Product development teams will continue to balance the risk vs reward of fixing the test, and determine how to move forward, without the above costs.

How do I know I have encountered a flaky test?

If you have encountered a test that both fails and passes on the same commit, then the test is flaky.
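As a minimal sketch of this definition, the check below flags any test that both passed and failed on the same commit. The record shape (commit SHA, test ID, pass/fail flag) and the function name are hypothetical and chosen for illustration; they are not an edx-platform or GitHub Actions format.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return sorted (commit_sha, test_id) pairs with mixed results.

    `runs` is an iterable of (commit_sha, test_id, passed) tuples.
    This record shape is an assumption made for the example.
    """
    outcomes = defaultdict(set)
    for commit_sha, test_id, passed in runs:
        outcomes[(commit_sha, test_id)].add(bool(passed))
    # Flaky on a commit: both a pass and a failure were observed for it.
    return sorted(key for key, seen in outcomes.items() if seen == {True, False})
```

A test that passes on one commit and fails on a different commit is not flagged, since a code change could explain the difference.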

If the flakiness is related to the code changes in your PR:

  • Ensure you didn't introduce a new flaky test. If you did, fix it.

  • Ensure your code doesn't have a bug related to timing. If it does, fix the bug.

  • Ensure your code didn't cause a test to become flaky. If it did, either fix the test or follow this process as appropriate.

If the flaky test is unrelated to your code changes, follow the rest of this process in order.

Step 1: File a flaky test ticket in GitHub:

  Notes:

  • Old flaky_test tickets still need to be migrated from Jira to GitHub.

    1. See 2U-private link: https://2u-internal.atlassian.net/issues/?filter=10600

  • As of Oct 8, 2024, the flaky-test label was being added in https://github.com/openedx/repo-tools/pull/566, but had not yet landed.

  1. Check whether someone is already following this process for the same flaky test:

    1. Search for an existing Github ticket about this issue using the flaky-test label.

      1. Github search for flaky tests.

    2. Search on Github to see if the test was already deleted.

  2. Create (or update) a Github ticket with the following information:

    1. Create a ticket in the corresponding repo.

      1. [idea] Consider creating an openedx Issue template for flaky tests.

    2. Title: Something like "<testclass testcase> fails intermittently"

    3. Labels: At a minimum, the flaky-test label (this is needed for searches).

    4. Description: Include the following:

      1. A link to a failed build in GitHub Actions.

      2. A link to a passing run for the same commit.

      3. The Error Message and Stacktrace for a failed test.

        1. You should include this text in the Github ticket so it is searchable to others.

        2. Wrap each of these in code formatting.

      4. Putting it all together, you'll be entering something like this:

        This test fails intermittently and has been removed from the codebase.

        For triaging this bug:
        - Until proven one way or the other, it could be either the code that is flaky or the test that is flaky.
        - The test has been removed from the codebase, and thus the functionality is no longer covered by CI tests.
        - Evaluate whether or not it is or can be covered elsewhere and thus the test was unnecessary, or if there is now risk that a bug in the code could escape and thus the test should be fixed and re-enabled.

        TestClass:test_case failed in [this build on GitHub | https://...] and passed in the [subsequent build | {link}].

        Error Message
        ```
        Whatever the error message was for the testcase.
        ```

        Stacktrace
        ```
        The stacktrace from the test result.
        ```
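The search and ticket fields above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the search URL assumes GitHub's standard issue-search query syntax (`is:issue`, `label:`), and the body uses Markdown links on the assumption that the ticket is filed as a GitHub issue.

```python
from urllib.parse import quote_plus

# Markdown code fence, built this way to avoid nesting fences in this example.
FENCE = "`" * 3


def flaky_issue_search_url(repo, test_name):
    """Build a GitHub issue-search URL for existing flaky-test tickets."""
    query = f'is:issue label:flaky-test "{test_name}"'
    return f"https://github.com/{repo}/issues?q={quote_plus(query)}"


def flaky_issue_title(test_class, test_case):
    """Title in the suggested '<testclass testcase> fails intermittently' form."""
    return f"{test_class} {test_case} fails intermittently"


def flaky_issue_body(test_id, failed_url, passed_url, error_message, stacktrace):
    """Assemble the ticket description from the pieces listed above."""
    lines = [
        "This test fails intermittently and has been removed from the codebase.",
        "",
        f"{test_id} failed in [this build on GitHub]({failed_url}) "
        f"and passed in the [subsequent build]({passed_url}).",
        "",
        "Error Message",
        FENCE,
        error_message,
        FENCE,
        "",
        "Stacktrace",
        FENCE,
        stacktrace,
        FENCE,
    ]
    return "\n".join(lines)
```

Keeping the error message and stacktrace as plain text inside the body is what makes the ticket findable by the next person who searches for the same failure.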

Step 2: Make a PR deleting the flaky test(s):

  1. Delete the flaky test(s) in a single commit in a new PR (so that it is easily cherry-picked by others.)

    1. Include the Github link in your commit message, e.g.:

      test: Delete flaky test TEST::METHOD

      Deleted according to flaky test process:
      https://openedx.atlassian.net/wiki/spaces/AC/pages/4306337795/Flaky+Test+Process

      Flaky test ticket:
      - https://github.com/openedx/edx-platform/issues/11111

    2. Note that temp: could be used in place of test: for the commit type if you expect the test to be fixed and restored.

    3. If you have any helpful thoughts to add, feel free to comment on the PR.

  2. Place a link to the PR back in your Github issue:

    The test was [deleted in this PR|https://github.com/openedx/edx-platform/pull/99999].
  3. Get a passing build and at least 1 review before merging to master.

    1. Note: It would be best if the reviewer has some idea about the relative importance of the test and could help prioritize the ticket to fix the test.
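The commit message format from Step 2 can be generated with a small helper. This is a sketch under stated assumptions: the function name and the `expect_restore` flag are hypothetical, and the layout (subject line, blank line, process link, ticket list) is inferred from the example message above.

```python
PROCESS_URL = (
    "https://openedx.atlassian.net/wiki/spaces/AC/pages/4306337795/"
    "Flaky+Test+Process"
)


def flaky_commit_message(test_id, issue_urls, expect_restore=False):
    """Build a commit message for a PR that deletes a flaky test.

    `expect_restore` is a hypothetical flag: when True, the commit type
    is `temp:` instead of `test:`, as the note in Step 2 allows.
    """
    commit_type = "temp" if expect_restore else "test"
    lines = [
        f"{commit_type}: Delete flaky test {test_id}",
        "",
        "Deleted according to flaky test process:",
        PROCESS_URL,
        "",
        "Flaky test ticket:",
    ]
    # One bullet per ticket, so a single commit can reference several tests.
    lines += [f"- {url}" for url in issue_urls]
    return "\n".join(lines)
```

For example, `flaky_commit_message("TEST::METHOD", ["https://github.com/openedx/edx-platform/issues/11111"])` produces a message whose subject line starts with `test:`.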

Handling a flaky test Github issue

If you are reviewing a flaky test Github issue, you may want to consider the following:

  • Is the test necessary? For example, is it a Cypress test that makes more sense in a lower part of the test pyramid (e.g. Python or JavaScript unit tests)?

  • Is the test covering a longer flow where only one part is flaky? Maybe you can reduce the amount of the test that needs to be deleted.

  • If the test is of debatable usefulness, consider time-boxing the effort to fix and closing the ticket if it takes too long.