
How do I know I have encountered a flaky test?

Tip
Before you type "jenkins run xxx", copy the link to the failed build just in case you need to file a ticket. Alternatively you can later go to the job in Jenkins (e.g. the edx-platform-bok-choy-pr job) and find the old builds for your PR using the search bar in the Build History pane.

If you have encountered a test that both fails and passes on the same commit in Jenkins, then the test is flaky.
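The definition above can be expressed as a small check over build results. The following is a minimal sketch, assuming you have already collected results as (commit SHA, test name, passed) tuples. The record layout and the function name `find_flaky_tests` are hypothetical and are not part of Jenkins or any existing tooling.

```python
from collections import defaultdict


def find_flaky_tests(results):
    """Return sorted (commit_sha, test_name) pairs that both passed and failed.

    `results` is an iterable of (commit_sha, test_name, passed) tuples.
    This layout is an assumption made for illustration only.
    """
    outcomes = defaultdict(set)
    for commit_sha, test_name, passed in results:
        outcomes[(commit_sha, test_name)].add(bool(passed))
    # A test is flaky on a commit if both a pass and a failure were observed.
    return sorted(key for key, seen in outcomes.items() if seen == {True, False})


example_results = [
    ("abc123", "TestClass.test_case", False),  # first build failed
    ("abc123", "TestClass.test_case", True),   # rerun on the same commit passed
    ("abc123", "TestClass.test_other", True),  # only passes: not flaky
    ("def456", "TestClass.test_case", False),  # only failures: not proven flaky
]
print(find_flaky_tests(example_results))
```

Note that a test which only fails on a given commit is not reported: by the definition above, it has to both fail and pass on the same commit to count as flaky.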

If the flakiness is related to the code changes in your PR:

Ensure you didn't introduce a new flaky test. If you did, fix it.
Ensure your code doesn't have a bug related to timing. If it does, fix the bug.
Ensure your code didn't cause a test to become flaky. If it did, either fix the test or follow this process as appropriate.

If the flaky test is unrelated to your code changes, follow the rest of this process in order.

Step 1: File a flaky bug ticket in Jira:

Check if someone is already following this flaky test process at the same time:
1. Search for an existing JIRA ticket about this issue. It's probably easiest to use the Known Flaky Tests JIRA query.
2. Search on GitHub to see if the test was already deleted.

Create (or update) a JIRA ticket with the following information:

File a ticket here
Summary field: Something like "<testclass testcase> fails intermittently"
Labels: At a minimum "flaky_test" (this will make it show up using the Known Flaky Tests JIRA query)
Platform Area: Platform Areas & Product Components: Full Listing (if you only know the platform area and not the product component, that's fine!)

Description: Include the following:

A link to a failed build in Jenkins.
1. Pin the build by pressing “Keep this build forever”; otherwise it will soon disappear.
A link to a passing build in Jenkins for the same commit.
The Error Message and Stacktrace for a failed test.
1. You should include this text in the JIRA ticket so it is searchable to others.
2. Wrap each of these in a `{noformat}` macro so it shows up nicely in JIRA.
(optional but super, super helpful) For BokChoy tests, include the screenshot that was captured at the time that the test failed. To find this:
1. In the center pane of your Jenkins build, look at the section entitled "Test Result". Each failure will have a line in the following format: "Run tests / <number> / <test name>". The number references the shard on which this test was run.
2. Click the "Build Artifacts" link in the center.
3. Navigate to "test_root/log/shard_<shard number>".
4. Download the PNG and attach it to the JIRA ticket.

Putting it all together you'll be entering something like this:

Code Block

This test fails intermittently and has been removed from the codebase.

For triaging this bug:
* Until proven one way or the other, it could be either the code that is flaky or the test that is flaky.
* The test has been removed from the codebase, and thus the functionality is no longer covered by a bok-choy test.
** Evaluate whether it is or can be covered by a lower-level test (e.g. a python or JS unit test), in which case the bok-choy test was unnecessary, or whether there is now a risk that a bug in the code could escape, in which case the test should be fixed and re-enabled.

TestClass:test_case failed in [this build on jenkins | https://build.testeng.edx.org/job/edx-platform-all-tests-master-flow/999/] and passed in the [subsequent master build | {link}].

Error Message
{noformat}
Whatever the error message was for the testcase.
{noformat}

Stacktrace
{noformat}
The stacktrace from the test result.
{noformat}
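If you file these tickets often, the build links, error message and stacktrace portion of the template above can be assembled programmatically. This is only a sketch: the function name `build_ticket_description`, its parameters, and the example URLs are hypothetical, and it leaves out the triage notes at the top of the template.

```python
def build_ticket_description(test_id, failed_url, passed_url, error_message, stacktrace):
    """Assemble part of a Jira description following the template above.

    All names here are illustrative assumptions, not an existing tool.
    """
    def noformat(text):
        # Wrap text in Jira's {noformat} macro so it renders verbatim.
        return "{noformat}\n" + text.strip("\n") + "\n{noformat}"

    return "\n".join([
        f"{test_id} failed in [this build on jenkins|{failed_url}] "
        f"and passed in [this build for the same commit|{passed_url}].",
        "",
        "Error Message",
        noformat(error_message),
        "",
        "Stacktrace",
        noformat(stacktrace),
    ])


description = build_ticket_description(
    test_id="TestClass:test_case",
    failed_url="https://jenkins.example.com/job/some-job/1/",  # placeholder URL
    passed_url="https://jenkins.example.com/job/some-job/2/",  # placeholder URL
    error_message="Whatever the error message was for the testcase.",
    stacktrace="The stacktrace from the test result.",
)
print(description)
```

Keeping the error message and stacktrace as plain text inside `{noformat}` is what makes the ticket searchable for others who hit the same failure.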

Step 2: Make a PR deleting the flaky test(s):

Delete the flaky test(s) in a single commit in a new PR (so that others can easily cherry-pick it).

Include the Jira ticket ID in your commit message, e.g.:

Code Block
test: Delete flaky test TEST::METHOD

Deleted according to flaky test process:
https://2u-internal.atlassian.net/wiki/spaces/TE/pages/12812492/Flaky+Test+Process

Flaky test ticket: CR-99999999

Note that you can use temp: in place of test: as the commit type if you expect the test to be fixed and restored.
If you have any helpful thoughts to add, feel free to comment on the PR.

Place a link to the PR back in your Jira ticket:

Code Block
The test was [deleted in this PR|https://github.com/edx/edx-platform/pull/99999].

Get a passing build and at least 1 review before merging to master.
1. Note: It would be best if the reviewer has some idea about the relative importance of the test and could help prioritize the ticket to fix the test.

Handling a flaky test JIRA ticket

If you are reviewing a flaky test JIRA ticket, you may want to consider the following:

Is the test necessary? For example, is it a bok-choy test that makes more sense in a lower part of the pyramid (e.g. python or javascript unit tests)?
Especially in bokchoy, is the test covering a longer flow where only one part is flaky? Maybe you can reduce the amount of the test that needs to be deleted.
If the test is of debatable usefulness, consider time-boxing the effort to fix and closing the ticket if it takes too long.

When it’s no longer needed, unpin the Jenkins build by pressing “Don’t keep this build forever”.

Tips for fixing flaky bok-choy tests

See Why is my bok choy test flaky? for some common root causes of flakiness.
See Examples of Fixes for Flaky Tests for some examples of how flaky tests have been fixed in the past. Feel free to add to this.
Use the flaky decorator to run the test multiple times in a row (like below), using the min_passes and max_runs options. This will allow you to run them multiple times locally more quickly, or on Jenkins with only one build. Do not forget to remove the decorator before merging though.
If you are running this locally, and have been working in the codebase for a while, you might remember we used to need to run with the paver argument --extra_args=--with-flaky. This is no longer necessary because we have switched from nose to pytest for the test runner.
Depending on how a test was written, setup-related issues can cause a second attempt at the same test case to always fail. To check whether this is the case, temporarily instruct the test to run twice, with something like @flaky(max_runs=2, min_passes=2), and run it in devstack.
It is suggested that you update and write tests using a browser other than Firefox. Note that Chrome and PhantomJS come for free with devstack, but there are a couple of gotchas.
For a fast debug/make changes/test changes cycle, use these tips on running in Pycharm.

Code Block
# TODO: Remove flaky decorator before merging to master.
@flaky(max_runs=20, min_passes=20)

The @skip decorator

The @skip decorator is appropriate only in rare cases. For example, tests might be skipped when they don't apply to a subclass, as in this sample code.
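As a rough illustration of the subclass case, the sketch below uses the standard library's `unittest.skip`. The class and test names are hypothetical, and this is not the sample code referenced above.

```python
import unittest


class BaseWidgetTest(unittest.TestCase):
    """Hypothetical base class whose tests are inherited by subclasses."""

    supports_export = True

    def test_render(self):
        self.assertTrue(True)

    def test_export(self):
        self.assertTrue(self.supports_export)


class ReadOnlyWidgetTest(BaseWidgetTest):
    """Subclass for which the inherited export test does not apply."""

    supports_export = False

    @unittest.skip("Export does not apply to read-only widgets")
    def test_export(self):
        pass


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReadOnlyWidgetTest)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), result.wasSuccessful())
```

The skip is permanent and carries a reason explaining why the test can never apply to the subclass, which is different from skipping a test because it currently fails.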

For all other cases, you probably want to follow a similar process to this flaky test process and delete the test from our codebase.

Splunk report for test flakiness

You will need access to splunk.edx.org.
- If you don't already have it, please file an IT support ticket.
- Note that splunk.edx.org is only available when connected to the aws (admin) VPN. For more information about this VPN see https://2u-internal.atlassian.net/wiki/display/EdxOps/How+to+Connect+to+the+edX+VPN
The Provably Flaky Test Methods Report is currently broken. (When working, it is useful for finding out if a test both passed and failed on the same commit SHA.)
- In the interim you can use this report (but be gentle with the time range; any more than 24 hours is likely to exceed Splunk's memory limits)
