Test Multi-Execution

Editorial Note: this is a follow-up to my earlier Principles of Test Oriented Software Development post.

In software development, we write tests to make sure the code we write does what we want it to do. Great, this is pretty easy to get behind.

Tests sometimes fail.

The goal is that, most of the time when tests fail, it’s because the code is broken: you fix the code and the test passes. Sometimes when tests fail there’s a bug in the test: it makes an assertion that can’t or shouldn’t be true. These are bad because they mean the test is broken, but all code has bugs, and test code is code too, so that’s fine.

Ideally tests either pass or fail, and if a test fails it fails repeatedly, with the same error. Unfortunately, this is of course not always true. Tests can fail intermittently if they test something that can change, or if the outcome of the test is impacted by some external factor, like “the test passes if the processor is very fast and the system does not have IO contention, but fails sometimes as the system slows down.” Sometimes tests include (intentionally or not) some notion of “randomness,” and fail intermittently because of this.
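To make the “passes on a fast system, fails on a slow one” case concrete, here is a minimal sketch. Everything in it is hypothetical: `finishes_within` stands in for a test with a time budget, and `FakeClock` is a made-up substitute for a real clock so that both outcomes can be shown without depending on the machine this runs on.

```python
def finishes_within(work, budget_seconds, clock):
    """Return True if work() completes within budget_seconds, as measured by clock."""
    start = clock()
    work()
    return clock() - start <= budget_seconds


class FakeClock:
    """Hypothetical clock that advances by a fixed step every time it is read."""

    def __init__(self, step):
        self.now = 0.0
        self.step = step

    def __call__(self):
        self.now += self.step
        return self.now


# Same code under test and same assertion; only the "speed" of the system differs.
fast_machine = finishes_within(lambda: None, 0.5, FakeClock(step=0.1))
slow_machine = finishes_within(lambda: None, 0.5, FakeClock(step=0.9))
```

With a real clock the step is not under your control, which is exactly why a test like this passes on some runs and fails on others.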

A test suite with intermittent failures is basically the worst. A suite that never fails isn’t super valuable, because it probably builds false confidence. A test suite that always fails isn’t useful, because developers will ignore the results or disable the tests. But a test that fails intermittently, particularly one that fails 10 or 20 percent of the time, means that developers will always have to look at the test, or will just rerun it until it passes.

There are a couple of things you can do to fix your tests:

write better tests: find sources of non-determinism in your tests and rewrite them to avoid these kinds of “flaky” outcomes. Sometimes this means restructuring your tests in a more “pyramid-like” structure, with more unit tests and fewer integration tests (which are likely to be less deterministic).
run tests more reliably: find ways of running your test suite that produce more consistent results. This means running tests in more isolated environments, changing the amount of test parallelism, ensuring that tests clean up their environment before they run, and keeping tests as logically isolated as possible.
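As one small illustration of removing a source of non-determinism, the sketch below passes the random source into the code under test rather than letting it reach for global randomness. The function `shuffle_batch` and the seed value are invented for the example; the only real API used is the standard library's `random.Random`.

```python
import random


def shuffle_batch(items, rng):
    """Return a shuffled copy of items, using the supplied random source."""
    batch = list(items)
    rng.shuffle(batch)
    return batch


def test_shuffle_is_repeatable():
    # A fixed seed makes the "random" order identical on every run,
    # so the assertions below cannot fail intermittently.
    first = shuffle_batch(range(10), random.Random(1234))
    second = shuffle_batch(range(10), random.Random(1234))
    assert first == second
    # Shuffling must not add, drop, or duplicate elements.
    assert sorted(first) == list(range(10))
```

The same idea applies to clocks, network calls, and anything else the test does not control: hand it in as an argument so the test can supply a predictable version.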

But it’s hard to find these tests, and you can end up playing whack-a-mole with dodgy tests for a long time. The urge to just run the tests a second (or third) time to get them to pass, so you can merge your change and move on with your work, is tempting. This leaves:

run tests multiple times: so that a test doesn’t pass until it passes multiple times. Many test runners have some kind of repeated execution mode, and if you can combine it with some kind of “stop executing after the first fail” option, then this can be reasonably efficient. Use multiple execution to force the tests to produce more reliable results, rather than to cover up or exacerbate the flakiness.
run fewer tests: it’s great to have a regression suite, but if you have unreliable tests, and you can’t use the multi-execution hack to smoke out your bad tests, then running a really full matrix of tests is just going to produce more failures. That means you’ll spend more of your time looking at tests, in non-systematic ways, which is unlikely to actually improve the code.
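The multi-execution idea can be sketched in a few lines. This is not any particular test runner's implementation; `run_until_first_failure` and `chance_of_slipping_through` are hypothetical helpers, and the probability calculation assumes each run fails independently at a fixed rate, which real flaky tests may not do.

```python
def run_until_first_failure(test_fn, times=5):
    """Run test_fn up to `times` times, stopping at the first failure.

    Returns (passed, runs), where runs is the number of executions performed.
    """
    for attempt in range(1, times + 1):
        try:
            test_fn()
        except AssertionError:
            # Stop early: one failure is enough to reject the test.
            return False, attempt
    return True, times


def chance_of_slipping_through(failure_rate, times):
    """Probability that a test passes `times` runs in a row.

    Assumes every run fails independently with probability failure_rate.
    """
    return (1.0 - failure_rate) ** times
```

Under that independence assumption, a test that fails 10 percent of the time passes a single run 90 percent of the time, but passes five runs in a row only about 59 percent of the time (0.9 to the fifth power is 0.59049). Repeating the run makes the flakiness much harder to ignore, and stopping at the first failure keeps the extra cost low for tests that are actually broken.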