Coverage-guided Test Data Selection

Michael Seifert, 2021-05-08, Updated 2021-05-28

Selecting meaningful test data

There are several possibilities to select a meaningful subset from the test data. If the test data has an inherent structure, that structure can be leveraged to select subsets. The training data set for a machine learning image classifier, for example, will likely require test data for each image class it is supposed to recognize. Another approach is to run some statistics on the full data set and determine a representative subset. However, if you are unlucky you might miss that one specific corner case hidden in the data set. The following sections present an approach that uses code coverage metrics to determine a representative test data subset. The general idea is to generate one test case per data point, determine the code coverage produced by each test case, and find a smaller set of data points that results in the same amount of coverage. We will use pytest and coverage to achieve this goal.

Generating test cases with pytest

Let's assume we implemented a function that tests if a given integer is prime. I came up with an implementation that handles a bunch of special cases. The function first checks that the prime candidate is smaller than, equal to, or divisible by two and immediately identifies the number as prime or rejects it for each respective case. The function then tests for divisibility for all odd integers up to and including the square root of the number. If our potential prime is divisible by any of those, we can say for sure it is not prime. Otherwise, all checks have passed and the number is prime:

from math import sqrt

def is_prime(number: int) -> bool:
    if number < 2:
        return False
    if number == 2:
        return True
    if number % 2 == 0:
        return False
    last_possible_divisor = int(sqrt(number))
    for possible_divisor in range(3, last_possible_divisor + 1, 2):
        _, rest = divmod(number, possible_divisor)
        if rest == 0:
            return False
    return True

The function contains a number of if-else clauses. We want to test it thoroughly to make sure it works correctly. For that purpose, we got our hands on a data set which states whether a number is prime for all integers between 0 and 99. The individual files are named prime-xx.txt, where xx is a two digit integer with leading zeroes. Each file consists of two lines, the first line states the number and the second line is either True if the number is prime or False otherwise:


An straight-forward way to assert the correctness of our code is to write a single test case that checks the implementation against the full test set:

def test_is_prime_returns_correct_result(prime_files: Iterable[Path]):
    for prime_file in prime_files:
        lines = prime_file.read_text()
        number = lines[0].strip()
        expected_prime = lines[1].strip() == "True"
        assert is_prime(number) == expected_prime

This approach is not suitable for us, though, because the coverage produced by all files will be aggregated in a single test case. There is no way to determine which test file covers which lines. We can easily address the issue, though, by writing a small Pytest plugin that generates test cases from the test files. For this purpose, we create a file and implement the pytest_collect_file (opens new window) hook. The hook function is invoked for every file discovered in the test collection phase which is one of the first actions performed by Pytest. When Pytest is called without any arguments it starts in the current working directory and descends down into subdirectories. It will find our test files and invoke pytest_collect_file for each of them.

import py
import pytest

from prime import is_prime

def pytest_collect_file(parent, path: py.path.local):
    if path.basename.startswith("prime") and path.ext == ".txt":
        return PrimeTxtFile.from_parent(parent, fspath=path)

class PrimeTxtFile(pytest.File):
    def collect(self):
        lines = self.fspath.read_text("utf-8").splitlines()
        number = int(lines[0].strip())
        is_prime = lines[1].strip() == "True"
        yield PrimeTest.from_parent(self, name=f"is_prime({number})", number=number, expected_result=is_prime)

class PrimeTest(pytest.Item):
    def __init__(self, name, parent, number, expected_result):
        super().__init__(name, parent)
        self.number = number
        self.expected_result = expected_result

    def runtest(self):
        prime = is_prime(self.number)

        assert prime == self.expected_result

The hook implementation identifies our test files based on file name. If a test file is found, it delegates the work to the PrimeTxtFile collector. A collector represents a means of grouping test cases. The pytest_collect_file hook has to choose a single collector per file, but Pytest collectors can delegate work to sub-collectors, thus forming a hierarchy. That means, our test data could just as easily be read from a compressed archive or from a directory structure. It is usually a good idea to put file-format specific parsing logic into the collector.

The parsed data is eventually passed to our test case implementation PrimeTest, which inherits from pytest.Item. A test item represents an actual test case that can succeed or fail. Pytest should now create individual test cases for our test data set:

$ pytest -q
....................................................................................................                                                                                                                                                                   [100%]
100 passed in 0.12s

Note that Pytest now identifies 100 test cases as opposed to just a single one in our initial implementation.

Selecting a subset of tests based on coverage metrics

The tests finish very quickly in our contrived example, but this is not always the case. For the sake of this article, let's assume that test runtime is much longer than we can tolerate. We need to extract a representative subset from the test files. For this purpose, we will measure the code coverage produced by each test case and select a subset that produces the same amount of coverage as the full data set.

This can be achieved by leveraging's contexts (opens new window). A context is like a "namespace" for coverage data. Using contexts, we record the coverage data for each test separately. This allows us to isolate the effects of individual tests on code coverage. Contexts can be merged later to get a combined coverage report. Lucky for us, the pytest-cov plugin provides the --cov-context=test option (opens new window), which does exactly what we want:

$ pytest -q --cov=prime --cov-context=test
....................................................................................................                                                                                                                                                                   [100%]

----------- coverage: platform linux, python 3.8.9-final-0 -----------
Name       Stmts   Miss  Cover
------------------------------      14      0   100%
TOTAL         14      0   100%

100 passed in 0.24s uses an SQLite database (opens new window) to store the recorded data. In addition to that, the package provides a Python API (opens new window) to access the data:

from coverage import CoverageData

coverage_data = CoverageData()

contexts_with_maximum_coverage = set()
total_test_coverage = set()
for context in sorted(coverage_data.measured_contexts()):
    for measured_file in coverage_data.measured_files():
        coverage = set(coverage_data.lines(measured_file))
        if coverage.difference(total_test_coverage):

for context in sorted(contexts_with_maximum_coverage):

Remember that each test case is recorded in its own coverage context. Our algorthim tries to find a minimal set of test cases that produce a code coverage equal to the full test set. For that purpose, it builds up a test subset and tracks the amount of coverage produced by it. Starting from an empty set the algorithm iterates over each context and checks if the total amount of code covered increases when adding the current context to our minimal subset. If coverage increases, the current context is added to the minimal set and the coverage produced by the minimal set is updated accordingly. When all recorded test contexts have been processed, the algorithm yields a subset of test contexts which produce a code coverage equal to that of all tests:

$ python


The result shows that five test cases are enough to achieve 100% coverage of our primality test. Incidentally, each of the test cases corresponds to a code path leading to a different return statement in the is_prime function.

Limitations of coverage-guided test data selection

The lines of code under test form a finite set and each test case covers a subset of the lines of code. The goal of our algorithm is to find a minimal subset of test cases that covers the complete code. This problem statement is equivalent to the set-cover problem. The set-cover problem is NP-hard, which means there is no known algorithm that gives the optimal solution in polynomial time.

The algorithm from the previous section represents a heuristic that approximates a result. Even though we may not get the optimal solution from the heuristic, it runs in polynomial time. The number of loop iterations of the algorithm is bounded from above by the number of test cases times the number of files in the code base. All operations in the loop body, such as accessing the coverage data or computing the difference between two hash sets is most likely performed in polynomial time, too.[1]

In some cases, though, the heuristic might not be able to reduce the number of test cases. It is possible to construct a program with n lines of code where each test case covers one additional line compared to the previous one. For example, the first test case covers the first line, the second test case covers the first and the second line, and so on. In this specific case, our algorithm would yield the full set of test cases as a result.

Johnson (1973) (opens new window) and Cormen et al. (2001) (opens new window) present a different heuristic for the set-cover problem that gives a better approximation to the optimal solution. The main difference is that our algorithm chooses an arbitrary test case in each loop iteration, whereas their algorithm chooses the test case which will increase line coverage the most.

Lastly, it has to mentioned that the meaningfulness of code coverage metrics is debatable. This is true for the general case, but even more so in connection with Due to the way tracing historically worked in the Python interpreter, the library's measurements are line-based. Code branches created by ternary operators or short-circuiting operators are not recognized as branches, even when turning on branch coverage measurement. That means the subset of test cases determined by the algorithm above will miss a few code branches in a real code base. This is a known issue (opens new window), but will not likely be resolved anytime soon. The recently released PEP 657 (opens new window) is aiming to address the issue, though.

Wrap up

The article demonstrated a way to leverage code coverage metrics for the purpose of reducing the amount of tests in a test suite. It showed how to write a pytest plugin that generates test cases from a set of files with test data, and presented an algorithm that reduces the number of tests based on code coverage metrics that were obtained using The resulting subset of tests retained the same degree of code coverage as the full test set. We learned that selecting an appropriate subset of tests is an NP-hard problem, and that the algorithm does not yield an optimal result in all cases.

In future work it may be worth exploring different metrics for test selection. The limitations of code coverage metrics were already mentioned in the previous section and it is relatively easy to create ineffective tests that produce a lot of coverage. Rather than evaluting tests based on coverage metrics, we should gauge them based on their ability to detect failures in the program. One possibility is to inject failures in the code via mutation testing (opens new window) and select test cases based on the number of failures they can detect. I expect such a subset of test cases to be very robust.

  1. My analysis concluded that the worst case performance is O(log(|T|)|T|²), where T is the set of all test cases. ↩︎