OP1: A Smarter Quality Metric for Testing

Jonas Wagner

The goal of a test quality metric is to quantify to what degree a test suite covers all interesting events that can occur in a program. The difficulty is in defining what an "interesting" event is.

For classic test quality metrics, the execution of a basic block is the only interesting event. This makes a lot of sense, because (1) every block in a program should be reachable, and (2) some bugs manifest themselves unconditionally when a particular block is executed.

There are, however, other interesting cases that block-based metrics do not take into account; for example:

  1. Each pointer dereference should be checked. A good test suite should verify that the program uses pointers in a safe way.
  2. Similarly, a test should check that programs' array accesses are checked to ensure that the index is within the bound of the array.
  3. Operations should be checked for illegal arguments, e.g., by ensuring that a divisor can never be zero.

For each of those criteria, there can be three possible states. First, there can be a test that proves that the interesting event does occur. For example, the test suite may contain a set of inputs that leads to a pointer being zero. Second, the state of an interesting event may be unknown. This is the case when the test suite does not contain any test for which the event happens. The third case is when we are sure that the event can never happen. This is difficult: it can only be achieved by exhaustive testing of all possible program executions, or by using proof techniques that ensure the event can never be produced.

With these points in mind, I propose the following quality metric: First, the users define what events are interesting for their project. This may depend on the programming language used, on project requirements, or on the features of the test toolkit. Then the test system counts the number of places in the program where these events might happen. Finally, it runs the test suite and displays the fraction of these event locations that, as a result of running the tests, (1) caused an event to occur, (2) did not occur, and (3) were proven to never occur.

In contrast to metrics based on symbolic execution, this metric has one advantage: for all the interesting events (basic blocks, illegal pointer dereferences, illegal array accesses, illegal arguments, etc.), the number of program locations where these could occur is finite and linear in the size of the program. This makes the fraction of event locations with a certain property a meaningful quantity. On the other hand, the number of input equivalence classes or the number of possible paths through a program is exponential in program size or even unbounded, makin such metrics complex and hard to understand.