OP8: Is Redundancy Avoidable?

Cristian Zamfir

Redundancy occurs when parallelizing any exploration task, such as symbolic execution; this holds for any real-world program that uses multi-threading and has a complex control flow graph.

One example of redundancy is due to memory that is written but never read again. We call this situation an unused memory diff (UMD). For instance, code with complex control and data flow can lead to a situation in which, even though the memory of the program differs between two states explored on different workers, the two states generate identical subtrees. The reason is that the memory differences between the two states do not influence the control flow of the program.
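The following toy sketch illustrates a UMD (the program, variable names, and state representation are hypothetical, chosen only for illustration): two states differ solely in a variable that is written but never read again, so exploring either of them yields the same subtree.

```python
def explored_subtree(state):
    # Only 'x' influences control flow; 'scratch' is dead memory
    # (written earlier, never read again).
    if state["x"] > 0:
        return "positive-branch"
    return "non-positive-branch"

state_a = {"x": 1, "scratch": 100}  # state on worker 1
state_b = {"x": 1, "scratch": 200}  # state on worker 2: differs only in dead memory

# The memories differ, yet both states generate identical subtrees.
assert state_a != state_b
assert explored_subtree(state_a) == explored_subtree(state_b)
```

A byte-for-byte comparison of the two memories would report the states as distinct, which is exactly why a naive state hash misses this form of redundancy.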

Another example is due to different constraints that accumulate on two different paths that end up creating the same state. For instance, going through a loop two or three times could create the same memory in two different states, but these states cannot be merged because the number of times the loop is executed generates different constraints on the program inputs. Finally, one other source of duplication comes from exploring commutative thread interleavings.
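A minimal sketch of the loop scenario, with path constraints modeled as strings over a symbolic input `n` (a toy model, not the actual constraint representation): two different iteration counts leave identical memory but carry different accumulated constraints.

```python
def run_loop(n_iterations):
    """Model executing a loop n_iterations times, collecting one path
    constraint on the symbolic input 'n' per iteration."""
    constraints = []
    i = 0
    while i < n_iterations:
        constraints.append(f"n > {i}")  # each iteration constrains input n
        i += 1
    memory = {"i": 0}  # the counter is reset afterwards, so final memory is identical
    return memory, constraints

mem2, c2 = run_loop(2)
mem3, c3 = run_loop(3)
assert mem2 == mem3  # identical memory...
assert c2 != c3      # ...but different constraints on the program input
```

Because the constraint sets differ, a solver would consider the two states semantically distinct, and merging them would be unsound.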

However, redundancy can be avoided using the following proposed technique, called lazy state redundancy detection. The main idea is to eliminate most (but not necessarily all) redundant work with as little coordination as possible. We use a distributed hash table (DHT) optimized for bulk key lookups (that is, multiple key lookups can be performed efficiently in a single request to the DHT). The DHT keeps signatures of the states explored by each worker. When updating the DHT, a worker can check whether any of the states it currently explores are also being explored by another worker.
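The DHT interface assumed above can be sketched as follows; the class and method names are illustrative (this is an in-memory stand-in, not a real distributed implementation):

```python
class SignatureDHT:
    """Toy stand-in for a DHT keyed by state signatures, supporting
    bulk insertion and bulk membership lookups in one request each."""

    def __init__(self):
        self._owners = {}  # signature -> id of the worker that first stored it

    def bulk_put(self, worker_id, signatures):
        """Store this worker's state signatures in a single request."""
        for sig in signatures:
            self._owners.setdefault(sig, worker_id)

    def bulk_lookup(self, worker_id, signatures):
        """Return the signatures already registered by *another* worker."""
        return [s for s in signatures
                if s in self._owners and self._owners[s] != worker_id]

dht = SignatureDHT()
dht.bulk_put("worker-1", ["sigA", "sigB"])
dht.bulk_put("worker-2", ["sigB", "sigC"])
duplicates = dht.bulk_lookup("worker-2", ["sigB", "sigC"])
assert duplicates == ["sigB"]  # sigB was first registered by worker-1
```

Batching the lookups is what keeps the coordination cost low: a worker pays one round-trip per update cycle rather than one per state.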

A signature is designed to be small and to overcome the limitations outlined previously (i.e., UMD). It contains samples of the program counter, memory areas, and constraints on the program input, taken at well-established intervals (details about how these intervals are established are beyond the scope of this OP). These values are hashed to form a signature, which is then uploaded to the DHT.
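A signature computation along these lines might look like the sketch below; the trace representation and the fixed sampling interval are assumptions for illustration, not the paper's actual scheme:

```python
import hashlib

def compute_signature(trace, interval=4):
    """trace: a list of (pc, memory_snapshot, constraints) tuples recorded
    during exploration. Only every `interval`-th entry is sampled and fed
    into the hash, so differences at unsampled steps are ignored."""
    h = hashlib.sha256()
    for step, (pc, memory, constraints) in enumerate(trace):
        if step % interval == 0:
            h.update(str(pc).encode())
            h.update(repr(sorted(memory.items())).encode())
            h.update(repr(sorted(constraints)).encode())
    return h.hexdigest()

trace_a = [(pc, {"x": 1}, ["n > 0"]) for pc in range(8)]
trace_b = list(trace_a)
trace_b[1] = (99, {"x": 2}, ["n > 5"])  # differs only at an unsampled step
assert compute_signature(trace_a) == compute_signature(trace_b)
```

Note how the two traces hash to the same signature because they differ only at a step the sampler skips; this is the mechanism that lets signatures tolerate differences such as UMDs.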

The more samples are taken, the more precise the signature is; in the extreme case, when a state is sampled at every instruction, UMD will not be detected. Conversely, if too few samples are taken, many distinct states could appear similar. There is therefore a tradeoff between efficiency and completeness of the exploration. We can think of these signatures as a randomized hash: since most states differ greatly from each other, their randomized hashes are very likely to differ as well, so a signature can be used to detect redundancy efficiently.

Lazy state redundancy detection works by periodically updating signatures in the DHT and checking whether any of the currently explored states are likely duplicates. A duplicate state can be removed from the execution tree of one of the workers, or de-prioritized by the search strategy. In the latter case, no unsound pruning is performed.
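One round of this periodic check can be sketched as follows; the DHT is modeled as a plain dictionary and all names are illustrative (a sketch under these assumptions, not the actual implementation):

```python
seen = {}  # toy DHT: signature -> worker that first registered it

def dht_put(worker_id, sigs):
    for s in sigs:
        seen.setdefault(s, worker_id)

def dht_lookup(worker_id, sigs):
    return [s for s in sigs if seen.get(s, worker_id) != worker_id]

def lazy_redundancy_round(worker_id, states, signature_of, deprioritize):
    """One periodic round: upload this worker's signatures and
    de-prioritize states already explored elsewhere. De-prioritizing
    (rather than pruning) keeps the exploration sound."""
    sigs = {signature_of(s): s for s in states}
    dups = dht_lookup(worker_id, list(sigs))  # one bulk request
    dht_put(worker_id, list(sigs))            # one bulk update
    for sig in dups:
        deprioritize(sigs[sig])               # de-prioritized, not removed
    return len(dups)

deprioritized = []
dht_put("w1", ["s1", "s2"])
n = lazy_redundancy_round("w2", ["s2", "s3"],
                          signature_of=lambda s: s,
                          deprioritize=deprioritized.append)
assert n == 1 and deprioritized == ["s2"]
```

Because the check is lazy (once per round, in bulk) rather than on every state creation, a few duplicates may slip through between rounds, which is exactly the "most but not necessarily all" tradeoff stated above.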