For detecting complex expressions, we provide the following heuristics:
1. solver timeouts. Each time a solver takes too much time, we will
decide if the current function seems to be the culprit and we will
mark it as a candidate for concretization. We will use statistical
methods for outlier detection (a simple threshold will not work as it
will start cutting off all long paths). In case our engine is
EXE-like, we can track solver time for prefixes of each path and just
see when the complexity starts to grow more rapidly. In case of
DART-style symbex, our proposal is to fire parallel solver queries at
function boundaries if the original query takes too much time.
2. static analysis of code. One of the ideas here is to capture
different features from source code expressions (number of
multiplications, size of expressions, mixing of variables, etc.) and
try to use machine learning to recognize hard-to-solve patterns. We
will collect samples from real code, run the solver on time and learn
the expected time. Another alternative is to take an SMT solver, feed
the expressions to it and get the size of the problem passed to an
underlying SAT. The intuition here is that SMT solvers try to optimize
SAT queries as much as possible before sending them to the SAT solver
and thus the size passed to the SAT solver is a good estimate of
hardness. For path-exploding functions we suggest to calculate function
summaries first. If during the summary generation we observe too many
paths with different outcomes, the function is a good candidate. In
practice, a single threshold will not work and we suggest to use
thresholds dependent on the nesting level of functions. The second part of the OP asks for capturing all input-output pairs,
but that is clearly infeasible. Thus, we suggest sampling the
function. One simple way is to sample random inputs and corresponding
outputs. A more clever way is to determine all paths up to some depth
and then sample independently on each path. Sampling on a path
corresponds to solving the path constraint combined with constraints to force new
solutions. As the solutions can be heavily influenced by the SMT solver,
we may moreover try to fix a few random bits of the input/output to obtain a more uniform distribution. To summarize, we believe that function concretization is beneficial
for functions creating too much complexity by creating either complex
expressions or too many paths. For both of these cases we provide
suitable techniques, which can be applied to perform an almost
fully-automatic analysis. |