
Comment by hansvm

3 months ago

When flakes are caused by bugs, the asymptotics are much worse in large software projects. Ignoring context and individualized distributions for each test for simplicity (in the spirit of the blog post), any given unit of buggy code has some probability p of causing a flake. When that code is used n times (or, more abhorrently, when there are n levels of p-flaky code) in the process, you have a 1-(1-p)^n chance of a given test failing. Apply the flake equation, and you get a runtime of:

O(f exp((1-(1-p)^n) f))

Your worst-case asymptote is still O(f exp(f)), but you hit it incredibly quickly because 1-(1-p)^n approaches 1 exponentially fast in n. A single additional layer of abstraction (calling the same buggy subroutines a few more times) is all it takes to tip the scales from a manageable testing environment to a flaky time-suck.
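
To see how fast that bites, here's a minimal sketch (the p, f, and n values are illustrative, not from the post) of the effective per-test flake probability and the exponential factor from the flake equation as a p-flaky unit gets reused n times:

    import math

    # Illustrative numbers only: p is the per-use flake probability of one
    # buggy unit, f plays the role of f in the flake equation.
    p = 0.001
    f = 50

    for n in (1, 10, 100, 1000):
        p_eff = 1 - (1 - p) ** n      # chance a test hits the bug at least once
        blowup = math.exp(p_eff * f)  # the exp((1-(1-p)^n) * f) factor
        print(f"n={n:>4}  p_eff={p_eff:.3f}  exp factor ~ {blowup:,.1f}")

With these made-up numbers, a hundred reuses of a 0.1%-flaky unit already inflates the exponential factor by about two orders of magnitude, and a thousand reuses puts it into the trillions.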

It's not just in testing. If your software has those sorts of problems and isn't designed from the ground up to limit the impact of errors (the way the shuttle program was built hierarchically out of redundant subsystems, so the flakiness of individual parts didn't have an exponential impact on the flakiness of the composite system), the same equations dictate that there's a critical threshold beyond which you have customer-facing bugs you can no longer ignore.

The author argues a bit against reducing the baseline flakiness (it requires expensive engineers chasing an unbounded set of potential issues). That's probably a reasonable stance for cosmic rays, rowhammer, and whatnot for most software applications. Compared to writing the code in the first place, though, it's usually very cheap to also (1) name the function well, (2) make the function total (behaving reasonably for all inputs allowed by the type system), and (3) add a couple of tests for that function. The extra development velocity you get from not building on shaky foundations is usually substantial in comparison. The second you have to worry about a transitive bug in the stack, you have an exponential explosion in code paths to examine whenever the feature you're building has any sort of reliability requirements, and those exponential costs add up quickly compared to a flat +10%, +30%, +100%, ... in time to commit each solid unit of code.
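
As a concrete sketch of point (2) (a hypothetical function, not from the post), "total" just means the function does something sensible for every input its signature admits instead of throwing on the awkward cases:

    def parse_port(value: str) -> int | None:
        # Total over all str inputs: returns a valid port number or None,
        # never raises on malformed or out-of-range input.
        try:
            port = int(value.strip())
        except ValueError:
            return None
        return port if 0 < port < 65536 else None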