The Tyranny of the Flake Equation

9 days ago (moderndescartes.com)

When I was still a student I worked mostly with Java and had to solve a simple word frequency problem for a series of assignments. For the first several assignments everything worked as expected, and then one case in the test suite kept crashing the JVM or freezing it entirely.

It was then that I learned that String is immutable in Java, and that when I tried to ingest a novel into a string one token at a time with str += token; I was destroying and recreating the entire string each time, which was fine for the small cases but had a hidden exponential effect.

It was then I learned about StringBuilder and laughed that such a seemingly innocent line would contain such a large footgun. As a student, I had absolutely no way of learning about language semantics like that except through that mistake.
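For concreteness, here is a minimal sketch of the two approaches (illustrative code, not the original assignment): += copies the whole accumulated string on every append, while StringBuilder appends into a growable internal buffer.

    import java.util.List;

    class ConcatDemo {
        public static void main(String[] args) {
            // Stand-in for a novel's token stream.
            List<String> tokens = List.of("the ", "quick ", "brown ", "fox ");

            // Each += allocates a new String and copies everything built so far,
            // so the cost of a single append grows with the length of the text.
            String slow = "";
            for (String token : tokens) {
                slow += token;
            }

            // StringBuilder appends into a resizable buffer, so each token is
            // copied only a small, bounded number of times.
            StringBuilder sb = new StringBuilder();
            for (String token : tokens) {
                sb.append(token);
            }
            String fast = sb.toString();

            System.out.println(slow.equals(fast)); // true
        }
    }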

  • > had a hidden exponential effect

    Quadratic, I believe. You create a sequence of strings of length 1, 2, 3, ..., n, so the total copying is about n^2/2 characters. Of course it isn't that simple (memory alignment and all that), but it doesn't look exponential.

    • Good point, but this is actually a very interesting problem that has me digging through the implementation of String again.

  • I also made this mistake once upon a time.

    I like this since it's such a good example of how bad API design can mislead the developer and lead to subtle mistakes with substantial consequences. It is right for the JVM to have special handling for strings, given their ubiquity and unique characteristics, but that handling should be reflected in the API design so it doesn't encourage horrible mistakes like this one.

When flakes are caused by bugs, the asymptotics are much worse in large software projects. Ignoring context and individualized distributions for each test for simplicity (in the spirit of the blog post), any given unit of buggy code has some probability p of causing a flake. When that code is used n times (or, more abhorrently, when there are n levels of p-flaky code) in the process, you have a 1-(1-p)^n chance of a given test failing. Apply the flake equation, and you get a runtime of:

O(f exp((1-(1-p)^n) f))

Your worst-case asymptote is still O(f exp(f)), but you hit it incredibly quickly because of the extra exponential dependence on n: the per-test failure probability 1-(1-p)^n rushes toward 1 as n grows. A single additional layer of abstraction (calling the same buggy subroutines a few more times) is all it takes to tip the scales from a manageable testing environment to a flaky time-suck.
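A quick numerical sketch of that compounding (the numbers are illustrative, not from the post): even a 0.1% per-call flake rate becomes a roughly 63% per-test failure rate once the flaky code sits on a thousand code paths.

    class FlakeCompounding {
        // Probability that at least one of n independent uses of p-flaky code fails.
        static double compounded(double p, int n) {
            return 1.0 - Math.pow(1.0 - p, n);
        }

        public static void main(String[] args) {
            double p = 0.001; // illustrative per-call flake probability
            for (int n : new int[] {1, 10, 100, 1000, 10000}) {
                System.out.printf("n = %5d -> failure probability %.4f%n", n, compounded(p, n));
            }
            // n =     1 -> failure probability 0.0010
            // n =    10 -> failure probability 0.0100
            // n =   100 -> failure probability 0.0952
            // n =  1000 -> failure probability 0.6323
            // n = 10000 -> failure probability 1.0000
        }
    }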

It's not just in testing. If your software has those sorts of problems and isn't designed from the ground up to reduce the impact of errors (the way the shuttle program was built hierarchically out of redundant subsystems, so that the flakiness of individual parts didn't have an exponential impact on the flakiness of the composite system), the same equations dictate that you'll have a critical threshold beyond which you have customer-facing bugs you can no longer ignore.

The author argues a bit against reducing the baseline flakiness (on the grounds that it requires expensive engineers chasing an unbounded set of potential issues). That's probably a reasonable stance on cosmic rays, rowhammer, and whatnot for most software applications. Compared to writing the code in the first place, though, it's usually very cheap to also (1) name the function well, (2) make the function total (behaving reasonably for all inputs allowed by the type system), and (3) add a couple of tests for that function. The extra development velocity you get from not building on shaky foundations is usually substantial in comparison. The second you have to worry about a transitive bug in the stack, you have an exponential explosion in code paths to examine whenever the feature you're building has any sort of reliability requirements, and those exponential costs add up quickly compared to a flat +10%, +30%, +100%, ... in time to commit each solid unit of code.
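As a sketch of point (2) in Java terms (my reading of "total", not code from the comment): let the signature admit every input the types allow, so callers can't forget the awkward case.

    import java.util.List;
    import java.util.OptionalDouble;

    class TotalFunctionSketch {
        // Partial: the type system allows an empty list, but the result for it
        // is a silent NaN that can surface far away as a flaky test.
        static double meanPartial(List<Double> xs) {
            double sum = 0;
            for (double x : xs) sum += x;
            return sum / xs.size();
        }

        // Total: the signature itself says "there may be no answer",
        // so every caller has to handle the empty case up front.
        static OptionalDouble meanTotal(List<Double> xs) {
            if (xs.isEmpty()) return OptionalDouble.empty();
            double sum = 0;
            for (double x : xs) sum += x;
            return OptionalDouble.of(sum / xs.size());
        }

        public static void main(String[] args) {
            System.out.println(meanTotal(List.of()));         // OptionalDouble.empty
            System.out.println(meanTotal(List.of(1.0, 3.0))); // OptionalDouble[2.0]
        }
    }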

Just like a basic part of knife safety is "don't grab the pointy end", a basic part of computing is "don't grab the exponential end".

(recognising when you might have a hold of an exponential by the wrong end can be done by experience, experiment, or exercise of theory, so there isn't much of an excuse for continuing to do it)

One trick that works surprisingly often when you don't understand the root cause: scale the number of batches, rather than the batch size (think merge sort or map/reduce).

If 10% of your batches flake out, you rerun them and expect 1% to have to be run a third time, etc., giving you an O((batch run time)*log(number of batches)) expected wait until you can merge/reduce. Making the batches smaller means you have a shorter wait to merge/reduce (but also possibly worse merge time, so it doesn't always work; YMMV, etc.).
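A toy simulation of that retry loop under the assumptions above (independent flakes, a 10% per-batch flake rate, all remaining batches rerun in parallel each round; the names and numbers are illustrative):

    import java.util.Random;

    class BatchRetrySketch {
        public static void main(String[] args) {
            Random rng = new Random(42);
            int batches = 1024;        // illustrative number of batches
            double flakeRate = 0.10;   // illustrative per-batch flake probability

            // Rerun only the flaky batches each round until every batch has passed.
            int remaining = batches;
            int rounds = 0;
            while (remaining > 0) {
                rounds++;
                int stillFlaky = 0;
                for (int i = 0; i < remaining; i++) {
                    if (rng.nextDouble() < flakeRate) stillFlaky++;
                }
                remaining = stillFlaky;
            }

            // With each round's batches running in parallel, the expected number of
            // rounds grows like log(batches) / log(1 / flakeRate), so the wall-clock
            // wait before the merge/reduce step is O(batchRunTime * log(batches)).
            System.out.println("rounds until every batch passed: " + rounds);
        }
    }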

Not sure about the equation at the end. Looks like you may have incorrectly thought that e^(x/y) = e^x / e^y.

I get wanting to get a sense of the runtime compared to the time spent checkpointing, but the equation as stated is wrong. A better one is

Ne^(l/f) / Cf = f/l

That basically means the ratio between checkpointing and runtime (including retries) should be inversely proportional to the ratio of checkpoints to flakes.

Also, really, you're calling the flakiness parameter lambda and the checkpointing parameter f?

Is this really a Poisson process?

Assuming that every job succeeds with the same probability -> Bernoulli process.

The probability of overall success would still be exponential (p^n).

This is why Erlang/OTP is so effective: it pushes you toward error (flake) recovery and idempotency.