Out of the Tar Pit (2006) [pdf]

1 year ago (curtclifton.net)

This paper was very influential on me when I first started programming professionally around 2012. I don't plan on reading it again, but my vague memory of what I got out of it is pretty simple and I think has become pretty standard practice at this point: avoid mutable state and use pure functions where possible.

The framing of accidental and essential complexity is of course very useful and not really unique to this paper. The difficulty is that there is nothing but experience-informed judgment to tell you which complexity is accidental and which is essential. There is always a framing of the problem that can justify some bit of complexity as essential. You have to judge.

  • Plenty of state -- even mutable state -- is essential complexity. Think of an IDE where a user is typing out code: the state is changing all the time. Pure functional programming has a hard time walking the line between "avoiding" mutable state, and "ignoring" mutable state. If you insist on functional purity, you're already admitting defeat in the face of essential mutable state. The walls of functional (and especially reactive) programming are too high -- they currently don't play very well with their Turing and von Neumann neighbors.

    Functional programming is only "mathematically pure" because we already have math for it. You can make mutating state "mathematically pure" too, if you just invent new math for it. For example, you can pair state with a "generation" variable and define mutating functions as returning the new state with generation + 1; now all your functions are pure again. That's not all it takes, and it's not easy! But it shows how narrow-minded the current idea of "purity" is.
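
    A minimal sketch of that generation idea in Haskell (the names here are illustrative, not from the paper): "mutation" becomes a pure function that returns a fresh value with the generation bumped, so the same inputs always give the same output and old versions remain available.

    ```haskell
    -- Sketch only: pair a value with a generation counter so that "mutating"
    -- it is a pure function returning the next version.
    data Versioned a = Versioned { generation :: Int, value :: a }
      deriving Show

    bump :: (a -> a) -> Versioned a -> Versioned a
    bump f (Versioned g x) = Versioned (g + 1) (f x)

    main :: IO ()
    main = do
      let v0 = Versioned 0 (10 :: Int)
          v1 = bump (+ 5) v0
      print (generation v1, value v1)  -- (1,15)
    ```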

    • > Think of an IDE where a user is typing out code: the state is changing all the time.

      That's easy mode. Mutability is often justifiable when one person is doing one thing to a computer at a time. Now extrapolate to multiple people editing the same document at once. Suddenly you're discussing CRDTs and other approaches. And now implement undo on top of that!

      Likewise with git, or blockchain, or kafka. They're persistent logs so you can keep your head straight while figuring out what the current state(s) of the system can be. Even with git, when you do an in-place mutation (force-push) there's still an extra log (reflog) behind the scenes trying to keep the changes sane.

    • There is no essential mutable state in computer science. All essential mutable state can be modeled away to use no mutable state, as you have shown with the generation-number idea, which is one valid way. (I'm strictly talking about computer science problems, not computer engineering problems such as drivers.)

      The generation-number idea you describe is an excellent one. It immediately enables several new capabilities compared with plain mutation:

      * You are now able to determine which data is older and which is newer without resorting to an external clock.

      * If your mutation takes a long time, there is no longer a need for a long-held mutex to ensure thread safety, which would block concurrent reads; you can build the new generation separately and then atomically CAS it in. Writers don't block readers (a sketch of this follows the list).

      * If you later need undo/redo functionality, it's suddenly trivial. Or diffing between versions. Or understanding the reason for changes (the basics of auditing).
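
      A rough sketch of that CAS point in Haskell, using the stm library's TVar for the atomic swap (names are illustrative): readers just take a snapshot with no lock held, and a writer builds the next generation off a snapshot and commits it only if nobody else committed in the meantime.

      ```haskell
      import Control.Concurrent.STM

      data Versioned a = Versioned { generation :: Int, value :: a }

      -- Readers never block: they grab the current immutable snapshot.
      snapshot :: TVar (Versioned a) -> IO (Versioned a)
      snapshot = readTVarIO

      -- A writer does the (possibly slow) work outside any lock, then commits
      -- atomically; if another writer got there first, it retries.
      commit :: TVar (Versioned a) -> (a -> a) -> IO ()
      commit var f = do
        old <- readTVarIO var                                       -- snapshot, no lock held
        let next = Versioned (generation old + 1) (f (value old))   -- slow work happens here
        won <- atomically $ do
          current <- readTVar var
          if generation current == generation old
            then writeTVar var next >> pure True                    -- nobody committed in between
            else pure False
        if won then pure () else commit var f                       -- lost the race: retry
      ```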

    • I always thought the problem with a "purely" functional view is much more practical: from my limited Haskell experience, I gather that you have to use recursion instead of loops, since loops (and GOTOs) work by iteratively modifying some state, unlike recursion. But humans think in loops. If you look up a recipe in a cookbook, it will almost certainly contain loops and rarely any recursion.

      Recursive programs are provably Turing-equivalent to WHILE or GOTO programs, but that doesn't mean they are equally natural for our brains to think about. (Not to mention things like tail recursion, which is even less intuitive.)
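
      For what it's worth, in practice most Haskell code hides the recursion behind folds and maps rather than spelling it out, which is arguably closer to the "loop" in the recipe. A tiny illustration:

      ```haskell
      import Data.List (foldl')

      -- Explicit recursion, the style the comment finds unnatural...
      total :: [Int] -> Int
      total []       = 0
      total (x : xs) = x + total xs

      -- ...and a fold, which plays the role of a loop with an accumulator.
      total' :: [Int] -> Int
      total' = foldl' (+) 0
      ```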

    • > Plenty of state -- even mutable state -- is essential complexity.

      Arguably most of it!

      After all, computers don't compute.

      https://www.youtube.com/watch?v=EKWGGDXe5MA&t=297s

      One of the miseries of life is that everyone names everything a little bit wrong, and so it makes everything a little harder to understand in the world than it would be if it were named differently. A computer does not primarily compute in the sense of doing arithmetic. Strange. Although they call them computers, that's not what they primarily do. They primarily are filing systems. People in the computer business say they're not really computers, they are "data handlers". All right. That's nice. Data handlers would have been a better name because it gives a better idea of the idea of a filing system.

    • Ah, but if you model the mutable state as a function taking prior state to new state (and no cheating and modeling it as World -> World), you can constrain and control it into something you can reason about. You're one step closer to taming the dragon of chaos, bucko!
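
      A small sketch of that in Haskell, using the State monad from the mtl package (my choice; any equivalent works): the "mutation" is a function from the prior state to the new state, scoped to exactly the state it needs (a single counter here) rather than an opaque World -> World.

      ```haskell
      import Control.Monad.State  -- from the mtl package

      -- tick: take the prior counter to a new counter, returning the old value.
      tick :: State Int Int
      tick = do
        n <- get
        put (n + 1)
        pure n

      main :: IO ()
      main = print (runState (tick >> tick >> tick) 0)  -- (2,3): last result, final state
      ```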

    • IMHO, in the context of the Tar-Pit paper (programming for services and applications rather than electronics), mutation is merely a means to an end. The end being: some piece of memory holds the value I expect (e.g., to transmit, store, or show information). Thus I disagree that mutable state is essential complexity. For the rest of the post, I don't understand what you mean by "walking the line between avoiding/ignoring mutable state", and I wish you had elaborated on what you mean by "play very well with Turing neighbors", because I cannot really connect with that.

      Regarding functional programming and mutation: in a typed "pure-FP" language like Haskell,

      - you need to explicitly declare what constitutes a valid value,

      - which drives what valid values are observable between program steps,

      - which informs how corrupted a data structure can be in your program.

      For instance, take a tree-shaped structure like `Tree1 = Leaf Int32 | Node Int32 Tree1 Tree1` and some `MutableRef Tree1`. You know that from the ref you can always read a whole Tree1, and you can change the content of the MutableRef, but only by giving it a whole new Tree1. In particular, you are sure never to read "a half-initialized tree" or "a tree that was changed while I was reading it", because those things just do not exist in the realm of valid Tree1s. Such simple guarantees are really important to many programs but are not provided by most programming languages.

      Of course, if for some reason (e.g., performance) you need something more flexible and are willing to pay the extra complexity, you can do it. As an illustration, you can augment your tree with some MutableRefs as well: `Tree2 = Leaf Int32 | Node Int32 Tree2 Tree2 | Cell (MutableRef Tree1)`. Here Tree2 contains mutable references to valid Tree1s, so Tree2 is not fractally complicated, though you could do that too. With Tree2, you can interleave reads, writes, and other operations while working on a Tree2. These extra effects may be required (for performance reasons, for instance), at the expense of inviting surprising behaviors (bugs).

      In pure FP it is clear that with Tree2 we have lost the possibility to "read/write a whole tree in a single program step". Thus, if the control flow of the code is not linear (e.g., multi-threaded, callback-heavy), you may see fun things like printing one half of the tree before a mutation and the other half after it, resulting in a printout of a tree that never existed. Enforcing global properties like a heap invariant becomes inconvenient, because it is genuinely hard once we let mutation in. I'd go as far as saying that a Tree2 doesn't exist as a tree: given a Tree2, we can merely enumerate tree-shaped chunks piecewise.
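
      For concreteness, here is roughly what those two trees look like in Haskell, with IORef standing in for MutableRef and Int for Int32 (my rendering, not the exact code above):

      ```haskell
      import Data.IORef

      -- Tree1 is a plain immutable tree: any Tree1 you hold is whole and consistent.
      data Tree1 = Leaf1 Int | Node1 Int Tree1 Tree1

      -- Tree2 may contain mutable references to Tree1s, so "the whole tree" is no
      -- longer something you can observe in a single pure step.
      data Tree2 = Leaf2 Int
                 | Node2 Int Tree2 Tree2
                 | Cell (IORef Tree1)

      sumTree1 :: Tree1 -> Int
      sumTree1 (Leaf1 n)     = n
      sumTree1 (Node1 n l r) = n + sumTree1 l + sumTree1 r

      -- Each Cell is read at some instant; two traversals can disagree if another
      -- thread writes a ref in between.
      sumTree2 :: Tree2 -> IO Int
      sumTree2 (Leaf2 n)     = pure n
      sumTree2 (Node2 n l r) = (\a b -> n + a + b) <$> sumTree2 l <*> sumTree2 r
      sumTree2 (Cell ref)    = sumTree1 <$> readIORef ref
      ```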

      This reply is long already, so I won't go into how Haskell has various flavors of mutable refs in IO, ST, and STM, and recently even got linear types to allow some fine-grained composition of atomic mutations, interleaving, and prevention of effects. But overall I find pure FP takes mutability much more seriously than more mainstream languages; it just keeps us honest about where non-determinism bites. The collective failure to realize this is one of the key reasons why we are in the Tar-Pit: violated in-memory invariants can compound even outside programs, forcing outside/future users to deal with things like misleading information being shown to users, urgent upgrades of your fleet of servers, days of one-off data cleanup, and so on.

    • No. Pure functions are more modular. They allow you to treat your logic like bricks and building blocks. Pure functions compose purely on the basis of their types; combinators are actually the proper term here.

      Mutable state does not do this. Your idea of generations doesn't make sense, because that means every function must take a generation as the input parameter too. Given a different generation number the function must deterministically create the same output.

      It is not a complete system. Your equations produce this useless generation number that isn't reused as part of the system; it's just there for you and for meta-analysis.

      That is not to say your first paragraph is completely wrong, though. There is a trade-off between modularity and efficiency: mutating a variable is less resource-intensive than generating a new one.

  • For anyone who is not aware, the terms 'accidental complexity' and 'essential complexity' were coined in Fred Brooks' 1986 paper No Silver Bullet.

    • Chapter 1 of The Mythical Man Month by the same author is called "The Tar Pit" and has this memorable opening paragraph:

      > No scene from prehistory is quite so vivid as that of the mortal struggles of great beasts in the tar pits. In the mind's eye one sees dinosaurs, mammoths, and sabertoothed tigers struggling against the grip of the tar. The fiercer the struggle, the more entangling the tar, and no beast is so strong or so skillful but that he ultimately sinks.

      > Large-system programming has over the past decade been such a tar pit, and many great and powerful beasts have thrashed violently in it...

    • I was very curious about this claim and it seems that's not quite right. Of course the phrase "essential complexity" was applied to non-software topics before Brooks (Wikipedia correctly notes that the distinction between essence and accident goes back to Aristotle), but it seems Brooks was not the first to apply it to software either. I would not deny that he popularized it.

      Book search results for "accidental complexity" between 1920 and 1985: https://www.google.com/search?q=%22accidental+complexity%22&...

      Book search results for "essential complexity" between 1920 and 1985 (a lot more results): https://www.google.com/search?q=%22essential+complexity%22&l...

      Here's a publication from 1981 defining "essential complexity" specifically with reference to computer programs. https://www.google.com/books/edition/Validation_Verification...

      Note that there's another result that claims to be from 1968 but the text of the book says it's from 1996.

      Interestingly "essential complexity" was used earlier and more often than "accidental complexity", probably because the former sufficiently implies the existence of the latter.

      https://books.google.com/ngrams/graph?content=essential+comp...

  • My experience has shown almost all of it to be essential, except the stuff that is very obviously "programmers playing with stuff that seems cool".

    Once you account for hardware errors, human errors, respecting the user's time, compatibility with different file formats, multiplatform support, the limited resources developers have, etc., most programs seem like they couldn't be any simpler while still being something I want to use.

    Or, they potentially could, but nobody would pay enough to justify the extra time it would take.

    Which is why I don't complain that Etcher is 99MB. It's like conceptual art: yes, you could do it smaller, but you didn't, and I don't want to, because I can just use the existing thing and move on; and I wouldn't pay for a lighter version when the heavier thing is free.

    I just don't see all that much disappointing software, in the commercial space at least.

  • While programming in FORTH, you get similar advice: avoid stack mumbling (a.k.a. copy and store ops).

    But sometimes all those stack changes could be avoided by using a temp register.

I feel event sourcing is a real-world, pragmatic approach to the declarative programming that this paper advocates.

For state changes you add events to the database to describe something that happened. Any question you may need an answer for / business decision you want to make can be answered by querying the events.
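
A minimal sketch of that idea (the event and model names are made up for illustration): the store is just a list of things that happened, and any question is answered by folding over it.

```haskell
import Data.List (foldl')

data AccountEvent
  = Deposited Int
  | Withdrew  Int

-- The "read model": derived on demand by replaying the events.
balance :: [AccountEvent] -> Int
balance = foldl' apply 0
  where
    apply acc (Deposited n) = acc + n
    apply acc (Withdrew  n) = acc - n

main :: IO ()
main = print (balance [Deposited 100, Withdrew 30, Deposited 5])  -- 75
```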

The problem at the moment is that while event sourcing is excellent at reducing the accidental complexity surrounding business rules, there is little standard / commonly used tooling around it, and you end up with lots of accidental complexity on that end.

An example would be a database designed not as a CRUD store but to store events, manage read models, and manage the computation of projections, etc. -- while still being suitable for OLTP workloads. At a minimum, it would need very strong support for using any kind of construct in materialized views (since, in a sense, the entire business logic is written as a "materialized view" when doing event sourcing).

  • Event sourcing is a nice card to have in your hand, but it should not be a goal in itself. A yard is 3 feet. The atomic mass of hydrogen is 1.008 and its symbol is "H". These will never change. If someone came to me and said "your 'Chemical' table is not event-sourced. We're doing event sourcing here. You need to change it," I would tell them to get lost. Why the hell would you have an "An Element's Mass Was Modified" event stored in your database? Unless you're developing for CERN or NREL or something, just don't.

    On the other hand, having a bank account table with a single field for someone's money is clearly not enough. You absolutely should be tracking every transaction that has changed that account's value. Do you need to track every other possible change to that account? Like, whether they want paper or electronic mail? No, probably not.

    "Event sourcing" is a way to refactor a domain model -- take a statement and break it into a sequence of statements that "add up" to the original statement.

    "Add up" is key here. When you break "AccountBalance" into "Transaction", it's clear how to "add up" transactions to recreate the original account balance. But that's not your goal, necessarily! The reason why this tends to make better domain models is exactly because you have to think about "adding up" your domain models, and what that means. "Adding up" is an associative, probably-commutative, binary operation with identity. Note that that means your domain MUST have a "zero transaction". ALL of your events that you event source need a "zero event". If you cannot come up with the "zero" of an event, then you should not be breaking into events! The whole point is to be able to define the monoid over it, which requires identity.

    So instead of taking event sourcing as an end goal, make your goal this: think about the operations that make sense on your domain, like accumulating accounts. What else can you accumulate? Can you add Employees together? Not really -- you can group them into departments and events and meetings. Is that grouping associative and commutative? Sure -- it's just set union. Is there anything about Employees you can add together? Well, their salaries. In fact, for data analysis, employee salaries are an important cost that you probably want to throw in a data cube. Define a Monoid over Employee salaries.

    What other operations make sense on your data? Close, open, start, end, group, buy, sell, move, rotate, add, multiply, concatenate, join, reverse, inverse, combine, merge, undo, redo, fill, empty, saturate, fix, break, import, export, report, validate. Are they associative, commutative, distributive, invertible? Do they have identity? Event sourcing is such a tiny part of exploring this world. And it's worth exploring.

    • I'm not arguing that event sourcing should be done for its own sake, so I don't really want to disagree with you; but that said your post doesn't perfectly resonate with me either.

      When you write a typical backend system, the desired function of the system is to interact with the external world. Without I/O the system may as well not exist.

      Input is a request from someone that something be done, or a record that something happened. Such input changes the data recorded, or appends to what the system should know / has seen. This is an "event".

      All input can be framed as being an event. But "an element's mass was modified" is not an event ... it doesn't describe someone or something giving input to our system.

      The algebraic view you take seems to treat the system at a different level than what I think of as event sourcing.

      Neither "an element's mass was modified" or "sell" or "transaction" that you mention are realistic events. An event is "User U clicked a button in the web console to Sell share S at time T". Implementing the effects of that event -- computing a specific read model resulting from appending the event to the stored set of events -- may well be best done by some algebra like you suggest, but that seems like another topic.

      You seem to talk about models for computing and transforming state. I talk about I/O vs data storage.

In the past, I have been unimpressed by this paper. Perhaps someone can provide some historical context...

But from my perspective, the paper defines complexity in exactly the way that allows the authors to deride OO, FP, etc., whilst simultaneously showing how awesome functional relational programming is. It ignores complexity that's orthogonal to what FRP addresses and ignores areas in which FRP itself contributes unnecessary complexity.

It feels like a scenario where the authors had something that they thought was neat and went out to create metrics that would in fact show that it was neat. Maybe FRP really is neat, but I feel that the paper itself doesn't contribute anything because its logic is so custom and purpose-built.

  • I think the first half of the paper ("the problem") is much stronger than the latter half ("the solution").

A classic paper with a dream that has yet to be realized. We continue to bolt state on top of state. Redux bolted on top of GraphQL on top of Redis on top of Postgres. We can do better.

  • It continues to be such a sad state of affairs. We are constantly reinventing and repackaging well known bad practices.

    I recall getting publicly lambasted on here for daring to question the wisdom of an ActiveRecord-like data layer for a front-end SPA framework.

    • Oh right, I forgot the ORM with "distributed caching" needed between the database and GraphQL layers. More hidden state.

I think the crux of this paper is section 7.2.2:

> There is one final practical problem that we want to consider — even though we believe it is fairly rare in most application domains. In section 7.1.1 we argued that immutable, derived data would correspond to accidental state and could be omitted (because the logic of the system could always be used to derive the data on-demand). Whilst this is true, there are occasionally situations where the ideal world approach (of having no accidental state, and using on-demand derivation) does not give rise to the most natural modelling of the problem. One possible situation of this kind is for *derived data which is dependent upon both a whole series of user inputs over time, and its own previous values*. In such cases it can be advantageous to maintain the accidental state even in the ideal world. An example of this would be the derived data representing the position state of a computer-controlled opponent in an interactive game — it is at all times derivable by a function of both all prior user movements and the initial starting positions, but this is not the way it is most naturally expressed.

Emphasis is mine.

I think that this type of derived data, which I put in italics above, is quite common - contrary to what the authors of the paper argue. Any UI code or game-like system, like simulations, will have this kind of data. And the paper does not have a good answer for it. I honestly think that nobody has an answer for it, and it's why most of our UIs suck.
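
To make the paper's example concrete, here is a sketch (my own names) of state that is in principle derivable from all prior inputs, but naturally expressed as an accumulation over its own previous value:

```haskell
data Input = MoveLeft | MoveRight

-- The "natural" expression: new position from the previous position and one input.
step :: Int -> Input -> Int
step pos MoveLeft  = pos - 1
step pos MoveRight = pos + 1

-- In principle the position is a pure function of the full input history...
position :: Int -> [Input] -> Int
position = foldl step

-- ...but what we actually keep around is the running accumulation.
positions :: Int -> [Input] -> [Int]
positions = scanl step

main :: IO ()
main = print (positions 0 [MoveRight, MoveRight, MoveLeft])  -- [0,1,2,1]
```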

I would love to see something that makes handling this type of derived data easy.

  • You should look into the "comonad" abstraction from the functional programming world. Dual to monads, they're a natural fit for situations where you might have a value with some sort of (possibly infinite) context (think: neighborhood, or history, etc.) that can be either pre-computed or computed on-demand.

    This StackOverflow post[1] is a good starting point for understanding comonads. It points out that they can be used to model cellular automata (like Conway's Game of Life), sequences, streams, etc.

    [1] https://stackoverflow.com/questions/8428554/what-is-the-como...
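
    For a flavor of what that looks like, here is a hand-rolled sketch (not the comonad library's actual API): an infinite Stream gives every element its "future" as context, and extend runs a context-dependent rule at every position.

    ```haskell
    {-# LANGUAGE DeriveFunctor #-}

    data Stream a = Cons a (Stream a) deriving Functor

    extract :: Stream a -> a
    extract (Cons x _) = x

    duplicate :: Stream a -> Stream (Stream a)
    duplicate s@(Cons _ xs) = Cons s (duplicate xs)

    -- Run a "rule" that may look at the whole context, at every position.
    extend :: (Stream a -> b) -> Stream a -> Stream b
    extend f = fmap f . duplicate

    -- Example rule: each cell becomes the average of itself and its right neighbour.
    smooth :: Stream Double -> Double
    smooth (Cons x (Cons y _)) = (x + y) / 2

    nats :: Stream Double
    nats = go 0 where go n = Cons n (go (n + 1))

    takeS :: Int -> Stream a -> [a]
    takeS 0 _           = []
    takeS n (Cons x xs) = x : takeS (n - 1) xs

    main :: IO ()
    main = print (takeS 5 (extend smooth nats))  -- [0.5,1.5,2.5,3.5,4.5]
    ```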

I came across this after seeing relic[0] submitted the other day and thought it was pretty interesting.

I've been into CRDTs for a while and have started wondering about generic mechanisms for distributed data. This led me to read a lot more about the Relational Model of data and eventually to the Event Calculus.

What's interesting to me is that these things end up feeling a lot like CRDTs[1] or Event Sourcing. I haven't quite finished pulling on these threads but the relic link was a timely read considering!

I really liked the first half of this paper and the authors' categorization of complexity. However, the second half fell a bit short for me. It seems they made the same mistake as many other people (SQL != Relational), and their idea of Feeders and Observers seems more like an escape hatch than an elegant method for interfacing with the outside world.

[0] https://github.com/wotbrew/relic [1] http://archagon.net/blog/2018/03/24/data-laced-with-history/

I'm designing such a system now, based on the purely functional Joy language with two additional data stores: a relational db system (Prolog (or maybe Datalog), not SQL) and what is effectively a git repo, although I think of it as a "data oracle".

The role of the relational db system is explained in TFA. (Prolog makes a fine Relational Model DB; the "relations" in relational model DBs are the same logical relations that Prolog "relations", um, are. The language is cleaner, simpler, and more powerful than stock SQL, there's an ISO standard and several solid implementations, and you can always back it up with SQLite or Postgres or whatever if you need to.) The trick to integrating it with a purely functional system is to only use "pure and monotonic Prolog code", which you want to do anyway (https://news.ycombinator.com/item?id=34936729).

This paper was career changing for me.

Chapter 9 is effectively the original concept for our business rules engine today. We use SQLite and C# UDFs as the actual foundation.

Using a proper relational model and SQL queries for all the things means that domain experts can directly contribute. It also makes it feasible to turn your domain experts into internal customers of your developers. For B2B products, this can be make or break.

Building an actual business around this kind of thing is very hazardous in my experience. Unless you are completely certain that you have the schema figured out, this promising foundation converts to quicksand.

Using higher normal forms is one way to reduce the consequence of screwing up domain modeling, but you still really have to know the relational guts of the business one way or another.

One design-time trick is to get all stakeholders to compose their ideal schema in something like Excel and have them fill in some sample data. Showing them an example of one of these for a different domain is a powerful lesson, in my experience.

My impression from reading this has always been "Almost there! Keep going." If you keep pulling on these threads, I claim that you reach some inevitable conclusions. These ideas are a superset of all the other advice they give (and most other advice you hear, too):

1. Separate identity from data. These are completely different things. The number 7 is not mutable, even though my age is. Keep going! The phrase "It was a dark and stormy night" is not mutable, even though the opening line of your book is. Keep going! The statement "there are seventeen cars parked on level 3 and four spots available" is not mutable, even though the status of the parking garage is. Identity is immutable, and statements (data) are also immutable. Only the assignment of a statement to an identity is mutable. (See the sketch at the end of this comment.)

2. Think about the operations that make sense on your data. Sure, you have "map", "filter", "reduce" that are mathematically pure. What about "offset" or "rotate"? Those have identity, they are associative, individually commutative (but not with each other!). Is there a distributive property that applies there? Okay, what about "buy" and "sell" operating on commodities? Are those mathematical operations? Do they have an identity element? Are they associative and commutative? "Buy 2000 lbs of chicken" -- is that equivalent to "Buy 1 ton of chicken"? What are its domain and range? Not "chicken", nor even "warehouses" -- you're not teleporting the chicken as soon as you commit that record. More like "contract", or even "contract request". What can you do with contracts? Is there an "identity contract"? "Buy 0 chicken"? Zero is a useful number! Is "Buy 0 chicken" the same as "Buy 0 beef"? Explore these questions. Find the math around your domain. Functional purity is all good, but it's wasted if your domain is just "Chicken : Meat : Commodity", "Beef : Meat : Commodity", "Pine : Lumber : Commodity". Wrong. Don't think about nouns. Think about sensible pure operations in your domain. Successive abstraction layers should be restricting what's possible to do with your data, by defining higher and higher-level operations. If your abstraction layers aren't restricting what's possible to do, they're an inner platform and you don't need them.

3. Don't make big decisions upfront. That's how you add accidental complexity. Don't make any decisions you don't need to, and prefer solutions that allow you to defer or lighten decisions. If you follow this thread, that means you're using a relational model. They absolutely got that right (section 8). Otherwise you're making decisions about ownership that you don't need to be making. Domain Driven Design has it wrong here with aggregate roots. What's an aggregate? Which concept goes inside of which aggregate? You do not need to make that decision. The authors of TFA get it right, and then wrong again, when they try to apply a relational model to shitty noun-based domain ideas like "Give Employee objects a reference to their Department, vs. Give Department objects a set (or array) of references to their Employees vs. Both of the above". No. They're not following their own advice. Can an employee exist without the concept of "department"? Yes. Can a department exist without the concept of employees? Probably. Therefore you must encode the relationship separately from both. Your hands are tied. If concept A can exist independently of concept B, you must not draw an arrow from A to B. The answer is not "both", it's "neither". An employee's assignment within a department is its own independent concept, that knows about "employee" and "department" both. And now it's obvious you can give this concept its own tenure and add a new record whenever they change departments. You've found append-only event sourcing without needing to start there -- it came from first principles (in this case).

4. Operate on the broadest scope you can. Manipulate sets of things, not individual things. This is half of functional programming's usefulness right here. This is supported by using a relational model. Operate on sequences of things, not individual things: there's reactive programming. How else can you broaden the scope of what you're operating on?

5. Don't just do something, stand there! That is: don't do, plan. Instead of immediately looping, just use "map". Instead of immediately hitting the database, write a query plan. This doesn't mean offload your thinking to GraphQL -- that's just kicking the can down the road. See point #2. This is the other half of functional programming's usefulness. Do only as much as you need to make (or append to) a plan, and then stop. Someone else will execute it later -- maybe! You don't care.

There's probably more, but there are more fundamental ideas than "use functional programming" or "use event sourcing" or "use SOLID" -- first principles that actually lead to almost all the other good advice you get. This paper kinda almost gets there a couple times, then walks back again. Keep going. Keep pulling on those threads. I suspect that the more you do this, the more your code will change for the better. But you have to be willing to keep going -- don't half-ass it. If you only go part way and then stop, the benefits won't be apparent yet.
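
For point 1, a sketch of what "only the assignment is mutable" can look like in code (names are mine): the statements are immutable values, and a single reference holds the current assignment of statements to identities.

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

newtype LevelId = LevelId Int deriving (Eq, Ord, Show)

-- An immutable "statement" about a parking level.
data LevelStatus = LevelStatus { carsParked :: Int, spotsFree :: Int } deriving Show

main :: IO ()
main = do
  -- The only mutable thing: which statement is currently assigned to which identity.
  current <- newIORef (Map.fromList [(LevelId 3, LevelStatus 17 4)])
  -- "Mutation" swaps in a new immutable statement; the old one is unchanged.
  modifyIORef' current (Map.insert (LevelId 3) (LevelStatus 18 3))
  readIORef current >>= print
```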

There are a lot of comments saying that we can't avoid mutable state and dismissing this paper entirely.

I find a practical interpretation of this paper is:

- Favor pure functions over impure functions

- Reduce mutable state and be deliberate about where those mutations have to happen

- Prefer derived state over keeping state in sync (see the sketch below)
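
For that last point, a small sketch (illustrative types): rather than storing a total alongside the items and keeping the two in sync, derive it when asked.

```haskell
data LineItem = LineItem { label :: String, price :: Int }

-- No cached total field to keep in sync.
newtype Cart = Cart { items :: [LineItem] }

total :: Cart -> Int
total = sum . map price . items

main :: IO ()
main = print (total (Cart [LineItem "tea" 3, LineItem "mug" 9]))  -- 12
```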

There are always exceptions. Use your judgment on when this simplifies code and when it doesn't.

Does anyone have this in audio form? I don't have an ergonomic way to read this, even though it looks really on point.