Jepsen: PostgreSQL 12.3

4 years ago (jepsen.io)

Personally, this kind of thing actually gives me _more_ confidence in Postgres rather than less. The core team's responsiveness to this bug report was incredibly impressive.

Around June 4th, the article's author comes in with a bug report that basically says "I hammered Postgres with a whole bunch of artificial load and made something happen" [1].

By the 8th, a preliminary patch is ready for review [2]. That includes all the time to get the author's testing bootstrap up and running, reproduce, diagnose the bug (which, lest we forget, is the part of all of this that is actually hard), and assemble a fix. It's worth noting that it's no one's job per se on the Postgres project to fix this kind of thing — the hope is that someone will take interest, step up, and find a solution — and as unlikely as that sounds to work in most environments, amazingly, it usually does for Postgres.

Of note to the hacker types here: Peter Geoghegan was able to track the bug down through the use of rr [4][5], which allowed an entire problematic run to be captured and then stepped through forwards _and_ backwards (the latter being the key to not having to run the simulation over and over again) until the problematic code was identified and a fix could be developed.

---

[1] https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDt...

[2] https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPP...

[3] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...

[4] https://en.wikipedia.org/wiki/Rr_(debugging)

[5] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...

  • Yeah, the PostgreSQL team really knocked it out of the park on this one. It was a pleasure working with them. :)

    • To be fair, you have a great batting average in identifying issues to allow for improvement. Thanks for your work

    • With distributed and multicore being the path forward with the end of Moores law, your work has been instrumental in helping open source distributed systems improve.

      Since distributed systems are so difficult and complicated, it enables salespeople and zealots to both deny issues and overstate capability.

      Your work is a shining star in that darkness. Thank you.

  • Thank you for this comment that gives credit where it's due, this is a very impressive set of threads to read through.

    And I agree. For me, one of the most important measures of the reliability of a system is how that system responds to information that it might be wrong. If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand, is technical, validating, and prompt.

    Every system has bugs, but depending on these cultural features, not every system is capable of systematically removing those bugs. With logs like these, the pg community continues to prove capable. Kudos!

    • >If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand is technical, validating, and prompt.

      This resonates with me with teams inside the company as well.

      We have a few teams that just deflect issues. Find any flaw in the bug report, be it an FQDN in a log search, and poof, it goes back to sender: don't care. Engineers in my team just don't bother to report bugs there anymore, regardless of how simple. Usually, it's faster and less frustrating to just work around the problem or ignore it. You could be tilting at windmills for weeks, or just fudge around it.

      Other teams are far more receptive to bugs: engineers end up curious and just poke around until they understand what's up. And then you get bug reports like "OK, if I create these 57 things over here, and toggle thing 32 to off, and then toggle 2 things indexed by prime numbers on, then my database connection fails. I've reproduced this from empty VMs. If 32 is on, I need to toggle two perfect squares on, but not 3". And then a lot of things just get fixed.

    • If you reject the "it's not that bad" response then you are saying all bugs are equal in severity / urgency / priority.

      Which is obviously not true.

      A better response might be something like "I understand the problem is X however the fix might take Y months and the next release will deprecate that feature anyway. Can you quantify or guesstimate the impact on our product if we allow this bug to stay unfixed?"

      1 reply →

    • Practically everyone in distributed systems knows who Aphyr is, and to get a bug reported by him is like a badge of honor. I am willing to bet that people at PostgreSQL jousted over who got to work with him on this bug, because it would be such an amazing learning opportunity. I know I would want to!

  • Indeed — it's great to see a vendor (team, in this case) that doesn't try to downplay a Jepsen result, and instead fixes the issues.

    However, there is one more takeaway here. I've heard too many times "just use Postgres", repeated as an unthinking mantra. But there are no obvious solutions in the complex world of databases. And this isn't even a multi-node scenario!

    • > there is one more takeaway here

      I don't think the "just use Postgres" mantra takes any hits at all from this. (If anything, I feel better about it).

      I've used maybe a dozen (?) databases/stores over the years - graph databases, NoSQL databases, KV stores, the most boring old databases, the sexiest new databases - and my general approach is now to just use Postgres unless it really, really doesn't fit. Works great for me.

      8 replies →

    • "Use PostGres until you have an engineering - data driven rationale not to" is my standard answer for non-blob data storage when a project starts.

      Why? Because when `n` is small (tables, rows, connections), Postgres works well enough, and if `n` should ever become large, we'll have an interest in funding the work to do a migration, if that's appropriate - and we'll be able to evaluate the system at scale with real data.

      4 replies →

    • "I've heard too many times 'just use Postgres', repeated as an unthinking mantra."

      You use "unthinking" pejoratively, but being able to skip past some decisions without over-analyzing is really important. If you are an 8-person startup, you don't have time for 3 of the people to spend weeks discussing and experimenting with alternatives for every decision.

      Databases are really important, but people make tons of important decisions based on little information. If you have little information other than "I need to pick a database", then Postgres is a pretty good choice -- a sensible default, if you will.

      Everyone wants to be the default, of course, so you need some criteria to choose one. It could be a big and interesting discussion, but there are relatively few contenders. If it's not free, it's probably out. If it's not SQL, it's probably out. If it's not widely used already, it's probably out. If it's tied to any specific environment/platform (e.g. Hadoop), it's probably out (or there will be a default within that specific platform). By "out", I don't mean that it's unreasonable to choose these options, just that they would not make a sensible default. It would be weird if your default was an expensive, niche, no-SQL option that runs inside of Hadoop.

      So what's left? Postgres, MySQL, and MariaDB. SQL Server and MongoDB, while not meeting all of the criteria, have a case for being a default choice in some circles, as well. Apparently, out of those options, many on HN have had a good experience with Postgres, so they suggest it as default.

      But if you have additional information about your needs that leads you to another choice, go for it.

    • "Just use Postgres" may have become meme but for good reason and is well grounded IMO.

    Many immature databases without much widespread use are better avoided, though. We managed to break Datomic three times during development; the first two bugs were fixed in a week, the third took a month, and the changelog described it as "Fix: Prevent a rare scenario where retracting non-existent entities could prevent future transactions from succeeding". So yeah, we went back to "just use Postgres". Who wants to go through the nightmare of hitting those bugs in production, and who knows how many more? Scary situation.

  • Thanks for this summary. I take for granted that I have Postgres, a powerful and reliable database that I get to use for free in all my projects and work.

This PostgreSQL mailing list thread lets you read along with the PostgreSQL developers and Jepsen; it seems like a very useful discussion: https://www.postgresql.org/message-id/flat/db7b729d-0226-d16...

  • This is just such a pleasure to read, even as someone with only a surface-level awareness of database internals. Both for the incredibly friendly and professional tone, and for the obvious deep technical knowledge on both sides.

    And that first email, my god, that should be the titanium-and-gold-plated standard of a bug report.

    • > that first email, my god, that should be the titanium-and-gold-plated standard of a bug report.

      It's a thing of beauty. It even includes versions of software used!

      My daily experience with bug reports is that, 50/50, they won't even include a description, just a title. It's such a cliché already, but "project name is broken" makes my blood boil. What environment? What were you doing? Is this production? How do I test this bug? (from an Ops perspective) When did you notice this? Has anything changed recently to possibly cause an error?

      Arg, my blood pressure!

      /offtopic, sorry.

It is very rare to see a Jepsen report that concludes with a note that a project is being too humble about its consistency promises.

Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.

  • > Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.

    They also admitted that the testing framework cannot evaluate more complex scenarios with subqueries, aggregates, and predicates. So it is possible that PG's consistency promises are spot on, or maybe even overpromising.

  • The testing on Zookeeper found no problems, which was extremely impressive given how thorough Aphyr is.

    • Strong consistency is easier when throughput and concurrency are limited.

      ZooKeeper is impressive in many ways. However, unless something changed drastically in the last two years, ZooKeeper throughput will always limit it to configuration/metadata/control-plane use cases rather than primary/data-plane use cases.

>PostgreSQL has an extensive suite of hand-picked examples, called isolationtester, to verify concurrency safety. Moreover, independent testing, like Martin Kleppmann’s Hermitage has also confirmed that PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why, then, did we immediately find G2-item with Jepsen? How has this bug persisted for so long?

This is super interesting. Jepsen seems to be like Hypothesis for race conditions: you specify the race condition to be triggered and it generates tests to simulate it.

Yesterday, GitLab acquired a fuzz testing company [1]. I wonder if Jepsen was envisioned as a fully CI-integrated testing system.

[1] https://m.calcalistech.com/Article.aspx?guid=3832552

  • Yes. Jepsen and Hypothesis both descend from a long line of property-based testing systems--most notably, Haskell & Erlang's QuickCheck. Jepsen makes a number of unusual choices specific to testing concurrent distributed systems: notably, we don't do much shrinking (real-world systems are staggeringly nondeterministic). Jepsen also includes tooling for automated deploys, fault injection, a language for specifying complex concurrent schedules, visualizations, storage, and an array of sophisticated property checkers.

Reading through the source of Elle:

> "I cannot begin to convey the confluence of despair and laughter which I encountered over the course of three hours attempting to debug this issue. We assert that all keys have the same type, and that at most one integer type exists. If you put a mix of, say, Ints and Longs into this checker, you WILL question your fundamental beliefs about computers" [1].

I feel like Jepsen/Elle is a great argument for Clojure; reading the source is actually kind of fun. Not what you'd expect for a project like this.

[1]: https://github.com/jepsen-io/elle/blob/master/src/elle/txn.c...

  • Wonder if this "manual type constraints"-style code is pre-"spec"

    • Normally I'm a core.typed person, but static type constraints don't quite make sense here. We want heterogeneity in some cases (e.g. you want to be able to mix nils and ints), but not in others (e.g. this short and int mixing, which could be intentional, but also might not be).

      I've considered spec as well, but spec has a weird insistence that a keyword has exactly one meaning in a given namespace, which is emphatically not the case in pretty much any code I've tried to verify. Also its errors are... not exactly helpful.

      3 replies →

    • Elle is pretty new so I would guess not--unless it's been lurking somewhere else. Dunno what aphyr's thoughts on spec are, plus I'm an amateur Clojurian, so I'm not sure what the community consensus is or if spec has drawbacks that make it not a good fit.

This is my understanding of what a G2-Item Anti-dependency Cycle is from the linked paper example:

  -- Given (roughly) the following transactions:  

  -- Transaction 1 (SELECT, T1)
  with all_employees as (
    select sum(salary) as salaries
    from employees
  ),
  department as (
    select department, sum(salary) as salaries
    from employees group by department
  )
  select (select salaries from all_employees)
       - (select sum(salaries) from department) as difference;

  -- Transaction 2 (INSERT, T2)
  insert into employees (name, department, salary)
  values ('Tim', 'Sales', 70000);

  -- G2-Item is where the INSERT completes between all_employees and department,
  -- making the SELECT result inconsistent 

This is called an "anti-dependency" issue because T2 clobbers the data T1 depends on before it completes.

They say Elle found 6 such cases in 2 min, which I'm guessing covers a "very big number" of transactions, but I can't figure out exactly how big that number is from the included logs/results.

Also, "Elle has found unexpected anomalies in every database we've checked"

  • Yeah, it was relatively infrequent in that particular workload--dramatically less than PostgreSQL's "repeatable read" exhibited. These histories are roughly 15K successful transactions long--see the stats field in results.edn. I'm hesitant to make strong statements about frequency, because I suspect this kind of thing depends strongly on workload, but I would hazard a guess that it's not super common.

Props to Jepsen for exposing this longtime bug. Props to the PG team for identifying the culprit and for their response. This report just strengthens my faith in the project.

It would be great to see Jepsen testing on distributed Postgres, as this is a single-node issue they've found here. In prod, don't folks run HA?

  • I started this analysis intending to do just that--it's been difficult, however, to figure out which of the dozens of replication/HA configurations to actually test. I settled on Stolon, since it seemed to make the strongest safety claims. However, I found bugs which turned out to be PostgreSQL's fault, so I backed off to investigate those first.

    • And herein lies the rub: HA Postgres is an extremely painful proposition. Based on our non-scientific research, Patroni seems to be the most battle-tested solution, and it is as popular as Stolon, if not more so.

      25 replies →

We laughed when this happened to MongoDB.

The difference though is the reaction from the vendor.

  • Fun story: after the last report which called them out for not talking about write loss by default, MongoDB updated their Jepsen page to say that the analysis didn't observe lost writes. I guess they assumed that people wouldn't... read the abstract? Let alone the report?

  • For me, MongoDB has a track record of boasting a lot ("webscale") and hiding/denying mistakes.

    PostgreSQL is quite the opposite on that front: confident, yet open to criticism and able to admit mistakes. Hell, I've even seen them present their mistakes at conferences and ask for help.

    • Yes, for instance not returning errors in some cases when writes failed. I think this lasted until version 2, but to be fair, they fixed this kind of stuff and started to deal with it differently later on. However, their reputation never fully recovered from this.

> Neither process crashes, multiple tables, nor secondary-key access is required to reproduce our findings in this report. The technical justification for including them in this workload is “for funsies”.

Always read the footnotes!

By the way: where does the Jepsen name come from?

I have wondered more than once and my browsing and searching skills are failing me on this one.

Edit: The closest link I can find is "Call me maybe", but I am not able to find the connection, or even a direct link or mention, for now.

I am still wondering when we will see PostgreSQL being tested in a HA form.

It's just extraordinary to me that it's 2020 and it still does not have a built-in, supported set of features for this use case. Instead we have to rely on proprietary vendor solutions or dig through the many obsolete or unsupported options.

  • There is a built-in supported set of features for high availability. What exactly are you missing?

    • The option to install Postgres on three instances, specify that they're in cluster "foo", and then have it just work, including automatically fixing any issues when one of the instances drops out and rejoins.

      That's what other DBs have, but it seems to be missing from Postgres. If it now exists, could you point me to the docs explaining how to do this?

    • Stolon or an equivalent way of automatically deploying on Kubernetes with a Helm chart being officially blessed by the PostgreSQL team and made part of the official distribution.

      Also, same for a multi-master solution.

What does this really mean? I just migrated from mongo to Pg.

  • If you came from Mongo, that means everything will work far more reliably than you're used to.

    • This test only applies to a single instance of PostgreSQL.

      If you're looking for HA or need to shard, then its reliability is in question, since it's never been tested.

  • There were edge cases in PostgreSQL’s SERIALIZABLE isolation level - which is supposed to ensure that concurrent transactions behave as if they were committed sequentially.

    Specifically - if running a transaction as SERIALIZABLE, there was a very small chance that you might not see rows inserted by another transaction that committed before you in the serial order. Many applications don’t need this level of transaction isolation - but for those that do, it’s somewhat scary to know this was lurking under the bed.

    Every implementation of a “bank” system where you keep track of deposits and withdrawals is a use case for SERIALIZABLE, and this bug means a double-spend could happen because the next transaction didn’t see that an account had just had a transaction that drained the balance, for example (rough sketch below).

    Props to Jepsen for finding this.
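
    For concreteness, a minimal sketch of that bank scenario (the txns ledger table here is hypothetical, not taken from the report):

      -- Hypothetical ledger table: txns(account int, amount int).
      -- Two sessions run this concurrently for the same account:
      BEGIN ISOLATION LEVEL SERIALIZABLE;
      SELECT coalesce(sum(amount), 0) AS balance
        FROM txns WHERE account = 1;    -- both sessions see, say, 100
      -- the application decides the withdrawal is covered, so:
      INSERT INTO txns (account, amount) VALUES (1, -100);
      COMMIT;
      -- Under SERIALIZABLE, one of the two commits should fail with a
      -- serialization error (SQLSTATE 40001) and be retried; the bug meant
      -- that, very rarely, both could commit, allowing a double-spend.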

    • The bank example is useful, because it tends to elicit the right thinking for people, but banking has a long history of eventual consistency.

      For the vast majority of the history of banking, local branches (which is a very loose term here, e.g. a family member of the guy you know in your hometown, rather than an actual physical establishment) would operate on local knowledge only. Consistency is achieved only through a regular reconciliation process.

      Even in more modern, digital times, banks depend on large batch processes and reconciliation processes.

    • I'd say MOST non-trivial applications require SERIALIZABLE. Every time an app does `BEGIN; SELECT WHERE; INSERT/UPDATE; COMMIT` it needs `SERIALIZABLE`, because it is the only level that catches cases where a concurrent transaction adds rows such that the SELECT WHERE changes its result set, and therefore the subsequent INSERT/UPDATE should be done with different values.

    • The common bank example as I understand it doesn't require serializable, but only snapshot isolation: If two transactions both drain the source balance, the one that commits last will fail, because its snapshot doesn't match the state anymore.
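
      A rough sketch of that single-balance case (accounts is a hypothetical table, not one from the report):

        -- Hypothetical table: accounts(id int primary key, balance int).
        -- Both sessions run this concurrently under snapshot isolation:
        BEGIN ISOLATION LEVEL REPEATABLE READ;
        UPDATE accounts SET balance = balance - 100
          WHERE id = 1 AND balance >= 100;
        COMMIT;
        -- The second session blocks on the row lock; once the first commits,
        -- it fails with "could not serialize access due to concurrent update",
        -- so only one of the two withdrawals can succeed.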

      1 reply →

  • It's an isolation issue but if you're coming from Mongo I'd broadly guess it's not one you're going to trigger. Also, look at their other analyses ... they're very detailed and upfront about serialization isolation issues in a lot of huge databases/datastores.

    Noteworthy: "In most respects, PostgreSQL behaved as expected: both read uncommitted and read committed prevent write skew and aborted reads."

  • The default isolation level is read committed mode, whereas the bug in question only affected applications that use serializable mode. You have to ask for serializable mode explicitly; if you're not doing that, then you cannot possibly be affected by the bug. (Perhaps you should consider using a higher isolation level, but that would be equally true with or without this bug.)
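
    For anyone reading along, the usual ways to ask for it look like this (standard PostgreSQL settings, nothing exotic):

      -- Per transaction:
      BEGIN ISOLATION LEVEL SERIALIZABLE;
      -- ... statements ...
      COMMIT;

      -- Per session, or set the same parameter in postgresql.conf:
      SET default_transaction_isolation = 'serializable';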

  • Was Jepsen a key contributor to your choice to migrate? Are you using PG in a distributed/replicated/HA mode like mongo?

    • - Yes, but not the only one; it was a succession of problems. (Why did I use Mongo in the first place, on a transaction-heavy call center database? Because the customer forced it, since it was the only thing he knew.)

      - No, just a single huge instance, managed on Azure.

So this does not affect SSI guarantees if the transactions involved all operate on the same row? Is my understanding correct? For instance can I update a counter with serializable isolation and not run into this bug?

  • I think so, yeah. You could theoretically have a G2-item anomaly on a single key, but in PostgreSQL's case, the usual write-set conflict checking seems to prevent them.

For the repeatable read issue, I don't intuitively understand why the violation mentioned would be a problem. In particular, even though the transaction sequence listed wouldn't make sense for a serializable level, it seems consistent with what I'd expect from repeatable read (though I have not read the ANSI SQL standard's definition of repeatable read).

Any insights into why we should want repeatable read to block that? It feels like blocking that is specifically the purpose of serializable isolation.

  • The report talks at length about this, but maybe it bears rephrasing!

    The ANSI definitions are bad: they allow multiple interpretations with varying results. 25 years ago, Berenson, O'Neil, et al. published a paper showing the ANSI definitions had this ambiguity, and that what the spec meant to define should have been a broader class of anomalies. They literally say that the broad interpretation is the "correct" one, and the research community basically went "oh, yeah, you're right". Adya followed up with generalized isolation level definitions, and pretty much every paper I've read has gone with these versions since. That didn't make its way back into the SQL spec though: it's still ambiguous, which means you can interpret RR as allowing G2-item.

    Why prevent G2-item in RR? Because then the difference between repeatable read and serializable is specifically phantoms, rather than phantoms plus... some other hard-to-describe anomalies. If you use the broad/generalized interpretation, you can trust that a program which only accesses data by primary key, running under repeatable read, is actually serializable. That's a powerful, intuitive constraint. If you use the strict interpretation, RR allows Other Weird Behaviors, and it's harder to prove an execution is correct.

    For a very thorough discussion of this, see either Berenson or Adya's papers, linked throughout the report.
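
    To make the difference concrete, here's a rough sketch of a G2-item (write skew) that uses only primary-key access; the kv table is hypothetical, not an example from the report:

      -- Hypothetical table: kv(k int primary key, val int), with rows k = 1, 2.
      -- Session A:
      BEGIN ISOLATION LEVEL REPEATABLE READ;
      SELECT val FROM kv WHERE k = 2;
      -- Session B, concurrently:
      BEGIN ISOLATION LEVEL REPEATABLE READ;
      SELECT val FROM kv WHERE k = 1;
      -- Session A:
      UPDATE kv SET val = 1 WHERE k = 1;
      COMMIT;
      -- Session B:
      UPDATE kv SET val = 1 WHERE k = 2;
      COMMIT;
      -- Both commit under PostgreSQL's repeatable read (snapshot isolation),
      -- yet each SELECT missed the other session's write, so no serial order
      -- explains the outcome: a G2-item cycle. The broad (Adya) interpretation
      -- forbids this under repeatable read; SERIALIZABLE aborts one of the two.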

    • Thanks, I got the part about the spec being ambiguous, am more interested in the "why" aspect, since the current behaviour seems intuitive to the name "repeatable read". But on closer inspection, I see PostgreSQL's repeatable read blocks phantom reads even though the ANSI spec permits that! I don't get why phantom reads would be acceptable under "repeatable read"... I probably should give those papers a read some time. But in the meantime, given the choice of phantom reads or G2-item, I think I'd pick blocking phantom reads. (It might be nice to have the option to choose though!)

      In PostgreSQL's case, if they somehow made repeatable read prevent G2-item without giving up the phantom-read prevention, would that mean repeatable read is then "serializable" according to the ANSI definition?

      1 reply →

@aphyr could you please clarify this sentence?

> This behavior is allowable due to long-discussed ambiguities in the ANSI SQL standard, but could be surprising for users familiar with the literature.

Should that be "not familiar"? And which literature - the standard or the discussions?

  • Familiar. If you've read any of the literature on snapshot isolation or isolation levels in general (Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, Adya, Liskov, Fekete, Cerone, Bailis, Alvisi, Clement, Crooks, Pu...) you would probably not expect to see SI treated as stronger than RR. The paper which first defined SI says it's not. Heck, Jim Melton, who wrote that part of the ANSI spec, is a co-author on that paper, which goes on to say the strict interpretation is incorrect and not what ANSI intended!

Thanks for doing these, they're incredibly interesting, useful, amusing (Oh no! the schadenfreude!) and also, incredibly inspiring to me to be a better engineer, so thank you again :)

Any plans to test any other NoSQL databases? I'm interested in MarkLogic

  • No, Jepsen is over, sorry!

    Jokes aside, MarkLogic is welcome to pay me. Each one of these reports takes weeks to months of full-time work.

    • This is probably outside the purview of Jepsen, but have you ever looked at Kdb, or the language it is built on top of (K)?

      Curious to hear your thoughts on it! Would love a Jepsen-style analysis of kdb.

    • If Oracle were smart they'd pay you to test theirs.

      Of course, maybe this has already happened and you are not able to discuss it because of an NDA. Which would be perfectly fine, I think.

  • IMO, don't trust a NoSQL database that hasn't done Jepsen.

    It is a basic honesty test, because the uncertainty and the difficulty of reproducing issues can be used for denialism by proponents and salespeople.