← Back to context

Comment by evanelias

8 hours ago

Postgres limits btree keys to 2704 bytes, which is actually slightly smaller than MySQL's limit of 3072 bytes, assuming the default InnoDB storage engine.

That said, when using utf8mb4 in an index key, MySQL uses the "worst case" of each character being 4 bytes. So it effectively limits the max key size to 3072/4 = 768 characters, when a column is using the utf8mb4 character set.

For practical purposes, this doesn't cause much pain, as it's generally inadvisable to use complete long-ish strings as a key. And there are various workarounds, like using prefixes or hashes as the key, or using binary strings as keys to get the full 3072 bytes (if you don't need collation behaviors).

> So it effectively limits the max key size to 3072/4 = 768 characters, when a column is using the utf8mb4 character set.

This is exactly what I mean. 768 characters for an index is woefully bad. And for no obviously great reason: you can just index the encoded UTF-8 text.

This was literally reason why a former company (who will remain nameless) refused to add Unicode support. It's not even an imagined problem.

  • You should not be indexing 768 characters in any circumstance I can imagine. Go ahead and try it. Spin up two tables, fill them with a few million rows, and slap and index on them. Give one a reasonable prefix limit, and let the other go wild. Make sure you ANALYZE each, then run queries in a loop and check the times.

    Spoiler: I literally did this a couple of days ago. The index size bloat means that any possible savings you might have gained from collisions are obliterated from page fetches. I tested with a measly 128 characters vs. a prefix of 16, and that was enough for the average query time to be equal, with the smaller index winning for the minimum.

  • Why did you need to index fairly long strings in their entirety in a way that preserves collation behaviors?

    And why is a 768 character limit woefully bad, but a 2704 character limit is totally fine?

    • A URL, for instance, can't be safely stored in 768 characters, but it can be stored safely in 2704. If you then wanted to sort those URLs so that all URLs for each domain and path within that domain are adjacent, you need an index. Especially if you want to paginate over them with a cursor. Doing that without an index on the raw value is a royal pain in the ass.

      Hell, even just being able to sort user-submitted strings up to a kilobyte. Why up to a kilobyte? Some users have strings that are kind of long. If I have to define a second column that's the truncated prefix, that's just a silly waste of space because MySQL decided to use utf-32 under the hood.

      3 replies →

    • You might just load someone else's data, and the index is desirable in general for speeding up analytic queries. It's possible to work around that, of course. But depending on what you do, it can make writing efficient queries against the data more difficult. That's just a distraction because most of the time, those long columns won't matter anyway.