← Back to context

Comment by evanelias

7 hours ago

Why did you need to index fairly long strings in their entirety in a way that preserves collation behaviors?

And why is a 768 character limit woefully bad, but a 2704 character limit is totally fine?

A URL, for instance, can't be safely stored in 768 characters, but it can be stored safely in 2704. If you then wanted to sort those URLs so that all URLs for each domain and path within that domain are adjacent, you need an index. Especially if you want to paginate over them with a cursor. Doing that without an index on the raw value is a royal pain in the ass.

Hell, even just being able to sort user-submitted strings up to a kilobyte. Why up to a kilobyte? Some users have strings that are kind of long. If I have to define a second column that's the truncated prefix, that's just a silly waste of space because MySQL decided to use utf-32 under the hood.

  • To be honest, indexes aren't designed for that. They're meant for fast lookup of short identifiers. Things like people's names and product ID's. Not long URL's. It's not performant.

    If you need to keep a million long URL's in a defined sort order, my first recommendation would be, don't -- see if there's another way to achieve your end result. But if you absolutely have to, then create a new integer column to be your sort key, and use a little bit of extra code to give it values that produce the same sort order.

    Creating short numerical primary keys for long strings is a common database technique.

  • > it can be stored safely in 2704

    No, it can't. URL doesn't have any length limit, regardless of the fact that different software will impose different limits.

  • > A URL, for instance

    VARBINARY is typically ok for that I'd think? Then you can utilize the full 3072 byte limit for the key, since there's no character set applied.

    > even just being able to sort user-submitted strings up to a kilobyte

    As a software engineer, I completely agree. But as a DBA, I am obligated to make a "tsk" sound and scowl disapprovingly!

You might just load someone else's data, and the index is desirable in general for speeding up analytic queries. It's possible to work around that, of course. But depending on what you do, it can make writing efficient queries against the data more difficult. That's just a distraction because most of the time, those long columns won't matter anyway.