I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.
SQLite has no built-in compression support, MySQL/MariaDB have page-level compression that doesn't work particularly well and that I've never seen anyone enable in production, and Postgres has per-value compression (TOAST), which is good for extremely long strings but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URLs, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to set up manually. E.g. wait until your table has 100,000 values, build a dictionary from those, and then that dictionary is set in stone and used for the next 10,000,000 rows as well, unless you rebuild it at some point in the future (which would be an expensive operation).
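For what it's worth, here's a rough sketch of that build-once idea using zlib's preset-dictionary (zdict) support from the Python standard library; the helper names and the 32 KB cap are my own choices, not anything from the post:

    import zlib
    from collections import Counter

    # Sketch of "build a dictionary once, freeze it, reuse it for new rows".
    # zlib's preset dictionary is limited by its 32 KB window.
    def build_shared_dict(sample_values, max_bytes=32 * 1024):
        # zlib prefers the most common strings near the end of the dictionary.
        common = [v for v, _ in Counter(sample_values).most_common()]
        blob = "".join(reversed(common)).encode("utf-8")
        return blob[-max_bytes:]

    def compress_value(value, shared_dict):
        c = zlib.compressobj(zdict=shared_dict)
        return c.compress(value.encode("utf-8")) + c.flush()

    def decompress_value(blob, shared_dict):
        d = zlib.decompressobj(zdict=shared_dict)
        return (d.decompress(blob) + d.flush()).decode("utf-8")

    # Build once from the first batch of rows, then reuse unchanged afterwards.
    shared = build_shared_dict(["https://example.com/a", "https://example.com/b"])
    packed = compress_value("https://example.com/c", shared)
    assert decompress_value(packed, shared) == "https://example.com/c"

Rebuilding the dictionary later would just mean re-running build_shared_dict over a newer sample and rewriting the rows that reference it.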
> Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
Enums? Or a foreign key to a table with (id bigint generated always as identity, text text)?
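A minimal sketch of that lookup-table approach, here via Python's stdlib sqlite3 since the DDL above is Postgres-flavored; the table and column names are invented:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dict_text (id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL);
        CREATE TABLE events (id INTEGER PRIMARY KEY,
                             url_id INTEGER REFERENCES dict_text(id));
    """)

    def intern_text(value):
        # Insert-if-absent, then look up the id the value maps to.
        con.execute("INSERT OR IGNORE INTO dict_text(value) VALUES (?)", (value,))
        return con.execute("SELECT id FROM dict_text WHERE value = ?",
                           (value,)).fetchone()[0]

    con.execute("INSERT INTO events(url_id) VALUES (?)",
                (intern_text("https://example.com/a"),))

That only pays off when whole values repeat, though; it does nothing for repeated substrings within otherwise distinct strings.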
> I have databases I know would be reduced in size by at least half.
Most people don't employ these strategies because storage is cheap and compute time is expensive.
Strings in a textual index are already compressed, with common-prefix compression or other schemes, and they remain perfectly queryable. I'm not sure whether their compression scheme is for indexes or for data columns.
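A toy illustration of that common-prefix idea (a generic sketch, not any particular engine's on-disk format):

    import os

    def prefix_encode(sorted_keys):
        # Store each key as (length of prefix shared with the previous key, suffix).
        out, prev = [], ""
        for key in sorted_keys:
            shared = len(os.path.commonprefix([prev, key]))
            out.append((shared, key[shared:]))
            prev = key
        return out

    def prefix_decode(encoded):
        keys, prev = [], ""
        for shared, suffix in encoded:
            prev = prev[:shared] + suffix
            keys.append(prev)
        return keys

    keys = ["https://a.example/x", "https://a.example/y", "https://b.example/z"]
    assert prefix_decode(prefix_encode(keys)) == keys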
A global column dictionary adds more complexity than usual: now you are touching more pages than just the index pages and the data page. The dictionary entries are sorted, so you also need to worry about page expansion and contraction. They sidestep those problems by making the dictionary immutable, presumably building it up front by scanning all the data.
Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
Storing the strings themselves as dictionary IDs is a good idea, as they can be processed quickly with SIMD.
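Roughly what that looks like, with numpy standing in for the SIMD an engine would actually use; the data and names here are made up:

    import numpy as np

    dictionary = ["DE", "FR", "US"]                          # code -> string, built elsewhere
    codes = np.array([2, 0, 1, 2, 2, 0], dtype=np.uint32)    # one fixed-width code per row

    # Evaluate the predicate once against the dictionary, then scan only integers.
    want = dictionary.index("US")
    matching_rows = np.flatnonzero(codes == want)            # vectorized compare over the codes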
> Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
I believe the reason is that FSST allows access to individual strings in the compressed corpus, which is required for fast random access. This is more important for OLTP than OLAP, I assume.
More standard compression algorithms, such as zstd, might decompress very fast, but I don't think they allow that.
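A toy contrast of the two access patterns, using zlib both as the block codec and as a stand-in for a per-string shared-dictionary scheme; this only illustrates the random-access property, it isn't FSST:

    import zlib

    rows = [f"user-{i}@example.com" for i in range(10_000)]

    # Block codec: one compressed frame per column chunk; reading one row
    # means decompressing the whole chunk first.
    block = zlib.compress("\n".join(rows).encode())
    def read_from_block(i):
        return zlib.decompress(block).decode().split("\n")[i]

    # Per-row encoding against a shared dictionary: each row decodes on its own.
    shared = b"@example.comuser-"
    encoded = []
    for r in rows:
        c = zlib.compressobj(zdict=shared)
        encoded.append(c.compress(r.encode()) + c.flush())
    def read_row(i):
        d = zlib.decompressobj(zdict=shared)
        return (d.decompress(encoded[i]) + d.flush()).decode()

    assert read_from_block(123) == read_row(123)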
There are some databases that can move an entire column into the index. But that's mostly going to work for schemas where the number of distinct values is <<< rowcount, so that you're effectively interning the rows.
1. A shared dictionary complicates and slows down updates, which typically matter more in OLTP than in OLAP.
2. It is generally bad for high-cardinality columns, so the engine would have to track cardinality to decide when to apply it, which further complicates things.
3. Lastly, the additional operational complexity (like the table maintenance scheme you described in your last paragraph) could reduce system reliability, and vendors might decide it's not worth the price, or that it's against their philosophy.