feat: add support for big-endian systems and big-/little-endian interop by mhx · Pull Request #36 · cwida/fsst

mhx · 2025-08-06T21:26:12Z

Before this change, fsst did not even work on big-endian systems:

$ ./binary /usr/share/dict/words words.fsst
Compressed 2486824 bytes into 3239956 bytes ==> 130%
$ ./binary -d words.fsst words.dec
Decompressed 3239953 bytes into 2486824 bytes ==> 76%
$ head /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
$ head words.dec

A
v
wo
Ao
Aoc
Aoc
Aot

With this change, it works correctly on big-endian systems and delivers the exact same result as on little-endian systems. Furthermore, the symbol tables produced by fsst_export() will always use little-endian version headers, and fsst_import() will always expect little-endian version headers, regardless of which system the code is running on. This enables symbol table exchange between big- and little-endian systems, as the remainder of the symbol table is byte-order-agnostic.
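To illustrate why a fixed little-endian header makes the exported buffer portable, here is a minimal sketch of writing and reading a 64-bit field with shifts only. The helper names `write_le64` and `read_le64` are hypothetical and not taken from the fsst sources, and the PR itself may implement this differently (for example with a conditional byte swap).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not part of the fsst API: serialize a 64-bit header
// field in little-endian byte order. Only shifts are used, so the bytes
// written are identical on big- and little-endian hosts.
inline void write_le64(unsigned char* buf, std::uint64_t value) {
    for (std::size_t i = 0; i < 8; ++i) {
        // Byte i receives bits [8*i, 8*i + 7]; least significant byte first.
        buf[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
    }
}

// Reassemble the value from its little-endian byte sequence.
inline std::uint64_t read_le64(const unsigned char* buf) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    }
    return value;
}
```

Because neither helper reinterprets memory as an integer, a buffer written on one architecture decodes to the same value on the other.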

The change is fully backwards-compatible.

On little-endian systems, the code should behave exactly as before. On big-endian systems, the numeric 64-bit value of a symbol will be swapped as needed and will always be stored as little-endian. There is certainly some overhead in doing this, but it is much better than not being able to use fsst at all on big-endian systems.


peterboncz · 2025-10-07T13:43:45Z

Thanks for this PR! And apologies for not noticing this earlier..

It would be good to also int64swap code when symbol tables are serialized. I propose to serialize in little endian, given the dominance of that byte order. Note that many applications (e.g. file formats) serialize FSST symbol tables and files should be readable across platforms.

mhx · 2025-10-07T14:51:10Z

> Thanks for this PR! And apologies for not noticing this earlier..

No worries!

> It would be good to also int64swap code when symbol tables are serialized. I propose to serialize in little endian, given the dominance of that byte order. Note that many applications (e.g. file formats) serialize FSST symbol tables and files should be readable across platforms.

I believe my change already does exactly that. :)

In fact, it should always treat symbol tables as little-endian, regardless of the architecture. All symbol loads and stores will swap on big-endian and be a no-op on little-endian: https://github.com/cwida/fsst/pull/36/files#diff-4f60b51dffa00f29fd7202953579c52f888af318e0a657b973ed8114330a956fR103-R104
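The "swap on big-endian, no-op on little-endian" pattern can be sketched as follows. This is an illustration under assumptions, not the code from the PR: all function names are hypothetical, and the host byte order is probed at runtime here for simplicity, whereas real code would more likely use a compile-time check.

```cpp
#include <cstdint>
#include <cstring>

// Runtime probe of the host byte order (illustrative only).
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Reverse the byte order of a 64-bit value.
inline std::uint64_t byteswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Convert between host order and little-endian. The conversion is its own
// inverse, and it returns the input unchanged on little-endian hosts.
inline std::uint64_t le64_swap_if_needed(std::uint64_t v) {
    return host_is_little_endian() ? v : byteswap64(v);
}

// Hypothetical load/store helpers for an 8-byte symbol kept in memory as
// little-endian, so its first character is always in the low-order byte.
inline std::uint64_t load_symbol_le(const unsigned char* p) {
    std::uint64_t raw;
    std::memcpy(&raw, p, sizeof raw);
    return le64_swap_if_needed(raw);
}

inline void store_symbol_le(unsigned char* p, std::uint64_t v) {
    const std::uint64_t raw = le64_swap_if_needed(v);
    std::memcpy(p, &raw, sizeof raw);
}
```

With helpers like these, the bytes `a b c d 0 0 0 0` load as the same numeric value on either architecture, which is what code that extracts characters with shifts and masks relies on.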

At least, I can successfully interchange symbol tables between big-/little-endian using just fsst_export and fsst_import, without any additional processing of the buffer.

Please let me know if I'm missing something.

mhx · 2025-10-07T15:02:27Z

To back up my claim:

I'm using FSST in DwarFS to store lookup tables for file names and symbolic links in the file system image metadata (with really good compression ratios, often better than 2x).

I've tested that file system images built on little-endian produce the exact same file system representation on both big- and little-endian systems (there are also unit tests for that). Without this PR, this didn't work at all.

peterboncz · 2025-10-07T15:05:55Z

I see, thanks. Nice to hear DwarFS uses FSST!

The PR fixes big-endian related issues with these:

- FSST compression ("gracefully" merges the latest of https://github.com/cwida/fsst to pick up [BE support](cwida/fsst#36));
- arrow conversion;
- GEOMETRY and HASH types;
- md5 functions.

mhx mentioned this pull request Aug 6, 2025

Mixed-endian decoding? #35

Closed

mhx mentioned this pull request Aug 19, 2025

Endless loop in buildSymbolTable on 32-bit ARM / gcc #37

Closed

peterboncz merged commit 89f49c5 into cwida:master Oct 7, 2025

DNikolaevAtRocket mentioned this pull request Dec 17, 2025

Big-endian patches (FSST, arrow, types, md5) duckdb/duckdb#20237

Merged

peterboncz merged 1 commit into cwida:master from mhx:mhx/big-endian-interop
