feat: add support for big-endian systems and big-/little-endian interop #36

Merged: peterboncz merged 1 commit into cwida:master from mhx:mhx/big-endian-interop on Oct 7, 2025

@mhx (Contributor) commented Aug 6, 2025

Before this change, fsst did not even work on big-endian systems:

```
$ ./binary /usr/share/dict/words words.fsst
Compressed 2486824 bytes into 3239956 bytes ==> 130%
$ ./binary -d words.fsst words.dec
Decompressed 3239953 bytes into 2486824 bytes ==> 76%
$ head /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
$ head words.dec

A
v
wo
Ao
Aoc
Aoc
Aot
```

With this change, it works correctly on big-endian systems and delivers
the exact same result as on little-endian systems. Furthermore, the
symbol tables produced by `fsst_export()` will always use little-endian
version headers, and `fsst_import()` will always expect little-endian
version headers, regardless of which system the code is running on. This
enables symbol table exchange between big- and little-endian systems, as
the remainder of the symbol table is byte-order-agnostic.
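
To illustrate the fixed-layout idea behind the version header, here is a minimal sketch that writes and reads a 64-bit value byte by byte, so the buffer looks the same on any host. The helper names `store_header_le`, `load_header_le`, and `check_header_layout` are hypothetical and are not part of the fsst API or of this PR; the actual change may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers (not the fsst API): serialize a 64-bit header value
// in a fixed little-endian byte layout, independent of host byte order.
inline void store_header_le(std::uint64_t value, unsigned char* buf) {
    for (std::size_t i = 0; i < 8; ++i) {
        // Byte i receives bits [8*i, 8*i+7]; the least significant byte goes first.
        buf[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
    }
}

// Reassemble the value from the same fixed layout.
inline std::uint64_t load_header_le(const unsigned char* buf) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    }
    return value;
}

// Small self-check: the byte layout does not depend on the host.
inline void check_header_layout() {
    unsigned char buf[8];
    store_header_le(0x0102030405060708ULL, buf);
    assert(buf[0] == 0x08 && buf[7] == 0x01);
    assert(load_header_le(buf) == 0x0102030405060708ULL);
}
```

Because the shifts operate on values rather than on memory, this approach needs no endianness detection at all, at the cost of relying on the compiler to turn the loops into a single load or store.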

The change is fully backwards-compatible.

On little-endian systems, the code should behave exactly as before. On
big-endian systems, the numeric 64-bit value of a symbol will be swapped
as needed and will always be stored as little-endian. There is certainly
some overhead in doing this, but it is much better than not being able
to use fsst at all on big-endian systems.

@peterboncz (Collaborator)

Thanks for this PR! And apologies for not noticing this earlier.

It would be good to also int64swap code when symbol tables are serialized. I propose to serialize in little endian, given the dominance of that byte order. Note that many applications (e.g. file formats) serialize FSST symbol tables and files should be readable across platforms.

@mhx (Contributor, Author) commented Oct 7, 2025

> Thanks for this PR! And apologies for not noticing this earlier.

No worries!

> It would be good to also int64swap code when symbol tables are serialized. I propose to serialize in little endian, given the dominance of that byte order. Note that many applications (e.g. file formats) serialize FSST symbol tables and files should be readable across platforms.

I believe my change already does exactly that. :)

In fact, it should always treat symbol tables as little-endian, regardless of the architecture. All symbol loads and stores will swap on big-endian and be a no-op on little-endian: https://github.com/cwida/fsst/pull/36/files#diff-4f60b51dffa00f29fd7202953579c52f888af318e0a657b973ed8114330a956fR103-R104
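
As a rough illustration of the "swap on big-endian, no-op on little-endian" pattern, the sketch below detects the host byte order at run time and swaps only when needed. All names here (`host_is_little_endian`, `byteswap64`, `host_to_le64`, `load_symbol`) are hypothetical and are not copied from the PR diff; the real code may well use compile-time detection or compiler builtins instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers, not taken from the PR.

// Detect host byte order at run time by inspecting the first byte of a known value.
inline bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the eight bytes of a 64-bit value.
inline std::uint64_t byteswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Convert between host order and little-endian; the same function works both ways.
inline std::uint64_t host_to_le64(std::uint64_t v) {
    return host_is_little_endian() ? v : byteswap64(v);
}

// Load up to 8 raw bytes as a 64-bit value so that the first byte of the
// string always ends up in the least significant byte, on any host.
inline std::uint64_t load_symbol(const unsigned char* bytes, std::size_t len) {
    std::uint64_t raw = 0;
    std::memcpy(&raw, bytes, len < 8 ? len : 8);
    return host_to_le64(raw);
}

// Small self-check using the three-byte string "abc".
inline void check_load_symbol() {
    const unsigned char s[3] = {'a', 'b', 'c'};
    assert(load_symbol(s, 3) == 0x636261ULL);
}
```

On a little-endian host `host_to_le64` returns its argument unchanged, so the only added cost there is the branch, which a compile-time check would remove entirely.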

At least, I can successfully interchange symbol tables between big-/little-endian using just `fsst_export` and `fsst_import`, without any additional processing of the buffer.

Please let me know if I'm missing something.

@mhx (Contributor, Author) commented Oct 7, 2025

To back up my claim:

I'm using FSST in DwarFS to store lookup tables for file names and symbolic links in the file system image metadata (with really good compression ratios, often better than 2x).

I've tested that file system images built on little-endian produce the exact same file system representation on both big- and little-endian systems (there are also unit tests for that). Without this PR, this didn't work at all.

@peterboncz peterboncz merged commit 89f49c5 into cwida:master Oct 7, 2025

@peterboncz (Collaborator)

I see, thanks. Nice to hear DwarFS uses FSST!

Mytherin added a commit to duckdb/duckdb that referenced this pull request Dec 22, 2025
The PR fixes big-endian related issues with these:
- FSST compression ("gracefully" merges the latest of
https://github.com/cwida/fsst to pick up [BE
support](cwida/fsst#36));
- arrow conversion;
- GEOMETRY and HASH types;
- md5 functions.