feat: add support for big-endian systems and big-/little-endian interop (#36)
Before this change, fsst did not even work on big-endian systems:

```
$ ./binary /usr/share/dict/words words.fsst
Compressed 2486824 bytes into 3239956 bytes ==> 130%
$ ./binary -d words.fsst words.dec
Decompressed 3239953 bytes into 2486824 bytes ==> 76%
$ head /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
$ head words.dec
A
v
wo
Ao
Aoc
Aoc
Aot
```

With this change, it works correctly on big-endian systems and delivers the exact same result as on little-endian systems. Furthermore, the symbol tables produced by `fsst_export()` will always use little-endian version headers, and `fsst_import()` will always expect little-endian version headers, regardless of which system the code is running on. This enables symbol table exchange between big- and little-endian systems, as the remainder of the symbol table is byte-order-agnostic.

The change is fully backwards-compatible. On little-endian systems, the code should behave exactly as before. On big-endian systems, the numeric 64-bit value of a symbol will be swapped as needed and will always be stored as little-endian. There is certainly some overhead in doing this, but it is much better than not being able to use fsst at all on big-endian systems.
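For readers unfamiliar with the idea of a fixed little-endian header, here is a minimal sketch of one way to write and read a 64-bit value as little-endian bytes regardless of the host byte order. The helper names `write_u64_le` and `read_u64_le` are hypothetical and are not taken from fsst; the actual code in this PR may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of fsst): write v into buf as 8 bytes,
// least significant byte first. Shifts operate on the numeric value,
// so the result is the same on big- and little-endian hosts.
inline void write_u64_le(unsigned char* buf, std::uint64_t v) {
    for (std::size_t i = 0; i < 8; ++i) {
        buf[i] = static_cast<unsigned char>(v >> (8 * i));
    }
}

// Hypothetical helper (not part of fsst): read 8 little-endian bytes
// from buf and reassemble the numeric 64-bit value.
inline std::uint64_t read_u64_le(const unsigned char* buf) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        v |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    }
    return v;
}
```

Because the bytes are produced by shifting the value rather than by copying its in-memory representation, a header written this way on one architecture reads back to the same number on the other.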
Thanks for this PR! And apologies for not noticing this earlier. It would be good to also swap the int64 codes when symbol tables are serialized. I propose to serialize in little-endian, given the dominance of that byte order. Note that many applications (e.g. file formats) serialize FSST symbol tables, and files should be readable across platforms.
No worries!
I believe my change already does exactly that. :) In fact, it should always treat symbol tables as little-endian, regardless of the architecture. All symbol loads and stores will swap on big-endian, and be a no-op on little-endian: https://github.com/cwida/fsst/pull/36/files#diff-4f60b51dffa00f29fd7202953579c52f888af318e0a657b973ed8114330a956fR103-R104 At least, I can successfully interchange symbol tables between big-/little-endian using just Please let me know if I'm missing something.
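To illustrate the "swap on big-endian, no-op on little-endian" behaviour described above, here is a rough sketch. All names (`host_is_big_endian`, `byteswap64`, `load_le64`, `store_le64`) are hypothetical and are not the identifiers used in the PR; the runtime endianness probe is also an assumption chosen for portability, where real code would more likely use a compile-time check.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: detect host byte order at runtime by inspecting the
// first byte in memory of the 16-bit value 1.
inline bool host_is_big_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 0;  // low-order byte is not first => big-endian
}

// Reverse the byte order of a 64-bit value in three steps:
// swap adjacent bytes, then adjacent 16-bit pairs, then 32-bit halves.
inline std::uint64_t byteswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Load a 64-bit value that is stored in memory as little-endian.
inline std::uint64_t load_le64(const void* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);  // memcpy avoids unaligned access
    return host_is_big_endian() ? byteswap64(v) : v;
}

// Store a 64-bit value to memory as little-endian.
inline void store_le64(void* p, std::uint64_t v) {
    if (host_is_big_endian()) {
        v = byteswap64(v);
    }
    std::memcpy(p, &v, sizeof v);
}
```

On a little-endian host both helpers reduce to a plain `memcpy`, which matches the claim that existing behaviour there is unchanged.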
To back up my claim: I'm using FSST in DwarFS to store lookup tables for file names and symbolic links in the file system image metadata (with really good compression ratios of often better than 2x). I've tested that file system images built on little-endian produce the exact same file system representation on both big- and little-endian systems (there are also unit tests for that). Without this PR, this didn't work at all.
I see, thanks. Nice to hear DwarFS uses FSST!
The PR fixes big-endian related issues with these:
- FSST compression ("gracefully" merges the latest of
https://github.com/cwida/fsst to pick up [BE
support](cwida/fsst#36));
- arrow conversion;
- GEOMETRY and HASH types;
- md5 functions.