Add FSST as compression codec #62670
Conversation
Ulad seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: More than one changelog category specified: 'New Feature', 'Improvement'
Ref #34246
This is an automated comment for commit e6a7325 with a description of existing statuses. It's updated for the latest CI run.
Option 1: Expand the ICompressionCodec interface: add an optional Filter parameter to the decompression method, a structure containing, for example, required_substrings. The codec pretends to decompress all data but has the right to decompress unnecessary strings as empty. The filter is created in the query interpreter (more precisely, during query analysis) and is then carried through all data reading interfaces down to decompression.

Option 2: If a String column (possibly Nullable, or Array of String) uses the FSST codec, return a ColumnFSST instead of a ColumnString during deserialization. This is a new type of column that contains undecompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut); in almost all other cases, it must first be materialized into a full-fledged column.

Without these ways to accelerate queries, the codec does not make much sense: in raw performance and compression rate, it is strictly worse than LZ4 or ZSTD.
|
|
```cpp
void registerCodecFSST(CompressionCodecFactory & factory)
{
    auto codec_builder = [&](const ASTPtr & arguments) -> CompressionCodecPtr
```
I guess we should restrict FSST to String and FixedString columns here. Check the registerCodec method in src/Compression/CompressionCodecGorilla.cpp (which is a floating-point-only codec) to see how to do that.

Also, we should check that no further parameters were passed into CODEC FSST() when defined in SQL. Please see the registerCodec method in src/Compression/CompressionCodecGCD.cpp.

Both cases also need a negative SQL test, i.e. a test case in 02973_fsst_code_test_data.sql that expects an exception:

```sql
CREATE TABLE table_fsst_codec (n String CODEC(FSST('an_additional_argument'))) ENGINE = Memory; -- { serverError ILLEGAL_SYNTAX_FOR_CODEC_TYPE }
```
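A second negative test could cover the type restriction itself once the codec is limited to String/FixedString. A hedged sketch: the exact serverError name depends on how the check is implemented, so treat the error code below as an assumption, not the codec's actual behavior:

```sql
-- Hypothetical negative test: FSST on a non-string column should be rejected.
-- The error code is illustrative; use whatever the type check actually raises.
CREATE TABLE table_fsst_codec_bad (n UInt64 CODEC(FSST)) ENGINE = Memory; -- { serverError BAD_ARGUMENTS }
```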
```cpp
void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }
```
|
|
```cpp
static const int OUT_SIZE = 2281337;
```

Magic constant, please add a comment behind it explaining how it was calculated.
|
|
```cpp
namespace DB
{
```
|
|
For my understanding: Why are we using the original FSST version and not the FSST12 variant?

Suggested addition:

```cpp
/// Implements FSST compression based on https://github.com/cwida/fsst
```

```cpp
{
public:
    explicit CompressionCodecFSST() { setCodecDescription("FSST"); }
```
|
|
Could we mark the codec experimental for now? (ICompressionCodec::isExperimental)
```cmake
option(ENABLE_FSST "Enable FSST (Fast Static Symbol Table)" ${ENABLE_LIBRARIES})

if (NOT ENABLE_FSST)
```
Note to myself: the AVX512 detection at the beginning of fsst_avx512.cpp looks x86-specific. Test what happens on ARM; possibly exclude fsst_avx512.cpp from compilation on non-x86 platforms.
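One way the exclusion could look, sketched under the assumption that the bundled build uses ClickHouse's usual ARCH_AMD64 flag and a contrib-style CMakeLists; the file paths and variable names below are illustrative, not the PR's actual build script:

```cmake
# contrib/fsst-cmake/CMakeLists.txt (hypothetical sketch)
set (SRCS "${LIBRARY_DIR}/libfsst.cpp")

# fsst_avx512.cpp contains x86-only intrinsics; compile it only on x86_64.
if (ARCH_AMD64)
    list (APPEND SRCS "${LIBRARY_DIR}/fsst_avx512.cpp")
endif ()

add_library (_fsst ${SRCS})
add_library (ch_contrib::fsst ALIAS _fsst)
```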
rschu1ze left a comment:
I am curious how the compression ratio and performance compare to ZSTD. You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets and re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.
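A quick way to compare on-disk sizes per codec is the system.columns table, which exposes compressed and uncompressed byte counts. The table name below is illustrative:

```sql
SELECT
    name,
    compression_codec,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'trips'
ORDER BY ratio DESC;
```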
src/CMakeLists.txt (outdated):

```cmake
endif ()

target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::lz4)
# target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::fsst)
```

```cmake
)

if (TARGET ch_contrib::fsst)
    dbms_target_link_libraries(PRIVATE ch_contrib::fsst)
```
Please try whether things still build if we only link clickhouse_compression, i.e. omit line 422.
|
|
```cpp
namespace DB
{
```
|
|
Suggested addition:

```cpp
/// Implements FSST compression based on https://github.com/cwida/fsst
```
|
|
```cpp
void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }
```
|
|
```cpp
static constexpr int out_size = 2281337;
```

Needs a comment which explains how the constant was selected.
|
|
```cpp
size_t len_out[rows_count];
unsigned char * str_out[rows_count];
size_t header_size{fsst_header_size + sizeof(rows_count) + sizeof(len_out) + (sizeof(size_t) * len_in.size())};
```

Minor: Let's use standard `=` initialization instead of uniform initialization. The former is how it is done in the rest of the codebase. Please change all such places in this file.

```cpp
    }
}
```
|
|
```cpp
UInt32 getMaxCompressedDataSize(UInt32 uncompressed_size) const override { return uncompressed_size + FSST_MAXHEADER; }
```

Let's move lines 110-113 up to line 31 so all "metadata" functions are in one place.
|
|
```cpp
while (data != end)
{
    UInt64 cur_len;
```

I am pleasantly surprised that ClickHouse serializes String columns as consecutive strings instead of as two separate arrays (strings + lengths). That, of course, simplifies the job of splitDataByRows.
|
|
```cpp
void doDecompressData(const char * source, UInt32 source_size, char * dest, UInt32 uncompressed_size) const override
{
    UNUSED(uncompressed_size, source_size);
```

Let's do this instead in line 78: `UInt32 /*uncompressed_size*/`
```cpp
    ASSERT_THROW(codec->decompress(source, source_size, memory.data()), Exception);
}

TEST(FSSTTest, CompressDecompress)
```

C++ unit tests are kind of underused in ClickHouse. Imho, all of what this test does can be achieved with a SQL-based test. The advantage of the latter is that it will run with all kinds of sanitizers in CI. I suggest removing this test and extending the SQL test instead.
|
|
```sql
CREATE TABLE table_fsst_codec (n String CODEC(FSST)) ENGINE = Memory;
INSERT INTO table_fsst_codec VALUES ('Hello'), ('world'), ('!');
SELECT * FROM table_fsst_codec;
```
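In the spirit of the suggestion to extend the SQL test, additional cases could exercise empty strings, repetitive data, and round-trips over many rows. A sketch; the table name is illustrative:

```sql
CREATE TABLE table_fsst_codec_many (n String CODEC(FSST)) ENGINE = Memory;
INSERT INTO table_fsst_codec_many SELECT concat('prefix_', toString(number % 100)) FROM numbers(10000);
INSERT INTO table_fsst_codec_many VALUES (''), ('a'), (repeat('x', 1000));
SELECT count(), sum(length(n)) FROM table_fsst_codec_many;
DROP TABLE table_fsst_codec_many;
```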
```cpp
{
    dest = writeVarUInt(len_in[i], dest);
```
|
|
```cpp
auto decompressed_size = fsst_decompress(
```

Docs in fsst.h say "If > size, the decoded output is truncated to size." Can this happen? Should we check?
Hello! Here are the compression ratios on various datasets:

- Trips, pickup_ntaname and dropoff_ntaname columns: (results table not reproduced)
- Amazon reviews, review_body column: (results table not reproduced)
- Reddit comments, body column: (results table not reproduced)

To summarize, FSST is significantly inferior to ZSTD in terms of compression ratio, but in some cases (mainly on large rows) it surpasses LZ4. We will add the results of the performance measurement soon.
|
Interesting, thanks. The comparison with LZ4 in terms of compression rate confirms the findings in the FSST paper. And of course, ZSTD compresses really well... I guess the true advantage of FSST as a light-weight compression codec, compared to the existing heavy-weight codecs in ClickHouse, is its ability for random access. Unfortunately, @Pelanglene, I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov), namely the two options described at the top of this thread: Option 1, expanding the ICompressionCodec interface with an optional decompression Filter (e.g. required_substrings), and Option 2, deserializing FSST-coded String columns into a lazy ColumnFSST.
At the moment, I'm trying to implement option 2. Option 1 seemed almost impossible to me because of the huge number of abstractions the filter would have to be threaded through: streams, compressed buffers, various Select processor components, and so on. The second approach I'm implementing resembles CompressedColumn, which is built on a slightly similar idea.
|
Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
|
There are unfortunately too many unknown unknowns, e.g. the performance (compression/decompression) and size impact of ColumnFSST, and the implementation is not finished. I still like the direction (FSST as the internal representation for String/FixedString columns, as well as predicate pushdown to the codec level), but this needs more experimentation. I guess we should give it another try next year.
|
@rschu1ze, the next year is today. |










Closes #34246
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added FSST (Fast Static Symbol Table) as a new compression codec for string columns.
Documentation entry for user-facing changes