Add FSST as compression codec #62670
Conversation
Ulad seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: More than one changelog category specified: 'New Feature', 'Improvement'
Ref #34246
This is an automated comment for commit e6a7325 with a description of existing statuses. It's updated for the latest CI run.
Option 1: Expand the ICompressionCodec interface: add an optional Filter parameter to the decompression method, a structure containing, for example, required_substrings. The codec pretends to decompress all data but has the right to decompress unnecessary strings as empty. The filter is created in the query interpreter (more precisely, during query analysis) and is then carried through all data reading interfaces down to decompression.

Option 2: If a String column (possibly Nullable, or Array of String) uses the FSST codec, return a ColumnFSST instead of a ColumnString during deserialization. This is a new type of column that contains undecompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut); in almost all other cases, it must first be materialized into a full-fledged column.

Without these ways to accelerate queries, the codec does not make much sense: in raw performance and compression rate, it is strictly worse than LZ4 or ZSTD.
|
|
```cpp
void registerCodecFSST(CompressionCodecFactory & factory)
{
    auto codec_builder = [&](const ASTPtr & arguments) -> CompressionCodecPtr
```
I guess we should restrict FSST to String and FixedString columns here. Check the registerCodec method in src/Compression/CompressionCodecGorilla.cpp (which is a floating-point-only codec) to see how to do that.

Also, we should check that no further parameters were passed into CODEC FSST() when defined in SQL. Please see the registerCodec method in src/Compression/CompressionCodecGCD.cpp.

Both cases also need a negative SQL test, i.e. a test case in 02973_fsst_code_test_data.sql that expects an exception:

```sql
CREATE TABLE table_fsst_codec (n String CODEC(FSST('an_additional_argument'))) ENGINE = Memory; -- { serverError ILLEGAL_SYNTAX_FOR_CODEC_TYPE }
```
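A second negative test could cover the type restriction itself once the codec is limited to String/FixedString. A hedged sketch: the exact serverError name depends on how the check is implemented, so treat the error code below as an assumption, not the codec's actual behavior:

```sql
-- Hypothetical negative test: FSST on a non-string column should be rejected.
-- The error code is illustrative; use whatever the type check actually raises.
CREATE TABLE table_fsst_codec_bad (n UInt64 CODEC(FSST)) ENGINE = Memory; -- { serverError BAD_ARGUMENTS }
```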
```cpp
void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }
```
|
|
```cpp
static const int OUT_SIZE = 2281337;
```

Magic constant, please add a comment behind it explaining how it was calculated.
|
|
```cpp
namespace DB
{
```
|
|
For my understanding: Why are we using the original FSST version and not the FSST12 variant?

Suggested addition:

```cpp
/// Implements FSST compression based on https://github.com/cwida/fsst
```

```cpp
{
public:
    explicit CompressionCodecFSST() { setCodecDescription("FSST"); }
```
|
|
Could we mark the codec experimental for now? (ICompressionCodec::isExperimental)
```cmake
option(ENABLE_FSST "Enable FSST (Fast Static Symbol Table)" ${ENABLE_LIBRARIES})

if (NOT ENABLE_FSST)
```
Note to myself: the AVX512 detection at the beginning of fsst_avx512.cpp looks x86-specific. Test what happens on ARM; possibly exclude fsst_avx512.cpp from compilation on non-x86 platforms.
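One way the exclusion could look, sketched under the assumption that the bundled build uses ClickHouse's usual ARCH_AMD64 flag and a contrib-style CMakeLists; the file paths and variable names below are illustrative, not the PR's actual build script:

```cmake
# contrib/fsst-cmake/CMakeLists.txt (hypothetical sketch)
set (SRCS "${LIBRARY_DIR}/libfsst.cpp")

# fsst_avx512.cpp contains x86-only intrinsics; compile it only on x86_64.
if (ARCH_AMD64)
    list (APPEND SRCS "${LIBRARY_DIR}/fsst_avx512.cpp")
endif ()

add_library (_fsst ${SRCS})
add_library (ch_contrib::fsst ALIAS _fsst)
```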
rschu1ze left a comment:
I am curious how the compression ratio and performance compare to ZSTD. You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets and re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.
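A quick way to compare on-disk sizes per codec is the system.columns table, which exposes compressed and uncompressed byte counts. The table name below is illustrative:

```sql
SELECT
    name,
    compression_codec,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'trips'
ORDER BY ratio DESC;
```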
src/CMakeLists.txt (outdated):

```cmake
endif ()

target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::lz4)
# target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::fsst)
```

```cmake
)

if (TARGET ch_contrib::fsst)
    dbms_target_link_libraries(PRIVATE ch_contrib::fsst)
```
Please try whether things still build if we only link clickhouse_compression, i.e. omit line 422.
|
|
```cpp
namespace DB
{
```
|
|
Suggested addition:

```cpp
/// Implements FSST compression based on https://github.com/cwida/fsst
```
|
|
```cpp
void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }
```
|
|
```cpp
static constexpr int out_size = 2281337;
```

Needs a comment which explains how the constant was selected.
|
|
```cpp
size_t len_out[rows_count];
unsigned char * str_out[rows_count];
size_t header_size{fsst_header_size + sizeof(rows_count) + sizeof(len_out) + (sizeof(size_t) * len_in.size())};
```

Minor: Let's use standard `=` initialization instead of uniform initialization. The former is how it is done in the rest of the codebase. Please change all such places in this file.

```cpp
    }
}
```
|
|
```cpp
UInt32 getMaxCompressedDataSize(UInt32 uncompressed_size) const override { return uncompressed_size + FSST_MAXHEADER; }
```

Let's move lines 110-113 up to line 31 so all "metadata" functions are in one place.
|
|
```cpp
while (data != end)
{
    UInt64 cur_len;
```

I am pleasantly surprised that ClickHouse serializes String columns as consecutive strings instead of as two separate arrays (strings + lengths). That, of course, simplifies the job of splitDataByRows.
|
|
```cpp
void doDecompressData(const char * source, UInt32 source_size, char * dest, UInt32 uncompressed_size) const override
{
    UNUSED(uncompressed_size, source_size);
```

Let's do this instead in line 78: `UInt32 /*uncompressed_size*/`
```cpp
    ASSERT_THROW(codec->decompress(source, source_size, memory.data()), Exception);
}

TEST(FSSTTest, CompressDecompress)
```

C++ unit tests are kind of underused in ClickHouse. Imho, all of what this test does can be achieved with a SQL-based test. The advantage of the latter is that it will run with all kinds of sanitizers in CI. I suggest removing this test and extending the SQL test instead.
|
|
```sql
CREATE TABLE table_fsst_codec (n String CODEC(FSST)) ENGINE = Memory;
INSERT INTO table_fsst_codec VALUES ('Hello'), ('world'), ('!');
SELECT * FROM table_fsst_codec;
```
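In the spirit of the suggestion to extend the SQL test, additional cases could exercise empty strings, repetitive data, and round-trips over many rows. A sketch; the table name is illustrative:

```sql
CREATE TABLE table_fsst_codec_many (n String CODEC(FSST)) ENGINE = Memory;
INSERT INTO table_fsst_codec_many SELECT concat('prefix_', toString(number % 100)) FROM numbers(10000);
INSERT INTO table_fsst_codec_many VALUES (''), ('a'), (repeat('x', 1000));
SELECT count(), sum(length(n)) FROM table_fsst_codec_many;
DROP TABLE table_fsst_codec_many;
```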
```cpp
{
    dest = writeVarUInt(len_in[i], dest);
```
|
|
```cpp
auto decompressed_size = fsst_decompress(
```

Docs in fsst.h say "If > size, the decoded output is truncated to size." Can this happen? Should we check?
Hello! Here are the compression ratios on various datasets:

- Trips, pickup_ntaname and dropoff_ntaname columns: (results table not reproduced)
- Amazon reviews, review_body column: (results table not reproduced)
- Reddit comments, body column: (results table not reproduced)

To summarize, FSST is significantly inferior to ZSTD in terms of compression ratio, but in some cases (mainly on large rows) it surpasses LZ4. We will add the results of the performance measurement soon.
|
Interesting, thanks. The comparison with LZ4 in terms of compression rate confirms the findings in the FSST paper. And of course, ZSTD compresses really well... I guess the true advantage of FSST as a light-weight compression codec, compared to the existing heavy-weight codecs in ClickHouse, is its ability for random access. Unfortunately, @Pelanglene, I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov), namely the two options described at the top of this thread: Option 1, expanding the ICompressionCodec interface with an optional decompression Filter (e.g. required_substrings), and Option 2, deserializing FSST-coded String columns into a lazy ColumnFSST.
At the moment, I'm trying to implement option 2. Option 1 seemed almost impossible to me because of the huge number of abstractions the filter would have to be threaded through: streams, compressed buffers, various Select processor components, and so on. The second approach I'm implementing resembles CompressedColumn, which is built on a slightly similar idea.
|
Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
|
There are unfortunately too many unknown unknowns, e.g. the performance (compression/decompression) and size impact of ColumnFSST, and the implementation is not finished. I still like the direction (FSST as the internal representation for String/FixedString columns, as well as predicate pushdown to the codec level), but this needs more experimentation. I guess we should give it another try next year.
|
@rschu1ze, the next year is today. |










Closes #34246
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added FSST (Fast Static Symbol Table) as a new compression codec for string columns.
Documentation entry for user-facing changes