[Compression] Patas Compression (float/double) (variation on Chimp)#5044
Mytherin merged 63 commits into duckdb:master
Conversation
Mytherin left a comment:
Thanks for the PR! The code all looks good to me.
One comment:
```cpp
patas_state.StartNewSegment();
const auto final_analyze_size = patas_state.TotalUsedBytes();
// printf("ANALYZE: ROWGROUP_SIZE: %llu\n", final_analyze_size);
return final_analyze_size;
```
Both Patas and Chimp return the total used bytes, which means Patas will almost never be chosen, as it generally has a slightly worse compression ratio. Could we penalize Chimp (and perhaps penalize Patas slightly over uncompressed)? e.g. Chimp could have a *2 multiplier, and Patas a *1.2 or so.
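A minimal sketch of that penalty (the 2x and 1.2x multipliers are just the examples from this comment, not final values):

```cpp
#include <cstdint>

// Illustrative only: scale the estimated size so that algorithms with
// slower decompression must win by a larger margin to be chosen.
static double ScoreChimp(uint64_t used_bytes) {
	return used_bytes * 2.0; // heavier penalty: Chimp decompresses slowest
}

static double ScorePatas(uint64_t used_bytes) {
	return used_bytes * 1.2; // slight penalty over uncompressed
}
```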
Yes, I remember you mentioned I should do this, just didn't really click how yet.
But this way makes perfect sense, I'll add it 👍
I think we actually want to have this function return a struct at some point in the future, e.g. something like:

```cpp
struct CompressionInfo {
	idx_t compression_speed;
	idx_t decompression_speed;
	idx_t estimated_size;
};
```

That way, we can check what we care about ourselves and add e.g. a configuration. We could also do stuff like, "for temporary tables we care less about estimated size and more about decompression speed than for persistent tables", etc.
Anyway, future PR :)
I like that idea, though I feel those would kind of be magic numbers.
Maybe instead make them doubles, defined relative to the speed of uncompressed (though that could also differ per type..)
That way we would only need to measure the speed of uncompressed and apply the multiplier to figure out the (estimated) speed of the compression algorithm.
We could even expose those constants to be able to verify in tests that they hold true (by a margin)
… the compression ratio outweighs the slowness of decompression
Thanks!
This PR adds another compression method, which is very similar to Chimp, but can decompress in a fraction of the time that Chimp takes to decompress (from 5x slower to 2x slower than uncompressed).
Chimp stores 2 flag bits to indicate which compression method was used; Patas takes one of those methods and applies a slight variation.
Optimizing for highest trailing zeros
Both make use of a very clever way of maximizing the trailing zeros (taken directly from the Chimp128 paper): the least significant bits of the number are used to index into an array, which in turn stores an index into a circular buffer of previous values (with a size of 128).
Using this we can find the value that shares the least significant bits with the value we're currently compressing.
Because these bits are identical, the XOR result will turn all those bits into 0s.
The reference index is the difference between our current index and that of the previous value (always between 0-127).
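A rough sketch of this lookup, assuming illustrative names and a hypothetical 14-bit key (the real Chimp128/Patas code differs in details, e.g. it also checks that the candidate has not been overwritten):

```cpp
#include <cstdint>

constexpr uint64_t RING_SIZE = 128;           // circular buffer capacity
constexpr uint64_t KEY_BITS = 14;             // low bits used as the hash key (assumed width)
constexpr uint64_t KEY_MASK = (1ULL << KEY_BITS) - 1;

struct RingBuffer {
	uint64_t values[RING_SIZE] = {};          // previously seen values
	uint64_t indices[1ULL << KEY_BITS] = {};  // low-bits key -> insertion count of last match
	uint64_t count = 0;                       // total values inserted so far

	// Find the stored value sharing the low bits of 'value' and XOR against it;
	// matching low bits become trailing zeros in the result.
	uint64_t FindReference(uint64_t value, uint64_t &ref_index) {
		uint64_t key = value & KEY_MASK;
		uint64_t candidate_count = indices[key];
		// Relative distance back to the candidate, always in [0, 127]
		ref_index = (count - candidate_count) % RING_SIZE;
		uint64_t candidate = values[candidate_count % RING_SIZE];
		return value ^ candidate;
	}

	void Insert(uint64_t value) {
		indices[value & KEY_MASK] = count;
		values[count % RING_SIZE] = value;
		count++;
	}
};
```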
Chimp
Chimp checks the number of trailing zeros in the XOR result, and if it exceeds a certain threshold, it compresses in this manner:
Combine the reference index (7 bits), the leading zeros (3 bits) and the number of significant bits (6 bits) into a single 2-byte integer.
Then store the significant bits.
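As a hedged illustration, the 7 + 3 + 6 = 16-bit packing could look like this (the exact bit layout, and the fact that Chimp buckets leading-zero counts to fit 3 bits, are assumptions of this sketch, not DuckDB's actual code):

```cpp
#include <cstdint>

// Pack the Chimp metadata into one 2-byte integer:
// bits 15..9: reference index, bits 8..6: (bucketed) leading zeros,
// bits 5..0: number of significant bits.
static uint16_t PackChimpHeader(uint8_t ref_index, uint8_t leading_zeros, uint8_t significant_bits) {
	return (uint16_t)((ref_index & 0x7F) << 9) |
	       (uint16_t)((leading_zeros & 0x07) << 6) |
	       (uint16_t)(significant_bits & 0x3F);
}
```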
Patas
What Patas does is similar:
Combine the reference index (7 bits), the number of significant bytes (3 bits), and the trailing zeros (6 bits) into a single 2-byte integer.
Then store the significant bytes.
By writing the significant bytes in a byte-aligned way, reading is much faster, because we never have to deal with bit-level offsets.
Also, because every value carries this "packed" 16 bits of metadata, there is no need for separate indices into every array of data we might read, which is the case in Chimp.
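To make the byte-aligned layout concrete, here is a simplified single-value Patas-style encoder (the field order, and the handling of edge cases like a zero XOR result or 8 significant bytes, are assumptions of this sketch):

```cpp
#include <cstdint>
#include <cstddef>

// Encode one XOR result: a 2-byte header (7-bit reference index,
// 3-bit significant-byte count, 6-bit trailing-zero count) followed by
// the significant bytes themselves, written byte-aligned.
static size_t PatasCompressOne(uint64_t xor_result, uint8_t ref_index, uint8_t *out) {
	// Count trailing zeros (this sketch ignores the xor_result == 0 edge case)
	uint8_t trailing = (uint8_t)__builtin_ctzll(xor_result);
	uint64_t significant = xor_result >> trailing;
	// Number of bytes needed for the remaining significant part
	uint8_t byte_count = 0;
	for (uint64_t tmp = significant; tmp; tmp >>= 8) {
		byte_count++;
	}
	// Pack the metadata into a single 2-byte header
	uint16_t header = (uint16_t)((ref_index & 0x7F) << 9) |
	                  (uint16_t)((byte_count & 0x07) << 6) |
	                  (uint16_t)(trailing & 0x3F);
	out[0] = (uint8_t)(header >> 8);
	out[1] = (uint8_t)(header & 0xFF);
	// Byte-aligned payload: no bit-level offsets needed when reading back
	for (uint8_t i = 0; i < byte_count; i++) {
		out[2 + i] = (uint8_t)(significant >> (8 * i));
	}
	return 2 + byte_count;
}
```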