
[Compression] Patas Compression (float/double) (variation on Chimp)#5044

Merged
Mytherin merged 63 commits into duckdb:master from Tishj:patas_compression
Nov 3, 2022
Conversation

@Tishj (Contributor) commented Oct 20, 2022

This PR adds another compression method that is very similar to Chimp, but decompresses in a fraction of the time Chimp takes: roughly 2x slower than uncompressed, versus 5x slower for Chimp.

Chimp stores 2 flag bits to indicate which compression mode was used; Patas takes one of those modes and applies a slight variation to it.

Optimizing for highest trailing zeros

Both make use of a very clever way of maximizing trailing zeros (taken directly from the Chimp128 paper): the least significant bits of the number are used to index into an array that stores positions in a circular buffer (of size 128) holding previous values.
Using this, we can find the stored value that shares its least significant bits with the value we're currently compressing.
Because these bits are identical, the XOR result turns all of them into 0s.

The reference index that gets stored is the difference between our current index and that of the matched previous value (always between 0 and 127).

Chimp

Chimp checks the number of trailing zeros in the XOR result, and if it exceeds a certain threshold, it compresses as follows:
Combine the reference index (7 bits), the leading zeros (3 bits), and the number of significant bits (6 bits) into a single 2-byte integer.
Then store the significant bits.

Patas

What Patas does is similar:
Combine the reference index (7 bits), the number of significant bytes (3 bits), and the trailing zeros (6 bits) into a single 2-byte integer.
Then store the significant bytes.

Because the significant bytes are written byte-aligned, reading is much faster: we never have to deal with bit-level offsets.
Also, because every value carries this 16-bit "packed" header, there is no need for the separate per-array indices that Chimp requires.

Tishj added 30 commits October 6, 2022 12:39
… every value of a group, no memmove+memcpy needed anymore
@Mytherin (Collaborator) left a comment:

Thanks for the PR! The code all looks good to me.

One comment:

patas_state.StartNewSegment();
const auto final_analyze_size = patas_state.TotalUsedBytes();
// printf("ANALYZE: ROWGROUP_SIZE: %llu\n", final_analyze_size);
return final_analyze_size;
@Mytherin (Collaborator) commented:
Both Patas and Chimp return the total used bytes, which means Patas will almost never be chosen, as it generally has a slightly worse compression ratio. Could we penalize Chimp (and perhaps penalize Patas slightly over uncompressed)? e.g. Chimp could have a *2 multiplier, and Patas a *1.2 or so.

@Tishj (Contributor, author) replied:

Yes, I remember you mentioned I should do this; it just hadn't really clicked how yet.

But this way makes perfect sense, I'll add it 👍

@Mytherin (Collaborator) commented:

I think we actually want to have this function return a struct at some point in the future, e.g. something like:

struct CompressionInfo {
   idx_t compression_speed;
   idx_t decompression_speed;
   idx_t estimated_size;
};

That way, we can check what we care about ourselves and add e.g. a configuration. We could also do stuff like, "for temporary tables we care less about estimated size and more about decompression speed than for persistent tables", etc.

Anyway, future PR :)

@Tishj (Contributor, author) replied:

I like that idea, though I feel those would kind of be magic numbers.
Maybe instead make them doubles relative to the speed of uncompressed (though that could also differ per type).
That way we would only need to measure the speed of uncompressed and apply the multiplier to estimate the speed of each compression algorithm.
We could even expose those constants so tests can verify that they hold true (within a margin).

@Mytherin Mytherin merged commit d40d997 into duckdb:master Nov 3, 2022
@Mytherin (Collaborator) commented Nov 3, 2022:

Thanks!

@Tishj Tishj deleted the patas_compression branch November 7, 2025 16:17