[Compression] Patas Compression (float/double) (variation on Chimp)#5044
Mytherin merged 63 commits into duckdb:master
Conversation
Mytherin left a comment:
Thanks for the PR! The code all looks good to me.
One comment:
```cpp
patas_state.StartNewSegment();
const auto final_analyze_size = patas_state.TotalUsedBytes();
// printf("ANALYZE: ROWGROUP_SIZE: %llu\n", final_analyze_size);
return final_analyze_size;
```
Both Patas and Chimp return the total used bytes, which means Patas will almost never be chosen, as it generally has a slightly worse compression ratio. Could we penalize Chimp (and perhaps penalize Patas slightly over uncompressed)? e.g. Chimp could have a *2 multiplier, and Patas a *1.2 or so.
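A minimal sketch of that penalty (the 2x and 1.2x multipliers are just the examples from this comment, not final values):

```cpp
#include <cstdint>

// Illustrative only: scale the estimated size so that algorithms with
// slower decompression must win by a larger margin to be chosen.
static double ScoreChimp(uint64_t used_bytes) {
	return used_bytes * 2.0; // heavier penalty: Chimp decompresses slowest
}

static double ScorePatas(uint64_t used_bytes) {
	return used_bytes * 1.2; // slight penalty over uncompressed
}
```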
Yes, I remember you mentioned I should do this, just didn't really click how yet.
But this way makes perfect sense, I'll add it 👍
I think we actually want to have this function return a struct at some point in the future, e.g. something like:

```cpp
struct CompressionInfo {
	idx_t compression_speed;
	idx_t decompression_speed;
	idx_t estimated_size;
};
```

That way, we can check what we care about ourselves and add e.g. a configuration. We could also do stuff like, "for temporary tables we care less about estimated size and more about decompression speed than for persistent tables", etc.
Anyway, future PR :)
I like that idea, though I feel those would kind of be magic numbers.
Maybe instead make them doubles, defined relative to the speed of uncompressed (though that could also differ per type..)
That way we would only need to measure the speed of uncompressed and apply the multiplier to figure out the (estimated) speed of the compression algorithm.
We could even expose those constants to be able to verify in tests that they hold true (by a margin)
… the compression ratio outweighs the slowness of decompression
Thanks!
This PR adds another compression method, which is very similar to Chimp, but can decompress in a fraction of the time that Chimp takes to decompress (from 5x slower to 2x slower than uncompressed).
Chimp stores 2 flag bits to indicate which compression method was used; Patas takes one of those methods and applies a slight variation.
Optimizing for highest trailing zeros
Both make use of a very clever way of maximizing the trailing zeros (taken directly from the Chimp128 paper): the least significant bits of the number are used to index into an array, which in turn stores an index into a circular buffer of previous values (with a size of 128).
Using this we can find the value that shares the least significant bits with the value we're currently compressing.
Because these bits are identical, the XOR result will turn all those bits into 0s.
The reference index is the difference between our current index and that of the previous value (always between 0-127).
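A rough sketch of this lookup, assuming illustrative names and a hypothetical 14-bit key (the real Chimp128/Patas code differs in details, e.g. it also checks that the candidate has not been overwritten):

```cpp
#include <cstdint>

constexpr uint64_t RING_SIZE = 128;           // circular buffer capacity
constexpr uint64_t KEY_BITS = 14;             // low bits used as the hash key (assumed width)
constexpr uint64_t KEY_MASK = (1ULL << KEY_BITS) - 1;

struct RingBuffer {
	uint64_t values[RING_SIZE] = {};          // previously seen values
	uint64_t indices[1ULL << KEY_BITS] = {};  // low-bits key -> insertion count of last match
	uint64_t count = 0;                       // total values inserted so far

	// Find the stored value sharing the low bits of 'value' and XOR against it;
	// matching low bits become trailing zeros in the result.
	uint64_t FindReference(uint64_t value, uint64_t &ref_index) {
		uint64_t key = value & KEY_MASK;
		uint64_t candidate_count = indices[key];
		// Relative distance back to the candidate, always in [0, 127]
		ref_index = (count - candidate_count) % RING_SIZE;
		uint64_t candidate = values[candidate_count % RING_SIZE];
		return value ^ candidate;
	}

	void Insert(uint64_t value) {
		indices[value & KEY_MASK] = count;
		values[count % RING_SIZE] = value;
		count++;
	}
};
```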
Chimp
Chimp checks the number of trailing zeros in the XOR result, and if it exceeds a certain threshold, it compresses in this manner:
Combine the reference index (7 bits), the leading zeros (3 bits) and the number of significant bits (6 bits) into a single 2-byte integer.
Then store the significant bits.
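As a hedged illustration, the 7 + 3 + 6 = 16-bit packing could look like this (the exact bit layout, and the fact that Chimp buckets leading-zero counts to fit 3 bits, are assumptions of this sketch, not DuckDB's actual code):

```cpp
#include <cstdint>

// Pack the Chimp metadata into one 2-byte integer:
// bits 15..9: reference index, bits 8..6: (bucketed) leading zeros,
// bits 5..0: number of significant bits.
static uint16_t PackChimpHeader(uint8_t ref_index, uint8_t leading_zeros, uint8_t significant_bits) {
	return (uint16_t)((ref_index & 0x7F) << 9) |
	       (uint16_t)((leading_zeros & 0x07) << 6) |
	       (uint16_t)(significant_bits & 0x3F);
}
```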
Patas
What Patas does is similar:
Combine the reference index (7 bits), the number of significant bytes (3 bits), and the trailing zeros (6 bits) into a single 2-byte integer.
Then store the significant bytes.
By writing the significant bytes in a byte-aligned way, reading is much faster, because we never have to deal with bit-level offsets.
Also, because every value carries this "packed" 16 bits of metadata, there is no need for separate indices into every array of data we might read, which is the case in Chimp.
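To make the byte-aligned layout concrete, here is a simplified single-value Patas-style encoder (the field order, and the handling of edge cases like a zero XOR result or 8 significant bytes, are assumptions of this sketch):

```cpp
#include <cstdint>
#include <cstddef>

// Encode one XOR result: a 2-byte header (7-bit reference index,
// 3-bit significant-byte count, 6-bit trailing-zero count) followed by
// the significant bytes themselves, written byte-aligned.
static size_t PatasCompressOne(uint64_t xor_result, uint8_t ref_index, uint8_t *out) {
	// Count trailing zeros (this sketch ignores the xor_result == 0 edge case)
	uint8_t trailing = (uint8_t)__builtin_ctzll(xor_result);
	uint64_t significant = xor_result >> trailing;
	// Number of bytes needed for the remaining significant part
	uint8_t byte_count = 0;
	for (uint64_t tmp = significant; tmp; tmp >>= 8) {
		byte_count++;
	}
	// Pack the metadata into a single 2-byte header
	uint16_t header = (uint16_t)((ref_index & 0x7F) << 9) |
	                  (uint16_t)((byte_count & 0x07) << 6) |
	                  (uint16_t)(trailing & 0x3F);
	out[0] = (uint8_t)(header >> 8);
	out[1] = (uint8_t)(header & 0xFF);
	// Byte-aligned payload: no bit-level offsets needed when reading back
	for (uint8_t i = 0; i < byte_count; i++) {
		out[2 + i] = (uint8_t)(significant >> (8 * i));
	}
	return 2 + byte_count;
}
```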