-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Functions for rolling hashes #81183
Copy link
Copy link
Open
Labels
Description
Company or project name
ClickHouse
Use case
Segment a string into substrings in a way that this segmentation is robust (mostly unchanged) with respect to appending and prepending fragments to the string.
A classic method, known as "content-defined chunking", "binary diff", and "rsync algorithm".
It is needed for deduplication in databases with binary data, such as executable files, code repositories, etc.
Describe the solution you'd like
Add functions:
-- returns an array of substrings, covering the string
-- the `reverse_probability` is an integer, e.g., if it is equal to 1000, then the chunks will be formed every 1000 bytes on average;
contentDefinedChunks(str, window_size, reverse_probability)
-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunksUTF8(str, window_size, reverse_probability)
-- returns an array of offsets of splitting points in the string instead of substrings
contentDefinedChunkOffsets(str, window_size, reverse_probability)
-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunkOffsetsUTF8(str, window_size, reverse_probability)
It's unspecified which rolling hash we will use, but the choice should be reasonable.
Describe alternatives you've considered
No response
Additional context
No response
Reactions are currently unavailable