Skip to content

Functions for rolling hashes #81183

@alexey-milovidov

Description

@alexey-milovidov

Company or project name

ClickHouse

Use case

Segment a string into substrings in a way that this segmentation is robust (mostly unchanged) with respect to appending and prepending fragments to the string.

A classic method, known as "content-defined chunking", "binary diff", and "rsync algorithm".

It is needed for deduplication in databases with binary data, such as executable files, code repositories, etc.

Describe the solution you'd like

Add functions:

-- returns an array of substrings, covering the string
-- the `reverse_probability` is an integer, e.g., if it is equal to 1000, then the chunks will be formed every 1000 bytes on average;
contentDefinedChunks(str, window_size, reverse_probability)

-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunksUTF8(str, window_size, reverse_probability)

-- returns an array of offsets of splitting points in the string instead of substrings
contentDefinedChunkOffsets(str, window_size, reverse_probability)

-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunkOffsetsUTF8(str, window_size, reverse_probability)

It's unspecified which rolling hash we will use, but the choice should be reasonable.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions