Functions for rolling hashes

### Company or project name

ClickHouse

### Use case

Segment a string into substrings in a way that this segmentation is robust (mostly unchanged) with respect to appending and prepending fragments to the string.

A classic method, known as "content-defined chunking", "binary diff", and "rsync algorithm".

It is needed for deduplication in databases with binary data, such as executable files, code repositories, etc.

### Describe the solution you'd like

Add functions:

```
-- returns an array of substrings, covering the string
-- the `reverse_probability` is an integer, e.g., if it is equal to 1000, then the chunks will be formed every 1000 bytes on average;
contentDefinedChunks(str, window_size, reverse_probability)

-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunksUTF8(str, window_size, reverse_probability)

-- returns an array of offsets of splitting points in the string instead of substrings
contentDefinedChunkOffsets(str, window_size, reverse_probability)

-- same, but steps over code points, so that the resulting chunks contain whole code points
contentDefinedChunkOffsetsUTF8(str, window_size, reverse_probability)
```

It's unspecified which rolling hash we will use, but the choice should be reasonable.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Functions for rolling hashes #81183

Company or project name

Use case

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Functions for rolling hashes #81183

Description

Company or project name

Use case

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions