Bit-sliced format for vectors #77088
Description
Company or project name
ClickHouse
Use case
Adjusting the precision and speed of vector search at the query level by reading quantized data, stored in a format where reading different subsets of the stored bytes yields different levels of quantization.
This will speed up vector search without the need to build an index and without the added complexity of data updates.
Describe the solution you'd like
Let's suppose you have a 1024-dimensional vector of BFloat16 values.
Take the most significant bit from every dimension of the vector: that gives 1024 bits, or 128 bytes, per row; store these 128 bytes for every row in one stream. Take the second most significant bit from every dimension and store it in a second stream, and so on.
At the query level, you can choose the desired quantization as a number of bits via a function, which can be optimized to read only the corresponding subcolumns. The function reconstructs approximate floating-point values from the available bits.
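The layout and reconstruction described above can be sketched as follows. This is a minimal illustration in Python, not ClickHouse's actual stream format; the function names and the fixed 16-plane layout are assumptions for the example:

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Truncate a float to BFloat16: keep the top 16 bits of its float32 form."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16

def from_bf16_bits(b: int) -> float:
    """Expand 16 BFloat16 bits back to a float32 value (low bits zeroed)."""
    (f,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return f

def bit_slice(vector):
    """Split a vector of BFloat16 values into 16 bit-planes ("streams").
    streams[0] holds the most significant bit of every dimension,
    streams[1] the second most significant bit, and so on."""
    bits = [to_bf16_bits(x) for x in vector]
    return [[(b >> (15 - plane)) & 1 for b in bits] for plane in range(16)]

def reconstruct(streams, k):
    """Rebuild approximate values from the first k bit-planes only;
    the missing low-order bits are taken as zero."""
    dims = len(streams[0])
    out = []
    for d in range(dims):
        b = 0
        for plane in range(k):
            b |= streams[plane][d] << (15 - plane)
        out.append(from_bf16_bits(b))
    return out
```

With all 16 planes, values that are exactly representable in BFloat16 round-trip losslessly; reading only the first 9 planes (sign plus the 8 exponent bits) collapses every value to a signed power of two, which is the precision/speed trade-off the proposal exposes per query.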
This bit-slicing can be performed as-is: the first bit of BFloat16 designates the sign, followed by 8 exponent bits and 7 mantissa bits. The downside of this approach is that the leading exponent bits on their own are often not usable, so the information loss is higher. At the same time, the streams of "low-information" bits will compress extremely well, so performance when reading more bits will still be good.
Alternatively, bit-slicing can be performed with estimated quantiles. For example, for every block of data we store 15 (= 16 - 1) quantiles in a dictionary stream (similar to the LowCardinality and JSON dictionaries); the first 4 bits then designate the corresponding inter-quantile range, and the subsequent bits split that range uniformly.
Describe alternatives you've considered
No response
Additional context
This isn't an indexed search; its cost is still O(number of records).