Thanks! Please take a look at the bugs found by fuzzers.
Workflow [PR], commit [174dcc9] Summary: ❌
#include <Columns/ColumnConst.h>
#include <Columns/ColumnFixedString.h>
#include <DataTypes/DataTypeFixedString.h>
#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/DataTypesNumber.h>
#include <Functions/FunctionFactory.h>
#include <Functions/FunctionHelpers.h>
#include <Functions/IFunction.h>
#include <base/types.h>
Not all includes are actually used. Removing the unnecessary ones (in all committed .cpp/.h files) would improve readability and build times, and would prevent reliance on transitive includes.
REGISTER_FUNCTION(Quantize16Bit)
{
    FunctionDocumentation::Description description = " ";
    FunctionDocumentation::Syntax syntax = " ";
    FunctionDocumentation::Arguments argument = {{" ", " "}};
    FunctionDocumentation::ReturnedValue returned_value = {" "};
    FunctionDocumentation::Examples examples = {{" ", " ", " "}};
    FunctionDocumentation::IntroducedIn introduced_in = {25, 10};
    FunctionDocumentation::Category categories = FunctionDocumentation::Category::Unknown;
    FunctionDocumentation documentation = {description, syntax, argument, returned_value, examples, introduced_in, categories};
    factory.registerFunction<FunctionQuantize16Bit>(documentation);
}
I added documentation templates in all files where they were needed. Please fill them in so that users can understand how to use this feature.
#include <cstdint>

namespace DB
It is difficult to understand what is happening without any context. To make it easier, add a comment at the top of each .h file explaining what the file is for. Refer to any of these for examples:
String getName() const override { return name; }
size_t getNumberOfArguments() const override { return 1; }
bool isInjective(const ColumnsWithTypeAndName &) const override { return false; }
It's false by default, so we don't need to override it.
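To illustrate the point about redundant overrides, here is a minimal, self-contained sketch. The types `FunctionBase`, `QuantizeSketch`, and `InjectiveSketch` are hypothetical stand-ins invented for this example; they are not ClickHouse's actual `IFunction` interface, and the real signature takes a `ColumnsWithTypeAndName` argument as shown in the diff above.

```cpp
#include <string>

// Hypothetical stand-in for a function interface (NOT ClickHouse's IFunction).
struct FunctionBase
{
    virtual ~FunctionBase() = default;
    virtual std::string getName() const = 0;

    // The base class already answers "not injective" for everyone.
    virtual bool isInjective() const { return false; }
};

// A derived function that is not injective simply inherits the default;
// repeating `return false` in an override would add noise without changing behavior.
struct QuantizeSketch final : FunctionBase
{
    std::string getName() const override { return "quantizeSketch"; }
};

// Only a function that differs from the default needs the override.
struct InjectiveSketch final : FunctionBase
{
    std::string getName() const override { return "injectiveSketch"; }
    bool isInjective() const override { return true; }
};
```

With this layout, `QuantizeSketch{}.isInjective()` returns `false` without any extra code in the derived class.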
<fill_query>

ALTER TABLE test.vectors
UPDATE vector_quantized = quantize8Bit(vector, 2048)
ADD COLUMN vector_quantized FixedString(4096);

</fill_query>

<fill_query>

ALTER TABLE test.vectors
UPDATE vector_quantized = quantize16Bit(vector, 2048)
Why do we need FixedString(4096) if we then use only 2048 bytes?
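The arithmetic behind this question can be sketched as follows. The helper `quantizedByteSize` is hypothetical, and it assumes the quantized values are tightly packed with no header or padding bytes; the thread does not state the actual layout these functions produce, so treat this as an illustration only.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to store `dimensions` values at
// `bits_per_element` bits each, ASSUMING tight packing with no header or padding.
constexpr std::size_t quantizedByteSize(std::size_t dimensions, std::size_t bits_per_element)
{
    // Round up to a whole number of bytes.
    return (dimensions * bits_per_element + 7) / 8;
}

// Under that assumption, 2048 dimensions at 8 bits fit in 2048 bytes...
static_assert(quantizedByteSize(2048, 8) == 2048);
// ...while 2048 dimensions at 16 bits need 4096 bytes.
static_assert(quantizedByteSize(2048, 16) == 4096);
```

Under this assumption, a `FixedString(4096)` column would match the 16-bit variant, while the 8-bit variant would fill only half of it.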
    return std::make_shared<DataTypeFloat32>();
}

ColumnPtr executeImpl(const ColumnsWithTypeAndName & arguments, const DataTypePtr & result_type, size_t input_rows_count) const override
:) CREATE TABLE sfp8 (`id` String, `quantized` FixedString(384)) ENGINE = MergeTree ORDER BY id;
:) INSERT INTO sfp8 SELECT id, quantizeSFP8Bit(vector, 384) FROM hackernews;
Code: 49. DB::Exception: Block structure mismatch in function connect between ApplySquashingTransform and ConvertingTransform stream: different columns:
quantized FixedString(384) FixedString(size = 0)
quantized FixedString(384) FixedString(size = 0). (LOGICAL_ERROR)
But with an extra byte it works:
:) CREATE TABLE sfp8 (`id` String, `quantized` FixedString(385)) ENGINE = MergeTree ORDER BY id;
Same story with quantizeMini8Bit.
You can use the small version of hackernews so that you don't have to download the huge one
Closing for now due to missing documentation and a bug, but hope someone can build on this work later. @nikita4109 maybe you will finish it one day :)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Functions for quantizations
These additions enable more efficient storage and searching of vector embeddings, which are essential for semantic search, recommendation systems, and other machine learning applications.
Documentation entry for user-facing changes
Performance Benchmark Results
Random vectors
Hacker News comments
Quality Comparison
Experimental Setup
Methods Tested
Quantization Methods:
Distance Metrics: