ESQL - dense vector support cosine normalization#132721
ESQL - dense vector support cosine normalization#132721carlosdelest merged 17 commits intoelastic:mainfrom
Conversation
| private final DenseVectorFieldMapper.DenseVectorFieldType fieldType; | ||
|
|
||
| public DenseVectorBlockLoader(String fieldName, int dimensions) { | ||
| public DenseVectorBlockLoader(String fieldName, int dimensions, DenseVectorFieldMapper.DenseVectorFieldType fieldType) { |
There was a problem hiding this comment.
Passing the DenseVectorFieldType looked better than including other specific attributes needed for normalization (fieldType.name(), IndexVersion). I opted for creating a isNormalized() method in the DenseVectorFieldType instead.
| private final FloatVectorValues floatVectorValues; | ||
| private final KnnVectorValues.DocIndexIterator iterator; | ||
| private final int dimensions; | ||
| private abstract static class DenseVectorValuesBlockReader<T extends KnnVectorValues> extends BlockDocValuesReader { |
There was a problem hiding this comment.
Using an abstract class is previous work for supporting byte element type.
| assert floats.length == dimensions | ||
| : "unexpected dimensions for vector value; expected " + dimensions + " but got " + floats.length; | ||
|
|
||
| assert magnitudeDocValues.advanceExact(iterator.docID()); |
There was a problem hiding this comment.
I tried to use DenormalizedCosineFloatVectorValues instead, but that class depends on the ordinal and not the docId() for retrieving the magnitude.
Using the docId() to retrieve the magnitude works.
| public CacheHelper getCoreCacheHelper() { | ||
| return r.getCoreCacheHelper(); | ||
| } | ||
| denseVectorFieldType.isNormalized() && denseVectorFieldType.indexed ? r -> new FilterLeafReader(r) { |
There was a problem hiding this comment.
Used the isNormalized() method to improve readability - that messed up some indentation
| for (String indexType : DENSE_VECTOR_INDEX_TYPES) { | ||
| params.add(new Object[] { indexType, true, false }); | ||
| } | ||
| Supplier<ElementType> elementTypeProvider = () -> ElementType.FLOAT; |
There was a problem hiding this comment.
Testing for all combinations took > 1 min. I decided to rely on randomization for testing the different combinations.
…r-support-normalization' into non-issue/esql-dense-vector-support-normalization # Conflicts: # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
|
@nik9000 I'd appreciate your review here, as the doc values reading was tricky. I'd like to ensure I am doing the right thing! |
…r-support-normalization' into non-issue/esql-dense-vector-support-normalization
kderusso
left a comment
There was a problem hiding this comment.
Overall changes look good, but I'll defer to someone more expert in ES|QL and KNN.
|
|
||
| @Override | ||
| protected void appendDoc(BlockLoader.FloatBuilder builder) throws IOException { | ||
| float[] floats = vectorValues.vectorValue(iterator.index()); |
There was a problem hiding this comment.
Could we refactor checking the dimension count out, since it's shared between both block readers?
There was a problem hiding this comment.
Yes, there's a dimension() common method for KnnVectorValues that I wasn't aware of. Done in e98bbcb
...n/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
Show resolved
Hide resolved
| public class DenseVectorFieldTypeIT extends AbstractEsqlIntegTestCase { | ||
|
|
||
| private static final Set<String> DENSE_VECTOR_INDEX_TYPES = Set.of( | ||
| public static final Set<String> ALL_DENSE_VECTOR_INDEX_TYPES = Set.of( |
There was a problem hiding this comment.
Would it make sense to use VectorIndexType.values() here to create the set, rather than hard coding the specific index types?
| private final String indexType; | ||
|
|
||
| @ParametersFactory | ||
| public static Iterable<Object[]> parameters() throws Exception { |
There was a problem hiding this comment.
You could simplify this a bit, with something like this (untested):
public static Iterable<Object[]> parameters() throws Exception {
return ALL_DENSE_VECTOR_INDEX_TYPES.stream()
.filter(indexType -> indexType.contains("flat") == false)
.map(indexType -> new Object[] { DenseVectorFieldMapper.ElementType.FLOAT, indexType })
.collect(Collectors.toList());
}
There was a problem hiding this comment.
I find good ol' loops better in terms of clarity here. Also, soon enough we'll need to add the bytes as well.
…vector-support-normalization
…vector-support-normalization # Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/BlockDocValuesReader.java # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/plugin/KnnFunctionIT.java
* Take into account normalization for dense vector support * Fix cherry pick * [CI] Auto commit changes from spotless * Remove debugging code * Check that we may not have magnitudes at all, or for normalized vectors * Better parameterized test * Refactor dimension check * Refactor index names * Remove comment * Fix test --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
* upstream/main: (278 commits) ESQL - dense vector support cosine normalization (elastic#132721) [ML] Add support for dimensions in google vertex ai request (elastic#132689) ESQL - Add byte element support for dense_vector data type (elastic#131863) ESQL: Fix async operator warnings not always sent when blocking (elastic#132744) Method not needed anymore (elastic#132912) [Test] Excercise shutdown more reliably in snapshot stress IT (elastic#132909) Update Gradle shadow plugin to 9.0.1 (elastic#132637) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/410_named_queries/named_queries_with_score} elastic#132906 Update docker.elastic.co/wolfi/chainguard-base-fips:latest Docker digest to fa6cb69 (elastic#132735) Remove unnecessary calls to fold() (elastic#131870) Use consistent terminology for transport version resources/references (elastic#132882) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search.vectors/40_knn_search_cosine/kNN search only regular query} elastic#132890 Finalize release notes for v9.1.2 release (elastic#132745) Finalize release notes for v9.0.5 release (elastic#132718) Move inner records out of TransportVersionUtils (elastic#132872) Add support for Lookup Join on Multiple Fields (elastic#131559) Bootstrap PR-based benchmarks (elastic#132717) Refactor MetadataIndexTemplateService to use template maps instead of project metadata (elastic#132662) [Gradle] Update nebula ospackage plugin to 12.1.0 (elastic#132640) Mute org.elasticsearch.xpack.esql.CsvTests test {csv-spec:ip.CdirMatchEqualsInsOrs} elastic#132860 ...
ESQL did not support cosine normalization. Vectors were retrieved directly from doc values without applying denormalization.
This PR fixes that, including test cases for the different vectors similarities.