ESQL: dense_vector cosine similarity function#130641
carlosdelest merged 35 commits into elastic:main
| FloatBlock leftBlock = (FloatBlock) left.get(context).eval(page); | ||
| FloatBlock rightBlock = (FloatBlock) right.get(context).eval(page) | ||
| ) { | ||
| int positionCount = page.getPositionCount(); |
@ChrisHegarty I'm wondering if this is the right way to provide an evaluation for dense_vector-based operations. Besides vector similarity functions, we will create vector operations (add, subtract, dot product, etc.).
Do you think we should create the necessary infrastructure for template-based evaluators, or should this ad-hoc evaluation be enough?
Is there anything we should be careful about when doing ad-hoc evaluation for vectorization purposes?
I think that this is fine. The vectorization that we are looking for here is in the comparison operation itself, that is, when comparing float[] arrays. Ultimately, though, we would want to be able to compare against mmap'ed off-heap data, but that is completely separate and can come later, since it would require a block backed by a memory segment. We had a similar(ish), though different, situation with big array blocks. Would need to re-check the details.
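For illustration, a minimal sketch of what an ad-hoc, per-position evaluation amounts to. It is not the PR's evaluator: plain `float[][]` rows stand in for the `FloatBlock` positions, and `CosineEvalSketch`, `cosine` and `evalPositions` are hypothetical names. The sketch returns the raw cosine. Lucene's `VectorSimilarityFunction.COSINE`, which the PR imports, rescales the raw cosine to `(1 + cos) / 2`, so if the function delegates to it (this thread does not show that), its output would differ from the value computed here.

```java
// Hypothetical sketch: float[][] rows stand in for the FloatBlock positions used in the PR.
final class CosineEvalSketch {

    // Raw cosine: dot(a, b) / (|a| * |b|). A zero-length vector yields NaN here.
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
        }
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    // One result per position; a missing vector on either side yields a null result.
    static Float[] evalPositions(float[][] left, float[][] right) {
        Float[] out = new Float[left.length];
        for (int p = 0; p < left.length; p++) {
            if (left[p] == null || right[p] == null) {
                out[p] = null;
            } else {
                out[p] = cosine(left[p], right[p]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] left = { { 1f, 2f, 3f }, null };
        float[][] right = { { 0f, 1f, 2f }, { 1f, 0f, 0f } };
        Float[] result = evalPositions(left, right);
        // Raw cosine is roughly 0.956 for the first row; the second row is null.
        System.out.println(result[0] + ", " + result[1]);
    }
}
```

The per-pair loop is the part that a vectorized implementation would replace; the position loop around it stays the same.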
| if (f instanceof In in) { | ||
| return processIn(in); | ||
| } | ||
| if (f instanceof VectorFunction) { |
Needed to change the order to ensure VectorFunction instances are processed first, as similarity functions are scalar functions as well.
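A small sketch of why the order of the `instanceof` checks matters when a type matches more than one branch. All types below (`ScalarFn`, `VectorFn`, `SimilarityFn`, `DispatchOrderSketch`) are hypothetical stand-ins, not the classes touched by the PR.

```java
// Hypothetical marker types that only mirror the relationship described in the comment.
interface ScalarFn {}

interface VectorFn {}

final class PlainScalar implements ScalarFn {}

// A similarity function is both a scalar function and a vector function.
final class SimilarityFn implements ScalarFn, VectorFn {}

final class DispatchOrderSketch {

    // More specific check first: similarity functions reach the vector branch.
    static String process(Object f) {
        if (f instanceof VectorFn) {
            return "vector";
        }
        if (f instanceof ScalarFn) {
            return "scalar";
        }
        return "other";
    }

    // Scalar check first: the vector branch is never reached for similarity functions.
    static String processScalarFirst(Object f) {
        if (f instanceof ScalarFn) {
            return "scalar";
        }
        if (f instanceof VectorFn) {
            return "vector";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(process(new SimilarityFn()));
        System.out.println(processScalarFirst(new SimilarityFn()));
    }
}
```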
| required_capability: cosine_vector_similarity_function | ||
|
|
||
| row vector = [1, 2, 3] | ||
| | eval similarity = round(v_cosine(vector, [0, 1, 2]), 3) |
For this to work properly, we need to implement a conversion function so we can convert non-foldable values to dense_vector.
Pinging @elastic/es-analytical-engine (Team:Analytics)
| /** | ||
| * Defines the named writables for vector functions in ESQL. | ||
| */ | ||
| public final class VectorWritables { |
Not sure if we need this utility class just yet, but I'll assume you have plans to add more :)
Haha yeah, it's a bit premature, but we will be adding a number of vector similarity functions soon enough, and I wanted to provide a single place where they are easy to find.
| } | ||
| var wrapper = BlockUtils.wrapperFor(blockFactory, ElementType.fromJava(multiValue.get(0).getClass()), positions); | ||
| // dense_vector create internally float values, even if they are specified as doubles | ||
| ElementType elementType = lit.dataType() == DataType.DENSE_VECTOR |
Should this logic be in its own method?
I'd say no, as this is a one-liner for getting the correct ElementType: there's no more logic than a specific check for dense_vector. If more special cases come into play, let's extract it then, as the inline check will become confusing.
|
|
||
| import static org.apache.lucene.index.VectorSimilarityFunction.COSINE; | ||
|
|
||
| public class CosineSimilarity extends VectorSimilarityFunction { |
Do you need to subclass different types of functions here? Why not just have an enum which specifies the type in VectorSimilarityFunction?
Good point - I think subclassing aligns better with the current way ESQL functions work. I'm also not sure that docs generation works with enums as of now.
Happy to review this when adding more functions though!
…ch-functions-basics' into non-issue/esql-vector-search-functions-basics
…-search-functions-basics
ioanatia left a comment
Are we missing the docs that will be generated for the v_cosine function?
Otherwise LGTM.
…milarityFunction to extend BinaryScalarFunction
| /** | ||
| * Base class for vector similarity functions, which compute a similarity score between two dense vectors | ||
| */ | ||
| public abstract class VectorSimilarityFunction extends BinaryScalarFunction implements EvaluatorMapper, VectorFunction { |
Now VectorSimilarityFunction extends BinaryScalarFunction. That brings some simplifications to the code as we already have two params.
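A rough sketch of the shape this gives: a two-argument base class owns the operands and the shared checks, and a concrete function only supplies the pairwise computation. The classes below are heavily simplified, hypothetical stand-ins (`BinaryVectorFnSketch`, `DotProductSketch`), not the PR's `BinaryScalarFunction` or `VectorSimilarityFunction`, and the dot product is used only as an example of a possible subclass.

```java
// Hypothetical, simplified stand-in for a two-argument vector function base class.
abstract class BinaryVectorFnSketch {
    private final float[] left;
    private final float[] right;

    BinaryVectorFnSketch(float[] left, float[] right) {
        this.left = left;
        this.right = right;
    }

    // Subclasses only provide the pairwise computation.
    protected abstract float similarity(float[] a, float[] b);

    // Shared validation lives in the base class.
    final float evaluate() {
        if (left.length != right.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + left.length + " != " + right.length);
        }
        return similarity(left, right);
    }
}

// Example subclass, assumed for illustration only.
final class DotProductSketch extends BinaryVectorFnSketch {

    DotProductSketch(float[] left, float[] right) {
        super(left, right);
    }

    @Override
    protected float similarity(float[] a, float[] b) {
        float dot = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }

    public static void main(String[] args) {
        float[] left = { 1f, 2f, 3f };
        float[] right = { 0f, 1f, 2f };
        System.out.println(new DotProductSketch(left, right).evaluate());
    }
}
```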
| public void testDenseVectorImplicitCastingSimilarityFunctions() { | ||
| if (EsqlCapabilities.Cap.COSINE_VECTOR_SIMILARITY_FUNCTION.isEnabled()) { | ||
| checkDenseVectorImplicitCastingSimilarityFunction("v_cosine(vector, [0.342, 0.164, 0.234])", List.of(0.342f, 0.164f, 0.234f)); | ||
| checkDenseVectorImplicitCastingSimilarityFunction("v_cosine(vector, [1, 2, 3])", List.of(1f, 2f, 3f)); |
Checks that casting is applied to non-float values and that a float Literal is created.
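As a sketch of the conversion this test exercises, the snippet below turns a list of integer or double literals into floats. `DenseVectorCastSketch` and `toFloats` are hypothetical names; the PR's actual casting code is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: converts numeric literal values to the float values a dense_vector holds.
final class DenseVectorCastSketch {

    static List<Float> toFloats(List<? extends Number> values) {
        List<Float> out = new ArrayList<>(values.size());
        for (Number n : values) {
            // Integers widen to float; doubles narrow to the nearest float.
            out.add(n.floatValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toFloats(List.of(1, 2, 3)));
        System.out.println(toFloats(List.of(0.5, 0.25)));
    }
}
```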
| import static org.elasticsearch.xpack.esql.core.type.DataType.DOUBLE; | ||
| import static org.hamcrest.Matchers.equalTo; | ||
|
|
||
| public abstract class AbstractVectorSimilarityFunctionTestCase extends AbstractScalarFunctionTestCase { |
New test case added that extends AbstractScalarFunctionTestCase. This brings quite a few tests like checking what happens with null values, evaluator type checks, etc.
| import java.util.function.Supplier; | ||
|
|
||
| @FunctionName("v_cosine") | ||
| public class CosineSimilarityTests extends AbstractVectorSimilarityFunctionTestCase { |
Test cases for new functions should be simple, since all the heavy lifting is done in the abstract class.
@ioanatia 🤦 yes we were. There were no |
tracked in #130828
Implements CosineSimilarityFunction for ES|QL, and adds basic infrastructure for other vector similarity functions.
Adds a base class, VectorSimilarityFunction, that provides the building block for vector similarity functions.
There are pending validations that should be done for the function parameters:
We can work on these validations as follow-ups, as they may depend on the field_caps API returning that information.