Skip to content

[PoC] Push vector similarity functions using BlockDocValuesReader#136739

Closed
carlosdelest wants to merge 11 commits intoelastic:mainfrom
carlosdelest:non-issue/esql-push-vector-similarity-blockloaders
Closed

[PoC] Push vector similarity functions using BlockDocValuesReader#136739
carlosdelest wants to merge 11 commits intoelastic:mainfrom
carlosdelest:non-issue/esql-push-vector-similarity-blockloaders

Conversation

@carlosdelest
Copy link
Copy Markdown
Member

Pushes down vector similarity functions, and applies them when loading vectors via BlockDocValuesReader.

This avoids the overhead of loading vectors into blocks, and calculate the functions afterwards in the compute engine. This directly calculates the similarity functions when loading the field, and loads the results directly into a block.

This approach consists on:

  • A new FieldExtractPreference: FUNCTION. This signals that the field must be extracted using a function.
  • A BlockLoaderValueFunction interface, that is available on the BlockLoaderContext for fields that are extracted via FUNCTION. This interface defines both the Builder that will be used as a result of the transformation, and the transformation itself.
  • This function is applied via a DenseVectorValueBlockLoader, that is used when a FUNCTION is used to extract a dense vector. It applies the transformation and loads the results.
  • A FieldFunctionAttribute that is used to extract a FieldAttribute with a function.
  • A VectorSimilarityFunctionsPushdown local physical optimization rule, that replaces a vector similarity function applied to a field with a FieldFunctionAttribute that will be used to extract the function result at field loading.

Work still pending:

  • VectorSimilarityFunctionsPushdown needs to take into account that the field may still need to be extracted as it may be a reference in other places of the query
  • Replace the current dense vector loading so it reuses the DenseVectorValueBlockLoader - vector loading could be done using a BlockLoaderValueFunction that just adds the vectors to the block. This will reduce code duplication and use a single code path for loading a field.
  • Testing

@carlosdelest carlosdelest added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL :Search Relevance/ES|QL Search functionality in ES|QL v9.3.0 labels Oct 17, 2025
@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Oct 21, 2025

A new FieldExtractPreference: FUNCTION. This signals that the field must be extracted using a function.

I'd prefer if we didn't modify the preference for loading. I'd prefer to just attach the function to the configuration and respect it if it is there. And, I suppose, we should assert that the loaded thinks it is loading for the passed function.

float calculateSimilarity(float[] v1, float[] v2);

float calculateSimilarity(byte[] v1, byte[] v2);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watch out for these! They'll be super fast if you only have one subclass. Or two. But as soon as you have three it'll start getting slower. Fine for now, but might be a big deal once you start really getting faster.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We have more than 3 similarity functions so I don't think we'll be able to avoid going megamorphic here. Is there any alternative that you can think of?


@Nullable
default BlockLoaderValueFunction<?, ?> blockLoaderValueFunction() {
throw new UnsupportedOperationException("blockLoaderValueFunction is not supported");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be harder to make mistake with this if we didn't default implement it. Though I'd be happy with a "for testing" version of this thing that throws from every method so we don't have to change every single test in follow-up PRs.

Also! Can this return null if there's no function to apply?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be harder to make mistake with this if we didn't default implement it.

Yep, my thinking was to reduce the blast radius for this at least at PoC level. I can remove the default when switching to ready for review 👍

Also! Can this return null if there's no function to apply?

I will do that, as I was going to signal this via the FUNCTION extract preference - but I understand your concerns there.

/**
* Applies the function to the value passed as parameter, and adds it to the builder that was returned by {@link #builder(BlockLoader.BlockFactory, int)}.
*/
void applyFunctionAndAdd(T value, B builder) throws IOException;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we wouldn't actually have any code in this. Or maybe just a:

String name();

Object config();

And the block loaders would detect it and specialize. I think the point of this is that all block loaders would have hand-optimized code for each case. That's what we're looking to do for the vector cases at least.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're thinking in directly providing the block loader at this level, that handles both retrieving the values and transforming them. I see your point.

I was thinking that there will be repetitive stuff in terms of retrieving the original values, and that it would be better to add the transformation on top of that - but we can surely do that via subclassing a base class.

I'll try this approach, I was not happy with how this interface looked anyway. Thanks!

* the value in a way that is more efficient than loading the raw field value
* and applying the function to it.
*/
public class FieldFunctionAttribute extends FieldAttribute {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't tend to make new ones of these so I'm never sure what's right. We'll need @alex-spies to comment on this one.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is basically to tell how to extract fields when we want to transform them, and it seemed natural to use a new field that can provide a block loader.

I'm currently iterating on this idea. It will keep the FieldFunctionAttribute class, but will include a Function reference on it. It's the Function the one that can implement the block loader providing method.

case DOC_VALUES -> throw new EsqlIllegalArgumentException("Illegal field extract preference: " + fieldExtractPreference);
case DOC_VALUES, FUNCTION -> throw new EsqlIllegalArgumentException(
"Illegal field extract preference: " + fieldExtractPreference
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping we could get this "for free" by checking on the return BlockLoader. I was thinking it could return the function that it's applying - which would be null for all existing functions and the new one could be your functions. And then we could assert in the test and the prod loader code that they were equal.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there's a bunch of field types. Lots of implementations - and doing it in the caller makes sure any failure to take into account the requested field to load comes back as a bug.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see your point. We could separate as well the where we get the field data to the how we transform it. They're orthogonal concepts, I will change this. Thanks for the feedback, makes sense!

private EvalExec replaceSimilarityFunctionsForFieldTransformations(EvalExec eval) {
Map<Attribute, Attribute> replacements = new HashMap<>();
EvalExec resultEval = (EvalExec) eval.transformExpressionsDown(VectorSimilarityFunction.class, similarityFunction -> {
if (similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we always rotate the literal to the right hand side - though maybe that's not done yet?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will double check, thanks!

@SuppressWarnings("unchecked")
private EvalExec replaceSimilarityFunctionsForFieldTransformations(EvalExec eval) {
Map<Attribute, Attribute> replacements = new HashMap<>();
EvalExec resultEval = (EvalExec) eval.transformExpressionsDown(VectorSimilarityFunction.class, similarityFunction -> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to make this generic over other function. But I can grab that after you've landed this one. Or Alex or someone. This is a fine way to get it in, I think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm trying to set the path for that. I'm currently thinking of making a specific interface for providing block loaders that can be implemented by functions, so I think that would be pretty straightforward.

@carlosdelest
Copy link
Copy Markdown
Member Author

Closing in favor of #137002

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue :Search Relevance/ES|QL Search functionality in ES|QL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants