[PoC] Push vector similarity functions using BlockDocValuesReader#136739
[PoC] Push vector similarity functions using BlockDocValuesReader#136739carlosdelest wants to merge 11 commits intoelastic:mainfrom
Conversation
…ectorFieldMapper and BlockDocValuesReader
I'd prefer if we didn't modify the preference for loading. I'd prefer to just attach the function to the configuration and respect it if it is there. And, I suppose, we should assert that the loaded thinks it is loading for the passed function. |
| float calculateSimilarity(float[] v1, float[] v2); | ||
|
|
||
| float calculateSimilarity(byte[] v1, byte[] v2); | ||
| } |
There was a problem hiding this comment.
Watch out for these! They'll be super fast if you only have one subclass. Or two. But as soon as you have three it'll start getting slower. Fine for now, but might be a big deal once you start really getting faster.
There was a problem hiding this comment.
I see. We have more than 3 similarity functions so I don't think we'll be able to avoid going megamorphic here. Is there any alternative that you can think of?
|
|
||
| @Nullable | ||
| default BlockLoaderValueFunction<?, ?> blockLoaderValueFunction() { | ||
| throw new UnsupportedOperationException("blockLoaderValueFunction is not supported"); |
There was a problem hiding this comment.
I think it'd be harder to make mistake with this if we didn't default implement it. Though I'd be happy with a "for testing" version of this thing that throws from every method so we don't have to change every single test in follow-up PRs.
Also! Can this return null if there's no function to apply?
There was a problem hiding this comment.
I think it'd be harder to make mistake with this if we didn't default implement it.
Yep, my thinking was to reduce the blast radius for this at least at PoC level. I can remove the default when switching to ready for review 👍
Also! Can this return null if there's no function to apply?
I will do that, as I was going to signal this via the FUNCTION extract preference - but I understand your concerns there.
| /** | ||
| * Applies the function to the value passed as parameter, and adds it to the builder that was returned by {@link #builder(BlockLoader.BlockFactory, int)}. | ||
| */ | ||
| void applyFunctionAndAdd(T value, B builder) throws IOException; |
There was a problem hiding this comment.
I was thinking we wouldn't actually have any code in this. Or maybe just a:
String name();
Object config();
And the block loaders would detect it and specialize. I think the point of this is that all block loaders would have hand-optimized code for each case. That's what we're looking to do for the vector cases at least.
There was a problem hiding this comment.
So you're thinking in directly providing the block loader at this level, that handles both retrieving the values and transforming them. I see your point.
I was thinking that there will be repetitive stuff in terms of retrieving the original values, and that it would be better to add the transformation on top of that - but we can surely do that via subclassing a base class.
I'll try this approach, I was not happy with how this interface looked anyway. Thanks!
| * the value in a way that is more efficient than loading the raw field value | ||
| * and applying the function to it. | ||
| */ | ||
| public class FieldFunctionAttribute extends FieldAttribute { |
There was a problem hiding this comment.
I don't tend to make new ones of these so I'm never sure what's right. We'll need @alex-spies to comment on this one.
There was a problem hiding this comment.
The idea is basically to tell how to extract fields when we want to transform them, and it seemed natural to use a new field that can provide a block loader.
I'm currently iterating on this idea. It will keep the FieldFunctionAttribute class, but will include a Function reference on it. It's the Function the one that can implement the block loader providing method.
| case DOC_VALUES -> throw new EsqlIllegalArgumentException("Illegal field extract preference: " + fieldExtractPreference); | ||
| case DOC_VALUES, FUNCTION -> throw new EsqlIllegalArgumentException( | ||
| "Illegal field extract preference: " + fieldExtractPreference | ||
| ); |
There was a problem hiding this comment.
I was hoping we could get this "for free" by checking on the return BlockLoader. I was thinking it could return the function that it's applying - which would be null for all existing functions and the new one could be your functions. And then we could assert in the test and the prod loader code that they were equal.
There was a problem hiding this comment.
Because there's a bunch of field types. Lots of implementations - and doing it in the caller makes sure any failure to take into account the requested field to load comes back as a bug.
There was a problem hiding this comment.
Yes, I see your point. We could separate as well the where we get the field data to the how we transform it. They're orthogonal concepts, I will change this. Thanks for the feedback, makes sense!
| private EvalExec replaceSimilarityFunctionsForFieldTransformations(EvalExec eval) { | ||
| Map<Attribute, Attribute> replacements = new HashMap<>(); | ||
| EvalExec resultEval = (EvalExec) eval.transformExpressionsDown(VectorSimilarityFunction.class, similarityFunction -> { | ||
| if (similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) { |
There was a problem hiding this comment.
I think we always rotate the literal to the right hand side - though maybe that's not done yet?
There was a problem hiding this comment.
Will double check, thanks!
| @SuppressWarnings("unchecked") | ||
| private EvalExec replaceSimilarityFunctionsForFieldTransformations(EvalExec eval) { | ||
| Map<Attribute, Attribute> replacements = new HashMap<>(); | ||
| EvalExec resultEval = (EvalExec) eval.transformExpressionsDown(VectorSimilarityFunction.class, similarityFunction -> { |
There was a problem hiding this comment.
I'd love to make this generic over other function. But I can grab that after you've landed this one. Or Alex or someone. This is a fine way to get it in, I think.
There was a problem hiding this comment.
Yes, I'm trying to set the path for that. I'm currently thinking of making a specific interface for providing block loaders that can be implemented by functions, so I think that would be pretty straightforward.
|
Closing in favor of #137002 |
Pushes down vector similarity functions, and applies them when loading vectors via
BlockDocValuesReader.This avoids the overhead of loading vectors into blocks, and calculate the functions afterwards in the compute engine. This directly calculates the similarity functions when loading the field, and loads the results directly into a block.
This approach consists on:
FieldExtractPreference:FUNCTION. This signals that the field must be extracted using a function.BlockLoaderValueFunctioninterface, that is available on theBlockLoaderContextfor fields that are extracted viaFUNCTION. This interface defines both the Builder that will be used as a result of the transformation, and the transformation itself.DenseVectorValueBlockLoader, that is used when aFUNCTIONis used to extract a dense vector. It applies the transformation and loads the results.FieldFunctionAttributethat is used to extract aFieldAttributewith a function.VectorSimilarityFunctionsPushdownlocal physical optimization rule, that replaces a vector similarity function applied to a field with aFieldFunctionAttributethat will be used to extract the function result at field loading.Work still pending:
VectorSimilarityFunctionsPushdownneeds to take into account that the field may still need to be extracted as it may be a reference in other places of the queryDenseVectorValueBlockLoader- vector loading could be done using aBlockLoaderValueFunctionthat just adds the vectors to the block. This will reduce code duplication and use a single code path for loading a field.