Adjust ToParentBlockJoin[Byte|Float]KnnVectorQuery to return highest score child doc ID by parent id #12510
Merged
benwtrent merged 3 commits into apache:main on Aug 16, 2023
jimczi (Contributor) approved these changes on Aug 16, 2023 and left a comment:
The change looks good, I agree that the naming can be confusing.
Here are some possible alternatives:
- DiversifyingKnn(Collector|VectorQuery)
- DiversifyingChildrenKnn...
- CollapsingKnn...
- CollapsingChildren...
Naming is hard.
/** kNN byte vector query that joins matching children vector documents with their parent doc id. */
/**
 * kNN byte vector query that joins matching children vector documents with their parent doc id. The
 * top documents returned are the child document ids and the calculated scores.
Contributor
Maybe add an example on how to mix with root document queries? Something like:
ToParentBlockJoinByteKnnVectorQuery childQuery = ...
Query query = new ToParentBlockJoinQuery(childQuery, parentsFilter, ..)
...
?
/**
 * kNN float vector query that joins matching children vector documents with their parent doc id.
 * The top documents returned are the child document ids and the calculated scores.
 */
benwtrent added a commit that referenced this pull request on Aug 16, 2023
…rn highest score child doc ID by parent id (#12510)

The current query returns parent IDs based on the score of the nearest child. However, it is difficult to invert that relationship, that is, to determine which child was actually the nearest during search.

So I renamed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery`, and it now returns the nearest child ID instead of just that child's parent ID. The results are still diversified by parent ID.

Now it is easy to determine the nearest child vector, because that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.

Related to: #12434
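To make "diversified by parent ID" concrete, here is a minimal sketch that uses only the Java standard library. It is not Lucene's implementation; `DiversifySketch`, `Hit`, and `bestChildPerParent` are hypothetical names, `java.util.BitSet` stands in for the parent bit set, and the sketch assumes the usual block layout in which each parent document is indexed right after its children.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Lucene. */
final class DiversifySketch {
  /** A scored child document (hypothetical helper type). */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Keeps the highest-scoring child per parent and returns at most k hits, best first. */
  static List<Hit> bestChildPerParent(List<Hit> childHits, BitSet parentBits, int k) {
    Map<Integer, Hit> bestByParent = new HashMap<>();
    for (Hit hit : childHits) {
      // Assumption: the parent is the first set bit at or after the child doc ID.
      int parent = parentBits.nextSetBit(hit.doc);
      if (parent < 0) {
        continue; // no parent follows this doc, so skip it
      }
      // Keep whichever child of this parent has the higher score.
      bestByParent.merge(parent, hit, (a, b) -> b.score > a.score ? b : a);
    }
    List<Hit> result = new ArrayList<>(bestByParent.values());
    result.sort((a, b) -> Float.compare(b.score, a.score)); // descending by score
    return new ArrayList<>(result.subList(0, Math.min(k, result.size())));
  }

  public static void main(String[] args) {
    BitSet parents = new BitSet();
    parents.set(3); // docs 0..2 are children of parent 3
    parents.set(6); // docs 4..5 are children of parent 6
    List<Hit> hits = List.of(new Hit(0, 0.5f), new Hit(2, 0.9f), new Hit(4, 0.7f));
    for (Hit h : bestChildPerParent(hits, parents, 10)) {
      System.out.println("child=" + h.doc + " score=" + h.score);
    }
  }
}
```

In this toy input, children 0 and 2 share parent 3, so only the higher-scoring child 2 survives alongside child 4. The returned IDs are child IDs, which matches the behavior the commit message describes.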
While integrating, I discovered a frustrating bug :(
The current query returns parent IDs based on the score of the nearest child. However, it is difficult to invert that relationship, that is, to determine which child was actually the nearest during search.
So, I changed the new ToParentBlockJoin[Byte|Float]KnnVectorQuery to return the nearest child ID instead of just that child's parent ID. The results are still diversified by parent ID.
Now it is easy to determine the nearest child vector, because that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.
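As a rough illustration of the parent lookup described above, the sketch below uses `java.util.BitSet` as a stand-in for the parent bit set. `ParentLookup` and `parentOf` are hypothetical names, and the sketch assumes the block layout in which each parent document is indexed immediately after its children, so the parent is the first set bit at or after the child's doc ID.

```java
import java.util.BitSet;

/** Hypothetical illustration only; java.util.BitSet stands in for the parent bit set. */
final class ParentLookup {
  /** Returns the parent doc ID for the given child doc ID, or -1 if no parent follows it. */
  static int parentOf(BitSet parentBits, int childDoc) {
    // Assumption: children precede their parent within a block, so the
    // next set bit at or after the child is that child's parent.
    return parentBits.nextSetBit(childDoc);
  }

  public static void main(String[] args) {
    BitSet parents = new BitSet();
    parents.set(3); // docs 0..2 are children of parent 3
    parents.set(6); // docs 4..5 are children of parent 6
    System.out.println(parentOf(parents, 1));
    System.out.println(parentOf(parents, 5));
  }
}
```

Passing a doc ID that is itself a parent returns that same ID, since `nextSetBit` is inclusive of its starting index.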
I realize that this might make the name weird, and I am happy to change it. All the "join" names are confusing to me already.
Since this iterates on an unreleased query and is related to #12434, I am not adding a changelog entry.