Adjust ToParentBlockJoin[Byte|Float]KnnVectorQuery to return highest score child doc ID by parent id #12510

Merged
benwtrent merged 3 commits into apache:main from benwtrent:feature/fix-parent-block-join-query
Aug 16, 2023
@benwtrent
Member

While integrating, I discovered a frustrating bug :(

The current query returns parent IDs based on the score of the nearest child. However, it's difficult to invert that relationship (that is, to determine exactly which child was the nearest during the search).

So, I changed the new ToParentBlockJoin[Byte|Float]KnnVectorQuery to return the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.
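To illustrate what "return the nearest child, diversified by parent" means, here is a minimal sketch. It is not the Lucene implementation (which diversifies while the kNN search is running); it only post-processes a list of already-scored child hits. The `ChildHit` type and the `bestChildPerParent` helper are hypothetical names, and `java.util.BitSet` stands in for the parent bit set, with one bit set per parent doc ID and children stored immediately before their parent.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiversifyByParentSketch {
    /** Hypothetical helper type: a child doc ID and its similarity score. */
    static final class ChildHit {
        final int docId;
        final float score;

        ChildHit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Keeps only the highest-scoring child for each parent and returns up to k
     * of those children, best score first. Assumes block-join layout: the
     * parent of a child is the next set bit at or after the child's doc ID.
     */
    static List<ChildHit> bestChildPerParent(List<ChildHit> hits, BitSet parents, int k) {
        Map<Integer, ChildHit> bestByParent = new HashMap<>();
        for (ChildHit hit : hits) {
            int parent = parents.nextSetBit(hit.docId);
            if (parent < 0) {
                continue; // no parent follows this doc, so it is not part of a block
            }
            // Keep whichever child of this parent has the higher score.
            bestByParent.merge(parent, hit, (old, cur) -> cur.score > old.score ? cur : old);
        }
        List<ChildHit> result = new ArrayList<>(bestByParent.values());
        result.sort((a, b) -> Float.compare(b.score, a.score)); // descending by score
        return new ArrayList<>(result.subList(0, Math.min(k, result.size())));
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(3); // block 1: children 0..2, parent 3
        parents.set(7); // block 2: children 4..6, parent 7

        List<ChildHit> hits = new ArrayList<>();
        hits.add(new ChildHit(0, 0.9f));
        hits.add(new ChildHit(2, 0.7f));
        hits.add(new ChildHit(5, 0.8f));
        hits.add(new ChildHit(4, 0.3f));

        for (ChildHit hit : bestChildPerParent(hits, parents, 2)) {
            System.out.println("child doc " + hit.docId + " score " + hit.score);
        }
    }
}
```

With the sample data in `main`, the two surviving hits are child 0 (for parent 3) and child 5 (for parent 7): one result per parent, but identified by the child doc ID rather than the parent doc ID.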

Now it's easy to determine the nearest child vector, since that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.
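As a rough sketch of that lookup: in a block-join index the children of a block are indexed immediately before their parent, so the parent of a child is the first set bit at or after the child's doc ID. The example below uses `java.util.BitSet` as a stand-in for the parent bit set (in Lucene the bit set is per segment and doc IDs are segment-local); `parentOf` and `firstChildOf` are hypothetical helper names.

```java
import java.util.BitSet;

class ParentLookupSketch {
    /**
     * Returns the parent doc ID for a child doc ID, or -1 if no parent follows.
     * Assumes children are stored directly before their parent in the block.
     */
    static int parentOf(int childDocId, BitSet parents) {
        return parents.nextSetBit(childDocId);
    }

    /**
     * Returns the first child doc ID of the block that ends at parentDocId:
     * one past the previous parent, or 0 if this is the first block.
     */
    static int firstChildOf(int parentDocId, BitSet parents) {
        // previousSetBit returns -1 when there is no earlier parent, so +1 yields 0.
        return parents.previousSetBit(parentDocId - 1) + 1;
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(3); // block 1: children 0..2, parent 3
        parents.set(7); // block 2: children 4..6, parent 7

        int nearestChild = 5; // e.g. a child doc ID returned by the kNN query
        int parent = parentOf(nearestChild, parents);
        System.out.println("child " + nearestChild + " belongs to parent " + parent);
        System.out.println("that block starts at doc " + firstChildOf(parent, parents));
    }
}
```

Going from child to parent this way is a single bit-set scan, whereas going from a parent back to "which child was nearest" would require rescoring every child in the block, which is the inversion problem described above.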

I realize that this might make the name weird, and I am happy to change it. All the "join" names are confusing to me already.

Since this iterates on an unreleased query (related to #12434), I am not adding a changelog entry.

Contributor

@jimczi jimczi left a comment

The change looks good, and I agree that the naming can be confusing.
Here are some possible alternatives:

  • DiversifyingKnn(Collector|VectorQuery)
  • DiversifyingChildrenKnn...
  • CollapsingKnn...
  • CollapsingChildren...

Naming is hard.

/** kNN byte vector query that joins matching children vector documents with their parent doc id. */
/**
* kNN byte vector query that joins matching children vector documents with their parent doc id. The
* top documents returned are the child document ids and the calculated scores.

Maybe add an example on how to mix with root document queries? Something like:

ToParentBlockJoinByteKnnVectorQuery childQuery = ...
Query query = new ToParentBlockJoinQuery(childQuery, parentsFilter, ...)
...

?

/**
* kNN float vector query that joins matching children vector documents with their parent doc id.
* The top documents returned are the child document ids and the calculated scores.
*/

Same here?

@benwtrent benwtrent merged commit 4174b52 into apache:main Aug 16, 2023
@benwtrent benwtrent deleted the feature/fix-parent-block-join-query branch August 16, 2023 17:44
benwtrent added a commit that referenced this pull request Aug 16, 2023
…rn highest score child doc ID by parent id (#12510)

The current query returns parent IDs based on the score of the nearest child. However, it's difficult to invert that relationship (that is, to determine exactly which child was the nearest during the search).

So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.

Now it's easy to determine the nearest child vector, since that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.

Related to: #12434