Skip to content

SOLR-17736: introducing support for KNN search on nested vectors (block join)#3316

Merged
alessandrobenedetti merged 26 commits intoapache:mainfrom
SeaseLtd:feature/SOLR-17736
Dec 18, 2025
Merged

SOLR-17736: introducing support for KNN search on nested vectors (block join)#3316
alessandrobenedetti merged 26 commits intoapache:mainfrom
SeaseLtd:feature/SOLR-17736

Conversation

@alessandrobenedetti
Copy link
Copy Markdown
Contributor

@alessandrobenedetti alessandrobenedetti commented Apr 8, 2025

https://issues.apache.org/jira/browse/SOLR-17736

Description

Introducing support for proper KNN search on children/parents block joins.

Solution

Surface Lucene: apache/lucene#12434

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@github-actions github-actions bot added documentation Improvements or additions to documentation tests cat:search labels Apr 8, 2025
vectorField, queryVector, acceptedChildren, topK, allParentsBitSet);
return new ToParentBlockJoinQuery(
knnChildren, allParentsBitSet, ScoreModeParser.parse(scoreMode));
} else if (isFloatKnnQuery(childrenClauses)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the branching on type of Knn query necessary? As far as I can see, the majority of the code blocks are equivalent and the different instance variables of KnnByteVectorQuery/KnnFloatVectorQuery are being used like its generic parent class

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @liangkaiwen, the parent class org.apache.lucene.search.AbstractKnnVectorQuery is not public and also when building the DiversifyingChildrenFloatKnnVectorQuery/DiversifyingChildrenByteKnnVectorQuery I am afraid the if branching can not be avoided?

I agree there's boring repeated code, not sure we can avoid it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I did not realize the parent class was package private. Maybe there's an argument to be made to make it public so as not to introduce this kind of branching everywhere, but that's a separate lucene question. This is fine

Comment on lines +148 to +150
} catch (IOException e) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious about the addition of a try/catch block here, perhaps we could have a comment w.r.t. what it might catch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just double checked and it's related to the searcher.rewrite and the searcher.getDocSet, they throw IOException, not sure how much we can comment here

Comment on lines +181 to +189
private boolean isFloatKnnQuery(List<BooleanClause> childrenClauses) {
return childrenClauses.size() == 1
&& childrenClauses.get(0).getQuery().getClass().equals(KnnFloatVectorQuery.class);
}

private boolean isByteKnnQuery(List<BooleanClause> childrenClauses) {
return childrenClauses.size() == 1
&& childrenClauses.get(0).getQuery().getClass().equals(KnnByteVectorQuery.class);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if childrenClauses.get(0).getQuery() instances KnnFloatVectorQuery and childrenClauses.get(0).getQuery() instances KnnByteVectorQuery would work too?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if true is returned then later the caller does (repeat) childrenClauses.get(0).getQuery() and type cast instead of check -- perhaps this could be encapsulated

private KnnFloatVectorQuery getFloatKnnQuery(List<BooleanClause> childrenClauses) {
    if (childrenClauses.size() == 1)
    {
        Query query = childrenClauses.get(0).getQuery();
        if (query instanceof KnnFloatVectorQuery)
        {
            return (KnnFloatVectorQuery) query;
        }
    }
    return null;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, take a look if you like it, it may be a little bit less readable but I think it's ok!

alessandrobenedetti and others added 4 commits April 16, 2025 17:44
…QParser.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…QParser.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
@github-actions
Copy link
Copy Markdown

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Jun 16, 2025
@alessandrobenedetti
Copy link
Copy Markdown
Contributor Author

I'll resume this soon, some sponsorship should arrive and my agenda looks more relaxed starting from July

@github-actions github-actions bot removed the stale PR not updated in 60 days label Jun 25, 2025
@github-actions
Copy link
Copy Markdown

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Aug 24, 2025
@github-actions
Copy link
Copy Markdown

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions bot added the closed-stale Closed after being stale for 60 days label Oct 23, 2025
@alessandrobenedetti
Copy link
Copy Markdown
Contributor Author

I'll resume shortly the work on this!

@github-actions github-actions bot removed closed-stale Closed after being stale for 60 days stale PR not updated in 60 days labels Oct 24, 2025

This may happen, among other use cases, because you chunked the original text into paragraphs, each of them modeled as a nested document with the paragraph text and the vector representation.

Solr doesn't need to have denormalised nested documents, you can still retrieve the children paragraphs by knn vector search and prefilter them using parent level metadata.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solr doesn't need to have denormalised nested documents

That sentence fragment confuses me. What does "denomalized nested documents" mean? Do you mean simply that Solr doesn't support multiValued vector fields? If so, just say that please.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 100%, I wrote it a few months ago but I don't understand it now, a clear sign that a rephrase is necessary.
Possibly I meant to do it as the work in this PR needed to be finished, anyway, I'll spend some time on it very soon, and update the PR (and comments)


[source,text]
----
$ curl 'http://localhost:8983/solr/gettingstarted/select' -d 'omitHeader=true' --data-urlencode 'fq={!child of=$block_mask filters=$parentsFilter}&q={!knn f=childVectorField topK=5}[1.0,2.5,3.0...]' --data-urlencode 'block_mask=(*:* -{!prefix f="_nest_path_" v="/skus/" parentsFilter="name_s:pen"})'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new line per param (thus end the line with \) would be helpful to make this more readable. But mostly I think this complicated query needs an english description preceding it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I'll sort this out and update the PR soon

@alessandrobenedetti
Copy link
Copy Markdown
Contributor Author

I'm resuming the work on this briefly tomorrow, and then at full capacity the week of the 1st of December. Thanks, David, for the pointers. I'll carefully read them and respond soon.

@alessandrobenedetti
Copy link
Copy Markdown
Contributor Author

A first revised draft is fully green after the refactor following @hossman suggestions.

I'll keep it open for a few days, refine it a bit and then merge it unless critical observations!

@alessandrobenedetti alessandrobenedetti merged commit 26457c3 into apache:main Dec 18, 2025
5 checks passed
alessandrobenedetti added a commit that referenced this pull request Dec 18, 2025
…ck join) (#3316)

* first draft

* Only Nested Vectors changes

* first tests draft, parent filter and children filter missing as a test

* tests cleaned

* code cleanup

* draft documentation

* tidy

* tidy

* add best child per document transformer

* add best child per document transformer

* add best child per document transformer

* minor refinement to avoid some instructions

* minor refactor

* Update solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

* Update solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

* new approach following feedback

* tests fixed for the new approach

* tidy + documentation

* tidy + documentation

* tidy + documentation

* tidy + documentation

* minor

* renaming after feedback

---------

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
(cherry picked from commit 26457c3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI cat:search documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants