SOLR-17319: CompoundQueryComponent scribbles continued with CompoundQueryComponentTest#3648
SOLR-17319: CompoundQueryComponent scribbles continued with CompoundQueryComponentTest#3648cpoerschke wants to merge 34 commits intoapache:mainfrom
Conversation
…Component to use it)
…ueryComponent to use it
…deprecate direct access' changes for illustrative purposes
solr/core/src/java/org/apache/solr/handler/component/CompoundResponseBuilder.java
Outdated
Show resolved
Hide resolved
…getParameterPrefix
dsmiley
left a comment
There was a problem hiding this comment.
I freaking love the elegance / simplicity here. It re-uses / leverages the existing distributed search phases of the SearchComponent design (as I hinted in JIRA may be possible), and meanwhile making that design more flexible for this new use-case. And reuses TopDocs.rrf logic :-) I imagine this approach is doing query work in parallel without any extra code? That said, clearly this is a draft / POC. Perhaps there are serious flaws that aren't easily evident in a code review. Testing was minimal. I see no accommodations for faceting to ensure we don't wastefully compute the docSet.
@ercsonusharma Did you see this? Please give this a good review. I suspect the best solution may be a combination of both efforts. For example your (recently redone) tests, documentation, user experience (how to use it) are nice.
I'd be tempted to run both solutions with distributed tracing enabled to a local Zipkin/Jaeger to visualize what's going on. I used to have a shelved code snippet to instrument a test to do this.
solr/core/src/java/org/apache/solr/handler/component/CompoundQueryComponent.java
Outdated
Show resolved
Hide resolved
Yes, that's my understanding too.
Yeah, my understanding is that with this sort of approach there is conceptually no (obvious) support for faceting. Today, Solr's client would make two queries and then outside-of-Solr do the RRF locally -- conceptually that is what the approach here does too. Because the query work is in parallel and the doc sets are (I think) not passed back to where the fusion happens, then truly-correct faceting would not implicitly be available. ... ... having said that, currently the sub-queries are there and there are only sub-queries, so I imagine a fusion-aware faceting component might do an OR of the sub-queries and then send that for faceting purposes only i.e. with |
solr/core/src/java/org/apache/solr/handler/component/CompoundQueryComponent.java
Show resolved
Hide resolved
| final TopDocs[] hits = new TopDocs[crb.responseBuilders.size()]; | ||
| for (int crb_idx = 0; crb_idx < crb.responseBuilders.size(); ++crb_idx) { | ||
|
|
||
| final SolrDocumentList sdl = crb.responseBuilders.get(crb_idx).getResponseDocs(); |
There was a problem hiding this comment.
I am wondering where is this being populated?
| if (rb instanceof CompoundResponseBuilder crb) { | ||
| for (var rb_i : crb.responseBuilders) { | ||
| if (rb_i.isThisFromMe(sreq)) { | ||
| super.handleResponses(rb_i, sreq); |
There was a problem hiding this comment.
Are we using mergeIds on each of the crb and overriding the rb.resultIds?
|
Thanks for the draft @cpoerschke - appreciate it! Below are my initial observations:
I already checked TopDocs.rrf, and it is meant for rrf on TopDocs, especially for Lucene. I find it less effort in creating something ShardDoc.rrf rather than converting the SolrDocumentList to TopcDocs and back.
Query work is already happening in parallel by design of the distributed process. The things that need to be parallelised is merging the docs per query post querying, which I see sequentially here as well. I'm really doubtful at this point in time as to how the faceting & highlighting would work. |
…ghlightComponent to use it)
…et-fields finishStage did it i.e. now it happens after instead of before the fusion)
…ghtComponent to use it)
…nent and other (small) tweaks to match, for deterministic highlighting when document found for both sub-queries;
| @@ -212,27 +212,31 @@ public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {} | |||
|
|
|||
| @Override | |||
| public void finishStage(ResponseBuilder rb) { | |||
There was a problem hiding this comment.
'Hide whitespace' diff mode makes it easier to see that the changes to this method are small (and extremely subtle, should add comments really, later).
|
Thanks @ercsonusharma for starting to take a look here! Yes, there's relatively little code but I totally appreciate that it's not easy to understand.
That's an interesting idea, thanks @dsmiley for sharing! @hossman's "Lifecycle of a Solr Search Request" talk slides and recording from quite-a-while-ago-now also spring to mind for me here. Brief replies to some of your initial observations.
Yes, for each component it will need to be considered how it should behave in a fusion scenario and whether or not it should participate in the fusion stage, or just skip it, which is the implied default.
The distributed process provides parallelism on the shard level e.g. if we have 10 shards then all 10 will run in parallel. Within each shard however, as I understand the current #3418 code, the The parallelism here on #3648 is on the shard level and also on the sub-queries level:
Answering only on the highlighting work at this time, I've advanced the proof-of-concept further so that the test case does cover highlighting. The highlighting test passes and it may help to develop an understanding on how that extra fusion stage fits into the picture.
@dsmiley - I don't suppose you'd be in a position to dig up or dust off that code snippet? |
ca3088d explores this approach albeit without changes to the faceting component. |
|
|
|
||
| public class CompoundResponseBuilder extends ResponseBuilder { | ||
|
|
||
| public static final int STAGE_FUSION = STAGE_GET_FIELDS + 1; |
There was a problem hiding this comment.
Ideally, fusion would come before STAGE_GET_FIELDS so that we don't bother getting fields of docs that don't make the top-X. That's a 2x difference of doc field fetching.
| // undo any 'facets=true' that FacetComponent.modifyRequest might have done | ||
| sreq.params.remove(FacetParams.FACET, "true"); | ||
| sreq.purpose &= ~ShardRequest.PURPOSE_GET_FACETS; |
There was a problem hiding this comment.
This special case is an unfortunate wrinkle of imperfection on the simplicity of the whole approach. But it's just a couple lines.
continuation of the earlier #2597 scribbles
https://issues.apache.org/jira/browse/SOLR-17319