TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection

### Description

TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.

Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage `Terms.intersect` (e.g. TermInSetQuery could be an AutomatonQuery subclass).

This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient `Terms.intersect` that makes use of the underlying structure.

The annoying part: `DaciukMihovAutomatonBuilder` currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton via `UTF32ToUTF8`. But I think `TermInSetQuery` may allow arbitrary non-unicode binary strings?

In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler?

This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection #12176

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection #12176

Description

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions