Skip to content

TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection #12176

@rmuir

Description

@rmuir

Description

TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary.

Instead of sorted-list, it could possibly use Daciuk Mihov Automaton, which can be built in linear time. Then query could leverage Terms.intersect (e.g. TermInSetQuery could be an AutomatonQuery subclass).

This should give faster intersection of the terms, which is usually the heavy part of this query. For example BlockTree terms dictionary has a very efficient Terms.intersect that makes use of the underlying structure.

The annoying part: DaciukMihovAutomatonBuilder currently requires unicode strings and makes a UTF-32 automaton, which would then be converted to UTF-8 (binary) automaton via UTF32ToUTF8. But I think TermInSetQuery may allow arbitrary non-unicode binary strings?

In order to support arbitrarily binary terms (and to avoid conversions), the DaciukMihov code would have to modified, to support construction of a binary automaton directly. Probably this is actually simpler?

This is just an idea to get more performance, it hasn't been tested. feel free to close the issue if it doesnt work out.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions