LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton by rmuir · Pull Request #485 · apache/lucene

rmuir · 2021-11-29T09:26:40Z

Instead of determinizing inside CompiledAutomaton/RunAutomaton, require that incoming automata are already determinized by the caller, and throw an exception if they are not.

This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.

The idea is that we clean this one up and fold this into the main LUCENE-10010 PR, to keep the APIs simple. But we could also merge it independently first.
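
To illustrate the contract described above, here is a minimal, self-contained sketch of a runner that refuses non-deterministic input rather than determinizing it internally. `ToyAutomaton` and `ToyRunAutomaton` are hypothetical stand-ins written for this example; they are not Lucene's `Automaton` or `RunAutomaton`, and the exception type and message are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model, not Lucene's class: an automaton over chars with labeled transitions. */
final class ToyAutomaton {
  // transitions.get(s) holds {label, destination} pairs leaving state s
  private final List<List<int[]>> transitions = new ArrayList<>();
  private final Set<Integer> accept = new HashSet<>();

  int createState() {
    transitions.add(new ArrayList<>());
    return transitions.size() - 1;
  }

  void addTransition(int from, char label, int to) {
    transitions.get(from).add(new int[] {label, to});
  }

  void setAccept(int state) {
    accept.add(state);
  }

  boolean isAccept(int state) {
    return accept.contains(state);
  }

  /** Deterministic here means: no state has two transitions on the same label. */
  boolean isDeterministic() {
    for (List<int[]> out : transitions) {
      Set<Integer> seen = new HashSet<>();
      for (int[] t : out) {
        if (!seen.add(t[0])) {
          return false;
        }
      }
    }
    return true;
  }

  /** Returns the destination for (state, label), or -1 if there is none. */
  int step(int state, char label) {
    for (int[] t : transitions.get(state)) {
      if (t[0] == label) {
        return t[1];
      }
    }
    return -1;
  }
}

/** Toy runner that rejects non-deterministic input instead of determinizing it itself. */
final class ToyRunAutomaton {
  private final ToyAutomaton automaton;

  ToyRunAutomaton(ToyAutomaton automaton) {
    if (!automaton.isDeterministic()) {
      // Assumed behavior for this sketch: fail fast and leave determinization to the caller.
      throw new IllegalArgumentException("please determinize the incoming automaton first");
    }
    this.automaton = automaton;
  }

  boolean run(String s) {
    int state = 0; // state 0 is the initial state by convention in this sketch
    for (int i = 0; i < s.length() && state != -1; i++) {
      state = automaton.step(state, s.charAt(i));
    }
    return state != -1 && automaton.isAccept(state);
  }
}
```

With this shape the caller decides when and how to determinize, and the runner needs no extra boolean or enum to describe what it was given.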

zhaih · 2021-11-29T23:31:09Z

lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java

@@ -65,7 +63,7 @@ public class AutomatonQuery extends MultiTermQuery implements Accountable {
   * @param automaton Automaton to run, terms that are accepted are considered a match.
   */
  public AutomatonQuery(final Term term, Automaton automaton) {


One thing I'm a little worried about: once this PR is pushed together with #225 (the main NFA PR), a user who previously passed an NFA to this constructor will not see an exception but may see a performance regression (due to the switch from DFA to NFA execution).
I think this is another reason why I tried to make it "hard" to use AutomatonQuery in NFA mode.

Well, AutomatonQuery, CompiledAutomaton, and RunAutomaton are all low-level stuff. Few users touch them directly. We can mitigate any risk by documenting the change in MIGRATE.txt, etc.

Instead, most users interact with subclasses (e.g. PrefixQuery, WildcardQuery, TermRangeQuery, FuzzyQuery, RegexpQuery) ...

I think if AutomatonQuery is to support both DFA and NFA, then it should simply call automaton.isDeterministic() to figure out what to do. Let the subclass control that, e.g. such flags could instead live on RegexpQuery, or we could make a different subclass (depending on how we decide). Special flags make no sense for many of these queries, such as TermRangeQuery, which always makes a DFA.

I like this approach! +1 to letting the friendlier AutomatonQuery subclasses control this, if appropriate. E.g. for PrefixQuery and WildcardQuery, determinizing is very low risk and probably a good idea. FuzzyQuery already makes a deterministic automaton, phew.
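
The dispatch idea discussed in this thread can be sketched in plain Java as follows. This is a toy model under stated assumptions: `ToyMatcher` is a hypothetical class written for this example (it is not Lucene's `AutomatonQuery`), state 0 is assumed to be the initial state, and there are no epsilon transitions. It only shows that the execution strategy can be derived from `isDeterministic()` without an extra flag.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model, not Lucene's API: a matcher that picks DFA or NFA execution by itself. */
final class ToyMatcher {
  // delta.get(state).get(label) = set of destination states
  private final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
  private final Set<Integer> accept = new HashSet<>();

  void addTransition(int from, char label, int to) {
    delta.computeIfAbsent(from, k -> new HashMap<>())
        .computeIfAbsent(label, k -> new HashSet<>())
        .add(to);
  }

  void setAccept(int state) {
    accept.add(state);
  }

  /** True when every (state, label) pair has at most one destination. */
  boolean isDeterministic() {
    for (Map<Character, Set<Integer>> byLabel : delta.values()) {
      for (Set<Integer> dests : byLabel.values()) {
        if (dests.size() > 1) {
          return false;
        }
      }
    }
    return true;
  }

  /** Names the strategy that matches() will use. */
  String strategy() {
    return isDeterministic() ? "DFA" : "NFA";
  }

  /** Chooses the execution strategy from the automaton itself, with no extra flag. */
  boolean matches(String s) {
    return isDeterministic() ? runDfa(s) : runNfa(s);
  }

  private Set<Integer> destinations(int state, char label) {
    Map<Character, Set<Integer>> byLabel =
        delta.getOrDefault(state, Collections.emptyMap());
    return byLabel.getOrDefault(label, Collections.emptySet());
  }

  // DFA execution: track exactly one current state.
  private boolean runDfa(String s) {
    int state = 0;
    for (char c : s.toCharArray()) {
      Set<Integer> dests = destinations(state, c);
      if (dests.isEmpty()) {
        return false;
      }
      state = dests.iterator().next();
    }
    return accept.contains(state);
  }

  // NFA execution: track the set of all states reachable so far.
  private boolean runNfa(String s) {
    Set<Integer> current = new HashSet<>();
    current.add(0);
    for (char c : s.toCharArray()) {
      Set<Integer> next = new HashSet<>();
      for (int state : current) {
        next.addAll(destinations(state, c));
      }
      current = next;
    }
    for (int state : current) {
      if (accept.contains(state)) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch the NFA path does more work per input character than the DFA path, which is the kind of silent performance difference the comment above worries about.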

mikemccand

I like this new approach! I agree this is cleaner for the NFAQuery change. It empowers the user to make the call (determinize up front or not) themselves, before calling the query.

There is some risk that users upgrade and "don't notice" that they must now determinize, but such direct users of AutomatonQuery are very expert and we should call this out clearly in the (new) javadocs.

So +1 to fold this into the NFAQuery PR!

zhaih · 2021-12-03T19:11:47Z

The change LGTM. I think we can push this independently, without waiting for the main NFA change, and it deserves a separate issue as well as a separate CHANGES entry?

CharacterRunAutomaton used to have a constructor that took an upper work limit for determinization states. This was changed in apache/lucene#485 and by default now uses 10000 as the limit (in RunAutomaton ctor). This PR determinizes the automaton passed in beforehand where we currently use a higher limit.

The determinization work limit was removed from the constructor with apache/lucene#485, and it is also no longer optional to pass in whether the automaton is finite or not. Assuming it is not seems to be the right choice according to apache/lucene#11813
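
As background for the notes above about determinizing up front under a limit, here is a small sketch of subset construction with a cap. It is a simplification under stated assumptions: `ToyDeterminizer` is a hypothetical helper, the cap counts DFA states created (Lucene's own work limit may be measured differently), the NFA is given as `{from, label, to}` triples with state 0 as the initial state, and only the number of resulting DFA states is returned.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy helper, not Lucene's Operations: subset construction with a state cap. */
final class ToyDeterminizer {

  /**
   * Runs subset construction starting from NFA state 0.
   *
   * @param edges NFA transitions as {from, label, to} triples
   * @param maxStates maximum number of DFA states allowed (assumed cap for this sketch)
   * @return the number of DFA states built
   * @throws IllegalStateException if the DFA would need more than maxStates states
   */
  static int determinize(int[][] edges, int maxStates) {
    // Each DFA state is a set of NFA states; ids assigns it a number.
    Map<Set<Integer>, Integer> ids = new HashMap<>();
    Deque<Set<Integer>> queue = new ArrayDeque<>();

    Set<Integer> start = new TreeSet<>();
    start.add(0);
    ids.put(start, 0);
    queue.add(start);

    while (!queue.isEmpty()) {
      Set<Integer> subset = queue.poll();

      // Group the NFA states reachable from this subset by transition label.
      Map<Integer, Set<Integer>> byLabel = new HashMap<>();
      for (int[] e : edges) {
        if (subset.contains(e[0])) {
          byLabel.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[2]);
        }
      }

      for (Set<Integer> dest : byLabel.values()) {
        if (!ids.containsKey(dest)) {
          if (ids.size() >= maxStates) {
            throw new IllegalStateException(
                "too complex: more than " + maxStates + " DFA states needed");
          }
          ids.put(dest, ids.size());
          queue.add(dest);
        }
      }
    }
    return ids.size();
  }
}
```

A caller that knows its automata are larger than usual can pass a higher cap before handing the result to a runner, which is the pattern the first note above describes.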

rmuir requested a review from mikemccand November 29, 2021 09:26

rmuir mentioned this pull request Nov 29, 2021

LUCENE-10010 Introduce NFARunAutomaton to run NFA directly #225

Merged

6 tasks

zhaih reviewed Nov 29, 2021

mikemccand approved these changes Dec 3, 2021

rmuir added 2 commits December 3, 2021 19:13

Merge branch 'main' into LUCENE-10010_stop_det_down_low

f711c4e

LUCENE-10010: javadocs, migrate, CHANGES

0d4768e

rmuir merged commit b2e866b into apache:main Dec 4, 2021

rmuir mentioned this pull request Dec 4, 2021

LUCENE-10010: don't determinize/minimize in RegExp #513

Merged

rmuir mentioned this pull request Jun 20, 2024

Reduce memory use of MinimizationOperations#minimize #13511

Merged

cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024

Fix Missing WildcardQuery#toAutomaton(Term) method

c9b6ffd

This method got a second parameter which is the determinization work limit in apache/lucene#485

mgodwan mentioned this pull request Jan 1, 2026

[BUG] Create Index Fails for index using simple_pattern tokenizer opensearch-project/OpenSearch#20349

Closed

rmuir merged 3 commits into apache:main from rmuir:LUCENE-10010_stop_det_down_low

3 participants
