LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton#485
LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton#485rmuir merged 3 commits intoapache:mainfrom
Conversation
Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't. This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.
| @@ -65,7 +63,7 @@ public class AutomatonQuery extends MultiTermQuery implements Accountable { | |||
| * @param automaton Automaton to run, terms that are accepted are considered a match. | |||
| */ | |||
| public AutomatonQuery(final Term term, Automaton automaton) { | |||
There was a problem hiding this comment.
A thing that I'm a little bit worried about is, once this PR is pushed together with #225 (NFA main PR), then a user that previously using this constructor with NFA will not see an exception but a potential performance regression (due to switching from DFA to NFA)?
I think this is another reason why I tried to make it "hard" to use the AutomatonQuery in NFA mode.
There was a problem hiding this comment.
Well, AutomatonQuery, CompiledAutomaton, RunAutomaton is all low-level stuff. The number of users touching this stuff is low. We can mitigate any risks by documenting the change in MIGRATE.txt, etc.
Instead most users interact with subclasses (e.g. PrefixQuery, WildcardQuery, TermRangeQuery, FuzzyQuery, RegexpQuery) ...
I think if Automaton is to support both DFA and NFA, then it should simply call automaton.isDeterministic() to figure out what to do. Let the subclass control that, e.g. such flags could instead be on RegexpQuery maybe, or we could make a different subclass (depending on how we decide). Special Flags are senseless for a lot of these queries such as TermRangeQuery which always make a DFA.
There was a problem hiding this comment.
I like this approach! +1 to let the more friendly AutomatonQuery subclasses control this, if appropriate. E.g. for PrefixQuery and WildcardQuery, determinize is very low risk and probably a good idea. FuzzyQuery already makes a deterministic automaton, phew.
mikemccand
left a comment
There was a problem hiding this comment.
I like this new approach! I agree this is cleaner for the NFAQuery change. It empowers the user to make the call (determinize up front or not) themselves, before calling the query.
There is some risk that users upgrade and "don't notice" that they must not determinize, but such direct users of AutomatonQuery are very expert and we should call this out clearly in the (new) javadocs.
So +1 to fold this into the NFAQuery PR!
|
The change LGTM, I think we can push this independently and don't need to wait for the main NFA change and it deserves a separate issue as well as a separate CHANGES entry? |
CharacterRunAutomaton used to have a constructor that takes an upper work limit for determinization states. This was changed in apache/lucene#485 and by default now uses 10000 as the limit (in RunAutomaton ctor). This PR determinizes the automaton passed in beforehand where we currently use a higher limit.
The determinization work limit was removed from the contructor with apache/lucene#485 and also its not not optional anymore to pass in whether the automaton is finite or not. Assuming it is not seems to be the right choice according to apache/lucene#11813
This method got a second parameter which is the determinization work limit in apache/lucene#485
Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't.
This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.
The idea is that we clean this one up and fold this into the main LUCENE-10010 PR, to keep the APIs simple. But we could also merge it independently first.