Skip to content

LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton#485

Merged
rmuir merged 3 commits intoapache:mainfrom
rmuir:LUCENE-10010_stop_det_down_low
Dec 4, 2021
Merged

LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton#485
rmuir merged 3 commits intoapache:mainfrom
rmuir:LUCENE-10010_stop_det_down_low

Conversation

@rmuir
Copy link
Member

@rmuir rmuir commented Nov 29, 2021

Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't.

This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.

The idea is that we clean this one up and fold this into the main LUCENE-10010 PR, to keep the APIs simple. But we could also merge it independently first.

Instead, require that incoming automata is determinized by the caller,
throwing an exception if it isn't.

This paves the way for NFA execution in the future: if you pass an NFA
to AutomatonQuery, we should use the NFA algorithm on it. No need for
lots of booleans or enums.
@@ -65,7 +63,7 @@ public class AutomatonQuery extends MultiTermQuery implements Accountable {
* @param automaton Automaton to run, terms that are accepted are considered a match.
*/
public AutomatonQuery(final Term term, Automaton automaton) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thing that I'm a little bit worried about is, once this PR is pushed together with #225 (NFA main PR), then a user that previously using this constructor with NFA will not see an exception but a potential performance regression (due to switching from DFA to NFA)?
I think this is another reason why I tried to make it "hard" to use the AutomatonQuery in NFA mode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, AutomatonQuery, CompiledAutomaton, RunAutomaton is all low-level stuff. The number of users touching this stuff is low. We can mitigate any risks by documenting the change in MIGRATE.txt, etc.

Instead most users interact with subclasses (e.g. PrefixQuery, WildcardQuery, TermRangeQuery, FuzzyQuery, RegexpQuery) ...

I think if Automaton is to support both DFA and NFA, then it should simply call automaton.isDeterministic() to figure out what to do. Let the subclass control that, e.g. such flags could instead be on RegexpQuery maybe, or we could make a different subclass (depending on how we decide). Special Flags are senseless for a lot of these queries such as TermRangeQuery which always make a DFA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach! +1 to let the more friendly AutomatonQuery subclasses control this, if appropriate. E.g. for PrefixQuery and WildcardQuery, determinize is very low risk and probably a good idea. FuzzyQuery already makes a deterministic automaton, phew.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new approach! I agree this is cleaner for the NFAQuery change. It empowers the user to make the call (determinize up front or not) themselves, before calling the query.

There is some risk that users upgrade and "don't notice" that they must not determinize, but such direct users of AutomatonQuery are very expert and we should call this out clearly in the (new) javadocs.

So +1 to fold this into the NFAQuery PR!

@zhaih
Copy link
Contributor

zhaih commented Dec 3, 2021

The change LGTM, I think we can push this independently and don't need to wait for the main NFA change and it deserves a separate issue as well as a separate CHANGES entry?

@rmuir rmuir merged commit b2e866b into apache:main Dec 4, 2021
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
CharacterRunAutomaton used to have a constructor that takes an upper work limit
for determinization states. This was changed in
apache/lucene#485 and by default now uses 10000 as the
limit (in RunAutomaton ctor). This PR determinizes the automaton passed in
beforehand where we currently use a higher limit.
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
The determinization work limit was removed from the contructor with
apache/lucene#485 and also its not not optional anymore
to pass in whether the automaton is finite or not. Assuming it is not seems to
be the right choice according to apache/lucene#11813
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
This method got a second parameter which is the determinization
work limit in apache/lucene#485
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants