Skip to content

LUCENE-10010: don't determinize/minimize in RegExp#513

Merged
rmuir merged 3 commits intoapache:mainfrom
rmuir:LUCENE-10010_regexp_determinize
Dec 8, 2021
Merged

LUCENE-10010: don't determinize/minimize in RegExp#513
rmuir merged 3 commits intoapache:mainfrom
rmuir:LUCENE-10010_regexp_determinize

Conversation

@rmuir
Copy link
Member

@rmuir rmuir commented Dec 4, 2021

Today, RegExp calls minimize() at every parsing step. There is little point to making an NFA execution when it is doing this.

Moreover, some minimize() calls are missing, and in fact in rare cases it can already return an NFA today (for certain syntax)

Instead, RegExp parsing should do none of this, instead it may return a DFA or NFA. NOTE: many simple regexps happen to be still returned as DFA, just because of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time.

All src/java callsites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary, but let's approach removing minimization as a separate PR. src/test was fixed to just use determinize() in preparation for this.

This PR is very similar to #485 but just for the regexp parsing.

One annoying change is we can't support true complement operator (~), which is an obscure non-default operator (must be enabled with flag). negated character classes (e.g. [^a-z]) still work as they are contained. But complementing arbitrary automata would require exponential time!

Today, RegExp calls minimize() at every parsing step. There is little
point to making an NFA execution when it is doing this.

Moreover, some minimize() calls are missing, and in fact in rare cases
it can already return an NFA today (for certain syntax)

Instead, RegExp parsing should do none of this, instead it may return a
DFA or NFA. NOTE: many simple regexps happen to be still returned as DFA, just
because of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should
not run in exponential time.

All src/java callsites were modified to call minimize(), to prevent any
performance problems. minimize() seems unnecessary, but let's approach
removing minimization as a separate PR. src/test was fixed to just use
determinize() in preparation for this.
It tries to test each symbol/node independently, to make it easier to maintain this code.
a =
Operations.repeat(
exp1.toAutomatonInternal(automata, automaton_provider, determinizeWorkLimit));
a = MinimizationOperations.minimize(a, determinizeWorkLimit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether there're some boundary cases where people can construct a near 10000 states minimized DFA before but not now? Like if a minimized component have 200 states and is repeated 50 times, but without minimization it would have >200 states and thus the minimization operation on the whole automaton will throw a TooComplexToDeterminize Exception?
But I guess this is not that important, since the user can control the determinizeWorkLimit anyway, and our default 10000 is a relatively high number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think the behavior is now a good thing. Previously, minimize() (and hence determinize(), too), could be called many times during parsing. The user may have set a work limit of 10000, but it could be exceeded in some cases, where it is called many times, e.g. 5000 + 5000 + 5000 + 5000 + 5000 + 5000.

So I think it is better to let the parser just be a parser and let the caller decide on what to do in a single place (minimize(), determinize(), or maybe nothing at all and use an NFA execution).

@zhaih
Copy link
Contributor

zhaih commented Dec 5, 2021

Thanks! This really makes NFA change a lot easier, when I was making that change I always complain that we're over-determinizing the automaton everywhere, and this one along with #485 solve the problem!

The change looks good, and I think we can push this independently as well? No need to merge it with #225 since both of them are fairly big PRs.

The new test case now exceeds 90% coverage of the regexp parser.
@rmuir rmuir merged commit 84e4b85 into apache:main Dec 8, 2021
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
Since apache/lucene#513 Lucenes RegEx class doesn't
minimize or determinize any more and lets the caller decide whether to perform
and of these operations. This change uses Operations.determinize instead at
the call site.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants