LUCENE-10010: don't determinize/minimize in RegExp #513
Conversation
Today, RegExp calls minimize() at every parsing step. There is little point in attempting NFA execution when it does this.

Moreover, some minimize() calls are missing, and in rare cases parsing can in fact already return an NFA today (for certain syntax).

Instead, RegExp parsing should do none of this; it may return a DFA or an NFA. NOTE: many simple regexps still happen to be returned as DFAs, simply because of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time.

All src/java call sites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary there, but let's approach removing minimization as a separate PR. src/test was changed to just use determinize() in preparation for this.
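Since the parser may now hand back an NFA, a caller that does not want to pay for determinization can match input directly with a set-of-states simulation. The standalone sketch below (not Lucene code; the `NfaSketch` class and its methods are illustrative) shows how an epsilon-free NFA can be executed in O(input length × states) without ever building a DFA:

```java
import java.util.*;

// Minimal epsilon-free NFA: transitions.get(state) maps a character to successor states.
class NfaSketch {
    final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();
    final Set<Integer> acceptStates = new HashSet<>();

    int addState() {
        transitions.add(new HashMap<>());
        return transitions.size() - 1;
    }

    void addTransition(int from, char c, int to) {
        transitions.get(from).computeIfAbsent(c, k -> new HashSet<>()).add(to);
    }

    // Set-of-states simulation: track every state the NFA could be in, one input
    // character at a time. No determinization, so no exponential state blowup.
    boolean matches(String input) {
        Set<Integer> current = new HashSet<>(Set.of(0)); // state 0 is the start state
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(transitions.get(s).getOrDefault(c, Set.of()));
            }
            current = next;
            if (current.isEmpty()) return false; // dead end
        }
        for (int s : current) if (acceptStates.contains(s)) return true;
        return false;
    }

    public static void main(String[] args) {
        // NFA for (a|b)*ab: loop on state 0, then 0 -a-> 1 -b-> 2 (accept).
        NfaSketch nfa = new NfaSketch();
        int s0 = nfa.addState(), s1 = nfa.addState(), s2 = nfa.addState();
        nfa.addTransition(s0, 'a', s0);
        nfa.addTransition(s0, 'b', s0);
        nfa.addTransition(s0, 'a', s1);
        nfa.addTransition(s1, 'b', s2);
        nfa.acceptStates.add(s2);
        System.out.println(nfa.matches("bbab")); // true
        System.out.println(nfa.matches("ba"));   // false
    }
}
```

The simulation is per-query work proportional to the input, which is exactly the trade a caller can now make instead of paying determinization cost up front.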
The new test tries to exercise each symbol/node independently, to make this code easier to maintain.
```java
a =
    Operations.repeat(
        exp1.toAutomatonInternal(automata, automaton_provider, determinizeWorkLimit));
a = MinimizationOperations.minimize(a, determinizeWorkLimit);
```
I wonder whether there are boundary cases where people could construct a near-10000-state minimized DFA before but not now? For example, if a minimized component has 200 states and is repeated 50 times, without minimization each copy would have more than 200 states, and the minimization operation on the whole automaton would then throw a TooComplexToDeterminizeException?
But I guess this is not that important, since the user can control the determinizeWorkLimit anyway, and our default of 10000 is a relatively high number.
Yes, but I think the new behavior is a good thing. Previously, minimize() (and hence determinize(), too) could be called many times during parsing. The user may have set a work limit of 10000, but since the limit was checked per call, the total could exceed it, e.g. 5000 + 5000 + 5000 + 5000 + 5000 + 5000.
So I think it is better to let the parser just be a parser and let the caller decide on what to do in a single place (minimize(), determinize(), or maybe nothing at all and use an NFA execution).
Thanks! This really makes the NFA change a lot easier. When I was making that change I kept complaining that we were over-determinizing the automaton everywhere, and this PR, along with #485, solves the problem! The change looks good, and I think we can push this independently as well? No need to merge it with #225, since both of them are fairly big PRs.
The new test case now exceeds 90% coverage of the regexp parser.
Since apache/lucene#513, Lucene's RegExp class no longer minimizes or determinizes and lets the caller decide whether to perform any of these operations. This change uses Operations.determinize instead at the call site.
This PR is very similar to #485 but just for the regexp parsing.
One annoying change is that we can't support the true complement operator (~), which is an obscure, non-default operator (it must be enabled with a flag). Negated character classes (e.g. [^a-z]) still work, as their cost is contained. But complementing arbitrary automata would require exponential time!
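The exponential cost of complementing an arbitrary automaton comes from having to determinize it first, and determinization can blow up exponentially. A classic witness is the language "the n-th symbol from the end is 'a'": its NFA has n+1 states, but any equivalent DFA needs 2^n. The standalone sketch below (illustrative, not Lucene code; `BlowupSketch` is a made-up name) runs the subset construction on that NFA and counts the reachable DFA states:

```java
import java.util.*;

// Counts reachable DFA states when determinizing the NFA for
// "the n-th character from the end is 'a'" over the alphabet {a, b}.
// NFA: state 0 loops on both symbols; 'a' also moves 0 -> 1; both symbols
// move i -> i+1 for 1 <= i < n; state n accepts. Subsets are bitmasks.
class BlowupSketch {

    static int reachableDfaStates(int n) {
        long start = 1L; // subset containing only NFA state 0
        Set<Long> seen = new HashSet<>(Set.of(start));
        Deque<Long> queue = new ArrayDeque<>(seen);
        while (!queue.isEmpty()) {
            long states = queue.poll();
            for (char c : new char[] {'a', 'b'}) {
                long next = 1L;               // state 0 loops on every symbol
                if (c == 'a') next |= 2L;     // 0 -a-> 1
                for (int i = 1; i < n; i++) { // i -> i+1 on either symbol
                    if ((states & (1L << i)) != 0) next |= 1L << (i + 1);
                }
                if (seen.add(next)) queue.add(next);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " -> " + reachableDfaStates(n)); // 2^n states
        }
    }
}
```

The NFA itself only needs to remember "did I just see an 'a' that might be n-from-the-end", but the DFA must remember the last n characters, hence 2^n states; a negated character class never triggers this because its complement is just another character class.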