LUCENE-10010: don't determinize/minimize in RegExp #513
Conversation
Today, RegExp calls minimize() at every parsing step. There is little point in attempting NFA execution when it does this.

Moreover, some minimize() calls are missing, and in rare cases parsing can in fact already return an NFA today (for certain syntax).

Instead, RegExp parsing should do none of this; it may return a DFA or an NFA. NOTE: many simple regexps still happen to be returned as DFAs, simply because of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time.

All src/java call sites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary there, but let's approach removing minimization as a separate PR. src/test was changed to just use determinize() in preparation for this.
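Since the parser may now hand back an NFA, a caller that does not want to pay for determinization can match input directly with a set-of-states simulation. The standalone sketch below (not Lucene code; the `NfaSketch` class and its methods are illustrative) shows how an epsilon-free NFA can be executed in O(input length × states) without ever building a DFA:

```java
import java.util.*;

// Minimal epsilon-free NFA: transitions.get(state) maps a character to successor states.
class NfaSketch {
    final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();
    final Set<Integer> acceptStates = new HashSet<>();

    int addState() {
        transitions.add(new HashMap<>());
        return transitions.size() - 1;
    }

    void addTransition(int from, char c, int to) {
        transitions.get(from).computeIfAbsent(c, k -> new HashSet<>()).add(to);
    }

    // Set-of-states simulation: track every state the NFA could be in, one input
    // character at a time. No determinization, so no exponential state blowup.
    boolean matches(String input) {
        Set<Integer> current = new HashSet<>(Set.of(0)); // state 0 is the start state
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(transitions.get(s).getOrDefault(c, Set.of()));
            }
            current = next;
            if (current.isEmpty()) return false; // dead end
        }
        for (int s : current) if (acceptStates.contains(s)) return true;
        return false;
    }

    public static void main(String[] args) {
        // NFA for (a|b)*ab: loop on state 0, then 0 -a-> 1 -b-> 2 (accept).
        NfaSketch nfa = new NfaSketch();
        int s0 = nfa.addState(), s1 = nfa.addState(), s2 = nfa.addState();
        nfa.addTransition(s0, 'a', s0);
        nfa.addTransition(s0, 'b', s0);
        nfa.addTransition(s0, 'a', s1);
        nfa.addTransition(s1, 'b', s2);
        nfa.acceptStates.add(s2);
        System.out.println(nfa.matches("bbab")); // true
        System.out.println(nfa.matches("ba"));   // false
    }
}
```

The simulation is per-query work proportional to the input, which is exactly the trade a caller can now make instead of paying determinization cost up front.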
The new test tries to exercise each symbol/node independently, to make this code easier to maintain.
```java
a =
    Operations.repeat(
        exp1.toAutomatonInternal(automata, automaton_provider, determinizeWorkLimit));
a = MinimizationOperations.minimize(a, determinizeWorkLimit);
```
I wonder whether there are boundary cases where people could construct a near-10000-state minimized DFA before but not now? For example, if a minimized component has 200 states and is repeated 50 times, without minimization each copy would have more than 200 states, and the minimization operation on the whole automaton would then throw a TooComplexToDeterminizeException?
But I guess this is not that important, since the user can control the determinizeWorkLimit anyway, and our default of 10000 is a relatively high number.
Yes, but I think the new behavior is a good thing. Previously, minimize() (and hence determinize(), too) could be called many times during parsing. The user may have set a work limit of 10000, but since the limit was checked per call, the total could exceed it, e.g. 5000 + 5000 + 5000 + 5000 + 5000 + 5000.
So I think it is better to let the parser just be a parser and let the caller decide on what to do in a single place (minimize(), determinize(), or maybe nothing at all and use an NFA execution).
Thanks! This really makes the NFA change a lot easier. When I was making that change I kept complaining that we were over-determinizing the automaton everywhere, and this PR, along with #485, solves the problem! The change looks good, and I think we can push this independently as well? No need to merge it with #225, since both of them are fairly big PRs.
The new test case now exceeds 90% coverage of the regexp parser.
Since apache/lucene#513, Lucene's RegExp class no longer minimizes or determinizes and lets the caller decide whether to perform any of these operations. This change uses Operations.determinize instead at the call site.
This PR is very similar to #485 but just for the regexp parsing.
One annoying change is that we can't support the true complement operator (~), which is an obscure, non-default operator (it must be enabled with a flag). Negated character classes (e.g. [^a-z]) still work, as their cost is contained. But complementing arbitrary automata would require exponential time!
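The exponential cost of complementing an arbitrary automaton comes from having to determinize it first, and determinization can blow up exponentially. A classic witness is the language "the n-th symbol from the end is 'a'": its NFA has n+1 states, but any equivalent DFA needs 2^n. The standalone sketch below (illustrative, not Lucene code; `BlowupSketch` is a made-up name) runs the subset construction on that NFA and counts the reachable DFA states:

```java
import java.util.*;

// Counts reachable DFA states when determinizing the NFA for
// "the n-th character from the end is 'a'" over the alphabet {a, b}.
// NFA: state 0 loops on both symbols; 'a' also moves 0 -> 1; both symbols
// move i -> i+1 for 1 <= i < n; state n accepts. Subsets are bitmasks.
class BlowupSketch {

    static int reachableDfaStates(int n) {
        long start = 1L; // subset containing only NFA state 0
        Set<Long> seen = new HashSet<>(Set.of(start));
        Deque<Long> queue = new ArrayDeque<>(seen);
        while (!queue.isEmpty()) {
            long states = queue.poll();
            for (char c : new char[] {'a', 'b'}) {
                long next = 1L;               // state 0 loops on every symbol
                if (c == 'a') next |= 2L;     // 0 -a-> 1
                for (int i = 1; i < n; i++) { // i -> i+1 on either symbol
                    if ((states & (1L << i)) != 0) next |= 1L << (i + 1);
                }
                if (seen.add(next)) queue.add(next);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " -> " + reachableDfaStates(n)); // 2^n states
        }
    }
}
```

The NFA itself only needs to remember "did I just see an 'a' that might be n-from-the-end", but the DFA must remember the last n characters, hence 2^n states; a negated character class never triggers this because its complement is just another character class.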