LUCENE-10296: Stop minimizing regepx by rmuir · Pull Request #528 · apache/lucene

rmuir · 2021-12-08T03:27:27Z

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see, what is happening is even worse than just calling determinize(). We still call minimize() which is much crazier and much more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState() we do a lot of practical sorting/coalescing on-the-fly. The practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than with UTF-16 representation? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner.

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore. But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize() which is much crazier and much more. We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also we added this fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than it was before? So why minimize() ? Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time. This change moves hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm.

mikemccand

Exciting simplification!

rmuir · 2021-12-09T01:52:35Z

I'm waiting a bit on https://issues.apache.org/jira/browse/LUCENE-10299 . I don't expect any regression, but I don't want to confuse it with the DocIdSetBuilder stuff

rmuir · 2021-12-09T02:44:05Z

LUCENE-10296: Stop minimizing regexps

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see, what is happening is even worse than just calling determinize(). We still call minimize() which is much crazier and much more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState() we do a lot of practical sorting/coalescing on-the-fly. The practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than with UTF-16 representation? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner.

I think the problem is understood, longs are insane for docids, I don't wish to hold changes up on stupid stuff....

Lucene 10 moved the MinizationOperations class into the test project and removed the last use of it in apache/lucene#528 by replacing minimization of automata by just determinizing them, replacing MinimizationOperations.minimize by Operations.determinize (see reasoning behind this in the Lucene PR). This change follows the same pattern where we don't already know the automaton is already deterministic.

Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.

rmuir requested a review from mikemccand December 8, 2021 03:27

rmuir mentioned this pull request Dec 8, 2021

LUCENE-10010 Introduce NFARunAutomaton to run NFA directly #225

Merged

6 tasks

mikemccand approved these changes Dec 8, 2021

View reviewed changes

rmuir merged commit 7a872c7 into apache:main Dec 9, 2021

asfimport mentioned this pull request Dec 9, 2021

Stop minimizing regexps [LUCENE-10296] #11332

Closed

rmuir mentioned this pull request Jun 20, 2024

Reduce memory use of MinimizationOperations#minimize #13511

Merged

cbuescher mentioned this pull request Oct 11, 2024

Rename methods in o.e.x.c.security.support.Automatons elastic/elasticsearch#114594

Merged

rmuir mentioned this pull request Feb 5, 2025

[BUG] Can (easily) run out of memory with case-insensitive term queries opensearch-project/OpenSearch#16975

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-10296: Stop minimizing regepx#528

LUCENE-10296: Stop minimizing regepx#528
rmuir merged 1 commit intoapache:mainfrom
rmuir:LUCENE-10296

rmuir commented Dec 8, 2021 •

edited

Loading

Uh oh!

mikemccand left a comment

Uh oh!

rmuir commented Dec 9, 2021

Uh oh!

rmuir commented Dec 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rmuir commented Dec 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikemccand left a comment

Choose a reason for hiding this comment

Uh oh!

rmuir commented Dec 9, 2021

Uh oh!

rmuir commented Dec 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rmuir commented Dec 8, 2021 •

edited

Loading