Skip to content

LUCENE-10296: Stop minimizing regepx#528

Merged
rmuir merged 1 commit intoapache:mainfrom
rmuir:LUCENE-10296
Dec 9, 2021
Merged

LUCENE-10296: Stop minimizing regepx#528
rmuir merged 1 commit intoapache:mainfrom
rmuir:LUCENE-10296

Conversation

@rmuir
Copy link
Member

@rmuir rmuir commented Dec 8, 2021

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see, what is happening is even worse than just calling determinize(). We still call minimize() which is much crazier and much more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState() we do a lot of practical sorting/coalescing on-the-fly. The practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than with UTF-16 representation? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner.

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize() which is much crazier and much more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also we added this fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than it was before? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time.

This change moves hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's
algorithm.
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting simplification!

@rmuir
Copy link
Member Author

rmuir commented Dec 9, 2021

I'm waiting a bit on https://issues.apache.org/jira/browse/LUCENE-10299 . I don't expect any regression, but I don't want to confuse it with the DocIdSetBuilder stuff

@rmuir
Copy link
Member Author

rmuir commented Dec 9, 2021

LUCENE-10296: Stop minimizing regexps

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see, what is happening is even worse than just calling determinize(). We still call minimize() which is much crazier and much more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState() we do a lot of practical sorting/coalescing on-the-fly. The practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than with UTF-16 representation? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner.

I think the problem is understood, longs are insane for docids, I don't wish to hold changes up on stupid stuff....

@rmuir rmuir merged commit 7a872c7 into apache:main Dec 9, 2021
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Aug 28, 2024
Lucene 10 moved the MinizationOperations class into the test project and removed
the last use of it in apache/lucene#528 by replacing
minimization of automata by just determinizing them, replacing
MinimizationOperations.minimize by Operations.determinize (see reasoning behind
this in the Lucene PR). This change follows the same pattern where we don't
already know the automaton is already deterministic.
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
Lucene 10 moved the MinizationOperations class into the test project and removed
the last use of it in apache/lucene#528 by replacing
minimization of automata by just determinizing them, replacing
MinimizationOperations.minimize by Operations.determinize (see reasoning behind
this in the Lucene PR). This change follows the same pattern where we don't
already know the automaton is already deterministic.
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Oct 11, 2024
Lucene 10 stopped relying in on automaton minimization and moved the underlying
Hopcroft algorithm to test code (for reasoning see
apache/lucene#528). With the upgrade to Lucene 10 we
currently also only determinize automata. The security Automatons utility class
currently contains several methods that sound like they would minimize the
automaton, but this has changed so this PR also changes the method names
accordingly.
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Oct 22, 2024
Lucene 10 stopped relying in on automaton minimization and moved the underlying
Hopcroft algorithm to test code (for reasoning see
apache/lucene#528). With the upgrade to Lucene 10 we
currently also only determinize automata. The security Automatons utility class
currently contains several methods that sound like they would minimize the
automaton, but this has changed so this PR also changes the method names
accordingly.
cbuescher added a commit to elastic/elasticsearch that referenced this pull request Oct 22, 2024
Lucene 10 stopped relying in on automaton minimization and moved the underlying
Hopcroft algorithm to test code (for reasoning see
apache/lucene#528). With the upgrade to Lucene 10 we
currently also only determinize automata. The security Automatons utility class
currently contains several methods that sound like they would minimize the
automaton, but this has changed so this PR also changes the method names
accordingly.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
Lucene 10 stopped relying in on automaton minimization and moved the underlying
Hopcroft algorithm to test code (for reasoning see
apache/lucene#528). With the upgrade to Lucene 10 we
currently also only determinize automata. The security Automatons utility class
currently contains several methods that sound like they would minimize the
automaton, but this has changed so this PR also changes the method names
accordingly.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
Lucene 10 stopped relying in on automaton minimization and moved the underlying
Hopcroft algorithm to test code (for reasoning see
apache/lucene#528). With the upgrade to Lucene 10 we
currently also only determinize automata. The security Automatons utility class
currently contains several methods that sound like they would minimize the
automaton, but this has changed so this PR also changes the method names
accordingly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants