Conversation
In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore. But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize() which is much crazier and much more. We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also we added this fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than it was before? So why minimize() ? Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time. This change moves hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm.
|
I'm waiting a bit on https://issues.apache.org/jira/browse/LUCENE-10299 . I don't expect any regression, but I don't want to confuse it with the |
|
LUCENE-10296: Stop minimizing regexps In current trunk, we let caller (e.g. But now that we have cleaned this up, we can see, what is happening is even worse than just calling We stopped doing this for all other Let's just replace This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner. I think the problem is understood, longs are insane for docids, I don't wish to hold changes up on stupid stuff.... |
Lucene 10 moved the MinizationOperations class into the test project and removed the last use of it in apache/lucene#528 by replacing minimization of automata by just determinizing them, replacing MinimizationOperations.minimize by Operations.determinize (see reasoning behind this in the Lucene PR). This change follows the same pattern where we don't already know the automaton is already deterministic.
Lucene 10 moved the MinizationOperations class into the test project and removed the last use of it in apache/lucene#528 by replacing minimization of automata by just determinizing them, replacing MinimizationOperations.minimize by Operations.determinize (see reasoning behind this in the Lucene PR). This change follows the same pattern where we don't already know the automaton is already deterministic.
Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.
Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.
Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.
Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.
Lucene 10 stopped relying in on automaton minimization and moved the underlying Hopcroft algorithm to test code (for reasoning see apache/lucene#528). With the upgrade to Lucene 10 we currently also only determinize automata. The security Automatons utility class currently contains several methods that sound like they would minimize the automaton, but this has changed so this PR also changes the method names accordingly.
In current trunk, we let caller (e.g.
RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore.But now that we have cleaned this up, we can see, what is happening is even worse than just calling
determinize(). We still callminimize()which is much crazier and much more.We stopped doing this for all other
AutomatonQuerysubclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when youfinishState()we do a lot of practical sorting/coalescing on-the-fly. The practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than with UTF-16 representation? So whyminimize()?Let's just replace
minimize()calls withdeterminize()calls? I've already swapped them out for all ofsrc/test, to get jenkins looking for issues ahead of time.This change moves Hopcroft minimization (
MinimizeOperations) tosrc/testfor now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm: that's a 2-liner.