LUCENE-10010 Introduce NFARunAutomaton to run NFA directly#225
LUCENE-10010 Introduce NFARunAutomaton to run NFA directly#225zhaih merged 21 commits intoapache:mainfrom
Conversation
3f72684 to
abe3ab1
Compare
mikemccand
left a comment
There was a problem hiding this comment.
This is super exciting! I'm amazed how little code you needed to get this first version running.
I left some comments about how we might share some code with RunAutomaton instead of copying/forking those methods, but I don't think that's a blocker issue.
I love the randomized test - make a random RegExp, find random strings it accepts and confirm the NFARunAutomaton accepts them too. And then generate random (mostly rejected) strings and confirm those are rejected too. And change the order of those two, to exercise the lazy caching.
Another way to make random NFA would be to 1) enumerate random strings, 2) make DFA from those strings (using the efficient builder API Lucene provides), and 3) simply duplicating some states in that DFA creating an NFA, and then 4) confirm that the resulting NFA only accepts that same subset of strings.
Finally, I am really curious about performance. If we take a DFA as it is executed today as baseline, and compare to this new NFA path (running a DFA), what is the overhead?
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
| if (stateSet.size() == 0) { | ||
| return null; | ||
| } | ||
| return new DState(stateSet.getArray()); |
There was a problem hiding this comment.
We can't just use StateSet.freeze here?
lucene/core/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
but a runautomaton for this won't run any queries on its own: brute force isn't how these queries actually work. the important part is the intersection (skipping around)... I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable |
OK yeah +1 to keep it wholly separate (full fork) for now until we learn more how this |
mikemccand
left a comment
There was a problem hiding this comment.
This looks amazing! I think it is close!
I love how you were able to slightly abstract out the RunAutomaton methods so that NFARunAutomaton could be swapped in instead, and no change is needed to AutomatonQuery.
But I think checking for determinizeWorkLimit==0 way down super low inside determinize is a bit too low level ...
I'm super eager to see how performance compares for NFA execution method (determinize lazily on demand) for AutomatonQuery versus the current "determinize up front" DFA approach.
lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/Stepable.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/TransitionAccessor.java
Show resolved
Hide resolved
| } | ||
| } | ||
| Automaton dfa = regExp.toDFA(); | ||
| NFARunAutomaton candidate = new NFARunAutomaton(regExp.toNFA()); |
There was a problem hiding this comment.
Another maybe powerful way to test NFA behavior would be to create any random DFA (e.g. make random set of strings and call that Daciuk/Mihov builder, or random RegExp.toDFA() like here, or maybe even randomly construct something state by state and transition by transition), then create a new test-only method that converts any DFA back into an NFA by randomly picking a DFA state and duplicating it (preserve all incoming and leaving transitions), or maybe by generating N strings accepted by the DFA and unioning them back into it. Do that N times so N states get duplicated. This should not alter the language accepted by the automaton, but should make it very "N".
Finally, from the DFA, randomly enumerate strings it accepts, and then assert the NFARunAutomaton also accepts them. And, sometimes randomly generate random strings that are not accepted by the DFA, and confirm the NFA also does not accept them.
| continue; | ||
| } | ||
| Query dfaQuery = new AutomatonQuery(new Term(FIELD), a); | ||
| Query nfaQuery = new AutomatonQuery(new Term(FIELD), a, 0); |
There was a problem hiding this comment.
It's very cool you are able to completely reuse AutomatonQuery to run either an NFA or DFA by using determinizeWorkLimit = 0 to mean "create an NFA", since you made the two interfaces. Versus forking and then requiring the user make an NFAAutomatonQuery.
But I think it's a little dangerous to take such a low-level approach? (Your // nocommit above). Because that approach means anyone who calls determinize with a 0 work limit will quietly get an NFA back, not just AutomatonQuery who knows how to handle the NFA properly. I.e. other places that call determinize expect to always get back a deterministic automaton? Maybe, instead, we could instead just change AutomatonQuery to take an optional determinizeUpFront boolean, which would default to true (preserving current behavior)?
Or, maybe keep the determinizeWorkLimit==0 to mean "use NFA", but move that if up from way down in determinize to higher up in AutomatonQuery?
mikemccand
left a comment
There was a problem hiding this comment.
I like this change! I think it is close.
I would love to know how this new "determinize on demand" performs, but we have to fix luceneutil to count Query init cost, and then find some "realistic" regexps to test.
But since we are not changing the behavior of AutomatonQuery (it still determinizes up-front by default), I don't think we need to block pushing this (once we iterate on all feedback) on benchmark results?
...ckward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
Outdated
Show resolved
Hide resolved
lucene/codecs/src/java/org/apache/lucene/codecs/uniformsplit/IntersectBlockReader.java
Outdated
Show resolved
Hide resolved
| runAutomaton = compiled.runAutomaton; | ||
| compiledAutomaton = compiled; | ||
| if (compiled.nfaRunAutomaton != null) { | ||
| this.runAutomaton = compiled.nfaRunAutomaton; |
There was a problem hiding this comment.
Maybe instead of having separate nfaRunAutomaton and runAutomaton we could have only runAutomaton and a separate CompiledAutomaton.isDeterminized boolean?
There was a problem hiding this comment.
I'm hesitating on doing that since there might be some method still rely on the truth that runAutomaton is of type RunAutomaton but not only using the method abstracted by ByteRunnable.
Also it might be a bit easier to spot if the nfaRunAutomaton is having problem at this stage? Probably we want to merge it after the NFA one is more mature?
There was a problem hiding this comment.
OK we can wait on this. Maybe just add a comment explaining why we need this odd if still, here and in the other places where we did this.
There was a problem hiding this comment.
I too would really prefer it if we can avoid the current "if". It starts to look like a "dance" in all the places where we do it. I don't understand about how it makes debugging easier, can't we just print automaton.isDeterministic() ?
There was a problem hiding this comment.
Hmmm I checked it again and it's a bit complex to merge the runAutomaton and nfaRunAutomaton into one, because previously runAutomaon is having public access and kind of wildly used, it's type directly appears in some public API such as QueryVisitor#consumesTermsMatching, so it might take another PR to try to merge them, I left a todo for now.
An alternative way I'm thinking about is to move those if inside the CompiledAutomaton, and only expose methods like getByteRunnable so that people don't manipulate them outside, I'll include that in next commit.
| * UTF32ToUTF8 conversion | ||
| * @param runnableType NFA or DFA | ||
| */ | ||
| public AutomatonQuery( |
There was a problem hiding this comment.
Cool, so the existing ctor remains, defaulting to DFA execution strategy, where the automaton is first fully determinized.
But now you add another ctor, letting users also ask for NFA execution, where the automaton is determinized lazily on-demand and only in those parts that the terms in this index need to visit.
lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
| continue; | ||
| } | ||
| AutomatonQuery dfaQuery = new AutomatonQuery(new Term(FIELD), a); | ||
| AutomatonQuery nfaQuery = new AutomatonQuery(new Term(FIELD), a, ByteRunnable.TYPE.NFA); |
There was a problem hiding this comment.
Could you add a new LuceneTestCase method, newAutomatonQuery, and it would randomly pick between NFA and DFA type? And then in a few pre-existing tests, let's call newAutomatonQuery instead of new AutomatonQuery?
That method could also randomly make the automaton non-deterministic by simple cloning a few states? We can do this in a follow-on issue.
There was a problem hiding this comment.
Good idea! Let's delay it a bit to the next PR maybe (I plan to introduce NFARegexQuery in next PR probably).
lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
Outdated
Show resolved
Hide resolved
mikemccand
left a comment
There was a problem hiding this comment.
I think this is getting really close! Thanks @zhaih!
| runAutomaton = compiled.runAutomaton; | ||
| compiledAutomaton = compiled; | ||
| if (compiled.nfaRunAutomaton != null) { | ||
| this.runAutomaton = compiled.nfaRunAutomaton; |
There was a problem hiding this comment.
OK we can wait on this. Maybe just add a comment explaining why we need this odd if still, here and in the other places where we did this.
...ckward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Determinize the automaton lazily on-demand as terms are intersected. This option saves the | ||
| * up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will | ||
| * be a bit slower |
There was a problem hiding this comment.
Missing period at the end of the sentence?
Maybe link to Russ Cox's famous page (https://swtch.com/~rsc/regexp/regexp1.html) and point out that this is similar to the Thompson NFA approach described there?
There was a problem hiding this comment.
Oh this page is linked in NFARunAutomaton's javadoc already :)
| public interface ByteRunnable { | ||
|
|
||
| /** NFA or DFA */ | ||
| enum TYPE { |
There was a problem hiding this comment.
Could you add @lucene.experimental to the TYPE javadocs, and on each of the options (NFA, DFA)?
Also, could you pull out this enum into its own class under o.a.l.util.automaton, maybe AutomatonExecutionMode or AutomatonIntersectionMode or AutomatonExecutionStrategy (getting longish to type...)? I think it is too obscure living down inside this non-consumable (to Lucene users who don't know all sorts of details about automata) ByteRunnable now.
There was a problem hiding this comment.
Good idea, I renamed it to RunAutomatonMode. How's that sounds like?
lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
Outdated
Show resolved
Hide resolved
|
Hmmm one of the test failed But I can't reproduce locally... |
|
The stack trace is interesting though - looks like access to a closed reader pool: |
|
I re-ran the jobs. The stack trace from the error does look suspicious though. |
|
Thanks @dweiss seems this is not the first time we see this error: https://issues.apache.org/jira/browse/LUCENE-9839 |
Looks like this is (scarily!) pre-existing. I don't think it should block this change, and I cannot see anything here that might cause that scary exception. |
mikemccand
left a comment
There was a problem hiding this comment.
Sorry for the slow reply here! The change looks awesome @zhaih. It defaults execution the current approach (DFA for up front determinize) but opens the option to do NFA instead, which determinizes on demand. I left a tiny comment but otherwise I think this is ready. I think we should target 9.x / main for this.
lucene/CHANGES.txt
Outdated
|
|
||
| New Features | ||
|
|
||
| * LUCENE-10010 Introduce NFARunAutomaton to run NFA directly. (Haoyu Zhai) |
There was a problem hiding this comment.
Maybe change to Patrick Zhai for consistency?
Can we at least run existing benchmarks to confirm it doesn't introduce regressions? Some of the code is performance sensitive and we are adding abstractions here. damn java. I'm still confused about the API, as it adds lots of booleans/enums. Do we really need enums, or can a simple boolean suffice? Does AutomatonQuery really need such boolean, or should it just look at |
|
I made a quick prototype with what i mean for the API: #485 The idea is that AutomatonQuery shouldn't be determinizing. Let's push this to the caller. If they pass it a DFA, it uses DFA algorithm. If they pass it NFA, it can use the NFA algorithm (it currently throws an exception in my branch, instead of slowly determinizing, that is the change). |
|
Thanks @rmuir, I'll run a benchmark to ensure this PR does not introduce regression recently. I like the approach you proposed in #485, it would be nice if we can get rid of I think we can try to get that pushed and then I can rebase this one after. |
|
@rmuir I've done the benchmark! (Sorry for the delay): and result looks plain (which is good for this change) |
… IntersectTermsEnum
…to "toDFA" since that changes too many files
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object o) { |
There was a problem hiding this comment.
EqualsGetClass: Overriding Object#equals in a non-final class by using getClass rather than instanceof breaks substitutability of subclasses. (details)
(at-me in a reply with help or ignore)
There was a problem hiding this comment.
I've recorded this as ignored for this pull request. If you change your mind, just comment @sonatype-lift unignore.
|
ok, see #513 which is another PR like the #485, just for |
|
ok, sorry, I realize the latest two PR-split-outs still don't solve your problem. The API is still ugly because we still I don't think we need to Instead, the decisions on this issue should just be about DFA or NFA! |
|
ok one more, but I think it sets us up even better: #528 |
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
| try { | ||
| regExp = new RegExp(AutomatonTestUtil.randomRegexp(random())); | ||
| } catch (IllegalArgumentException e) { | ||
| ignoreException(e); |
There was a problem hiding this comment.
You can also annotate the IllegalArgumentException with @SuppressWarnings("unused").
as far as the random regexp, I think it already does what you want? One tricky thing is, the strings it generates are only valid if you then build regexp with RegExp.NONE. That is probably what causes confusion here.
| runAutomaton = compiled.runAutomaton; | ||
| compiledAutomaton = compiled; | ||
| if (compiled.nfaRunAutomaton != null) { | ||
| this.runAutomaton = compiled.nfaRunAutomaton; |
There was a problem hiding this comment.
I too would really prefer it if we can avoid the current "if". It starts to look like a "dance" in all the places where we do it. I don't understand about how it makes debugging easier, can't we just print automaton.isDeterministic() ?
|
OK @rmuir some new commits are ready to be reviewed! Please take your time :) |
|
Thank you @rmuir and @mikemccand for reviewing this big PR! I'll merge it myself :) |
Description
https://issues.apache.org/jira/browse/LUCENE-10010
Introduces
NFARunAutomatonto run NFA directlyWorks to to:
RunAutomatonclass hierarchyNFARunAutomatonimplementationTests
A unit test that assert the NFARunAutomaton behaves the same as the DFA one by using random generated regex strings
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.