Conversation
This method is recursive: to avoid eating too much stack we apply a small limit. This means it can't really be used on any largish automata without hitting exception. But the benefit of knowing finite vs infinite in AutomatonTermsEnum is minor: let's not auto-compute this. FuzzyQuery still gets the finite optimization because its finite by definition. PrefixQuery is always infinite. Wildcard/Regex just assume infinite which is safe to do. Remove the auto-computation and the "trillean" Boolean parameter. If you dont know that your automaton is finite, pass false to CompiledAutomaton, it is safe. Move this method to AutomatonTestUtil so we can still use it in test asserts.
| * @return TokenStream representation of automaton. | ||
| */ | ||
| public static TokenStream toTokenStream(Automaton automaton) { | ||
| if (Operations.isFinite(automaton) == false) { |
There was a problem hiding this comment.
Maybe we should add a warning that this may run forever on an infinite automaton?
There was a problem hiding this comment.
I looked at the javadocs (scroll up just a bit more), it looks good?
/**
* converts an automaton into a TokenStream. This is done by first Topo sorting the nodes in the
* Automaton. Nodes that have the same distance from the start are grouped together to form the
* position nodes for the TokenStream. The resulting TokenStream releases edges from the automaton
* as tokens in order from the position nodes. This requires the automaton be a finite DAG.
*
* @param automaton automaton to convert. Must be a finite DAG.
* @return TokenStream representation of automaton.
*/
public static TokenStream toTokenStream(Automaton automaton) {There was a problem hiding this comment.
I boosted it a little bit with an explanation mark and the words "infinite loop"
| * is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}. | ||
| * Create this. If simplify is true, we run possibly expensive operations to determine if the | ||
| * automaton is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}. Set finite to true if | ||
| * the automaton is finite, otherwise set to false if infinite or you don't know. |
There was a problem hiding this comment.
If a user accidentally claims the automaton was finite but it is not, what happens?
There was a problem hiding this comment.
I assume wrong answers or assertions :) Only FuzzyTermsEnum uses this optimization to avoid a little CPU/upkeeping for the fuzzy case, its really an opto for that. BlockTree intersection doesn't even look at it, i think.
We could probably tone down some of these CompiledAutomaton ctors to expose it less in the future. We just need a single expert ctor for Fuzzy? I was trying to minimize the scope of API changes here but can do more.
There was a problem hiding this comment.
OK no worries -- no need to do more here! This change is already self-contained and a great progress!
There was a problem hiding this comment.
yeah, this recursive method is "in the query path", the only remaining recursive method is sortTopoStates, which is less exposed (i think suggesters only).
ultimately it would be great to remove more of these "automatic" (sometimes costly) optimizations, maybe even remove CompiledAutomaton. we've been making progress.
but this Operations.isFinite is definitely the biggest problem and easiest win right now.
|
fwiw I have a local branch (very recent) that changed the implementation of EDIT: "without sacrificing performance on common use cases" => to be clear, I think the new impl has performance parity with the existing recursive impl across the board, but I haven't evaluated enough to be 100% confident. I can say with confidence that the performance is improved for some common cases, e.g. "prefix search" automaton, where each state as only one |
The determinization work limit was removed from the contructor with apache/lucene#485 and also its not not optional anymore to pass in whether the automaton is finite or not. Assuming it is not seems to be the right choice according to apache/lucene#11813
This method is recursive: to avoid eating too much stack we apply a small limit. This means it can't really be used on any largish automata without hitting exception.
But the benefit of knowing finite vs infinite in AutomatonTermsEnum is minor: let's not auto-compute this. FuzzyQuery still gets the finite optimization because its finite by definition. PrefixQuery is always infinite. Wildcard/Regex just assume infinite which is safe to do.
Remove the auto-computation and the "trillean" Boolean parameter. If you dont know that your automaton is finite, pass false to CompiledAutomaton, it is safe.
Move this method to AutomatonTestUtil so we can still use it in test asserts.
Closes #11809