Skip to content

Remove Operations.isFinite#11813

Merged
rmuir merged 3 commits intoapache:mainfrom
rmuir:nuke_operations_isfinite
Sep 24, 2022
Merged

Remove Operations.isFinite#11813
rmuir merged 3 commits intoapache:mainfrom
rmuir:nuke_operations_isfinite

Conversation

@rmuir
Copy link
Member

@rmuir rmuir commented Sep 24, 2022

This method is recursive: to avoid eating too much stack we apply a small limit. This means it can't really be used on any largish automata without hitting exception.

But the benefit of knowing finite vs infinite in AutomatonTermsEnum is minor: let's not auto-compute this. FuzzyQuery still gets the finite optimization because its finite by definition. PrefixQuery is always infinite. Wildcard/Regex just assume infinite which is safe to do.

Remove the auto-computation and the "trillean" Boolean parameter. If you dont know that your automaton is finite, pass false to CompiledAutomaton, it is safe.

Move this method to AutomatonTestUtil so we can still use it in test asserts.

Closes #11809

This method is recursive: to avoid eating too much stack we apply a
small limit. This means it can't really be used on any largish automata
without hitting exception.

But the benefit of knowing finite vs infinite in AutomatonTermsEnum is
minor: let's not auto-compute this. FuzzyQuery still gets the finite
optimization because its finite by definition. PrefixQuery is always
infinite. Wildcard/Regex just assume infinite which is safe to do.

Remove the auto-computation and the "trillean" Boolean parameter. If you
dont know that your automaton is finite, pass false to
CompiledAutomaton, it is safe.

Move this method to AutomatonTestUtil so we can still use it in test
asserts.
@rmuir rmuir requested review from dweiss and mikemccand September 24, 2022 13:23
Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rmuir!

* @return TokenStream representation of automaton.
*/
public static TokenStream toTokenStream(Automaton automaton) {
if (Operations.isFinite(automaton) == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add a warning that this may run forever on an infinite automaton?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, ill fix this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the javadocs (scroll up just a bit more), it looks good?

  /**
   * converts an automaton into a TokenStream. This is done by first Topo sorting the nodes in the
   * Automaton. Nodes that have the same distance from the start are grouped together to form the
   * position nodes for the TokenStream. The resulting TokenStream releases edges from the automaton
   * as tokens in order from the position nodes. This requires the automaton be a finite DAG.
   *
   * @param automaton automaton to convert. Must be a finite DAG.
   * @return TokenStream representation of automaton.
   */
  public static TokenStream toTokenStream(Automaton automaton) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh super yes I agree!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I boosted it a little bit with an explanation mark and the words "infinite loop"

* is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}.
* Create this. If simplify is true, we run possibly expensive operations to determine if the
* automaton is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}. Set finite to true if
* the automaton is finite, otherwise set to false if infinite or you don't know.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user accidentally claims the automaton was finite but it is not, what happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume wrong answers or assertions :) Only FuzzyTermsEnum uses this optimization to avoid a little CPU/upkeeping for the fuzzy case, its really an opto for that. BlockTree intersection doesn't even look at it, i think.

We could probably tone down some of these CompiledAutomaton ctors to expose it less in the future. We just need a single expert ctor for Fuzzy? I was trying to minimize the scope of API changes here but can do more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK no worries -- no need to do more here! This change is already self-contained and a great progress!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this recursive method is "in the query path", the only remaining recursive method is sortTopoStates, which is less exposed (i think suggesters only).

ultimately it would be great to remove more of these "automatic" (sometimes costly) optimizations, maybe even remove CompiledAutomaton. we've been making progress.

but this Operations.isFinite is definitely the biggest problem and easiest win right now.

@rmuir
Copy link
Member Author

rmuir commented Sep 24, 2022

I'm planning on doing this 10.x-only, not out of laziness, but because there are already several related 10.x changes around this stuff: removal of det in #11049, removal of minimize in #11332, etc.

Not opposed to backporting this stuff to 9.x but we'd need to be careful.

@rmuir rmuir merged commit 15f3743 into apache:main Sep 24, 2022
@magibney
Copy link
Contributor

magibney commented Sep 24, 2022

fwiw I have a local branch (very recent) that changed the implementation of Operations.isFinite() (and Operations.topoSortStates()) to be non-recursive, afaict without sacrificing performance on common use cases. Hoping to push a PR soon.

EDIT: "without sacrificing performance on common use cases" => to be clear, I think the new impl has performance parity with the existing recursive impl across the board, but I haven't evaluated enough to be 100% confident. I can say with confidence that the performance is improved for some common cases, e.g. "prefix search" automaton, where each state as only one dest, and tail call optimizations can be taken.

cbuescher added a commit to elastic/elasticsearch that referenced this pull request Aug 28, 2024
The determinization work limit was removed from the contructor with
apache/lucene#485 and also its not not optional anymore
to pass in whether the automaton is finite or not. Assuming it is not seems to
be the right choice according to apache/lucene#11813
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

input automaton is too large for lengthy wildcard query

4 participants