query: Allow unlimited pending matches by dcreager · Pull Request #1127 · tree-sitter/tree-sitter

dcreager · 2021-05-24T15:09:08Z

@maxbrunsfeld: I'm not sure whether this edit is a good or bad idea, but I can definitely say that it eliminates "match limit exceeded" errors for our JavaScript stack graph queries.

Did you choose a static limit on the capture list pool in an attempt to detect and prevent runaway queries? Or was it just an implementation optimization? If the latter, we've found that this dynamically allocated pool has had no effect on query performance. If it's the former, then this might remove an important guardrail and might not be a great idea.

There are also really two separate, orthogonal changes in this patch:

- Using a dynamic instead of a fixed array of capture pools.
- Being less clever when finding an unused capture pool: this patch just does a linear scan through the pool, which is fast enough since (a) the pool isn't that big and (b) it's a cache-friendly search. The bitmask approach artificially limits us to a 32-element capture pool (64 if you update the CLZ implementation to work on uint64_t instead of uint32_t).

Even if we decide that the dynamic array isn't a good idea, I'd suggest we merge in the linear scan "find unused" search, since it means that library users can more easily play around with different fixed pool sizes.
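To make the "linear scan over a dynamic pool" idea concrete, here is a minimal sketch. It is not the code from this patch: `Pool`, `PoolSlot`, `pool_acquire`, and the other names are hypothetical, and each slot only tracks whether it is in use rather than holding a real capture list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_NONE UINT32_MAX

// Hypothetical stand-in for one capture list: only tracks whether it is in use.
typedef struct {
  bool in_use;
} PoolSlot;

// Hypothetical pool: a growable array of slots with an upper bound on its size.
typedef struct {
  PoolSlot *slots;
  uint32_t size;      // slots handed out at least once
  uint32_t capacity;  // slots currently allocated
  uint32_t max_size;  // hard upper bound on the number of slots
} Pool;

Pool pool_new(uint32_t max_size) {
  Pool pool = {NULL, 0, 0, max_size};
  return pool;
}

void pool_delete(Pool *self) {
  free(self->slots);
  self->slots = NULL;
  self->size = 0;
  self->capacity = 0;
}

// Returns the id of a free slot, or POOL_NONE if the pool is exhausted.
uint32_t pool_acquire(Pool *self) {
  // Linear scan: no bitmask, so the pool size is not tied to a word width.
  for (uint32_t i = 0; i < self->size; i++) {
    if (!self->slots[i].in_use) {
      self->slots[i].in_use = true;
      return i;
    }
  }
  if (self->size >= self->max_size) return POOL_NONE;
  if (self->size == self->capacity) {
    // Grow geometrically, clamped to max_size (also avoids uint32_t overflow).
    uint32_t new_capacity = self->capacity == 0 ? 8 : self->capacity;
    if (self->capacity != 0) {
      new_capacity = self->capacity > self->max_size / 2 ? self->max_size
                                                         : self->capacity * 2;
    }
    if (new_capacity > self->max_size) new_capacity = self->max_size;
    PoolSlot *grown = realloc(self->slots, (size_t)new_capacity * sizeof(PoolSlot));
    if (grown == NULL) return POOL_NONE;
    self->slots = grown;
    self->capacity = new_capacity;
  }
  self->slots[self->size].in_use = true;
  return self->size++;
}

void pool_release(Pool *self, uint32_t id) {
  if (id < self->size) self->slots[id].in_use = false;
}
```

With this shape, trying a different fixed pool size only means changing `max_size`; nothing else depends on the limit fitting into a 32-bit or 64-bit mask.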

/cc @BekaValentine

Well, not completely unlimited: we're still using a 16-bit counter to keep track of them. But we no longer have a static maximum of 32 pending matches when executing a query.

dcreager · 2021-05-24T15:10:00Z

        assert_eq!(cursor.did_exceed_match_limit(), true);
    });
 }
+*/


This is only commented out temporarily; if we decide that we really want to go with this approach I'll clean up the test suite better.

maxbrunsfeld · 2021-05-24T19:50:29Z

This seems like a good change, if it doesn't slow things down.

I do worry that if we're ever getting more than 32 concurrent matches, there may be something else going wrong inside the query execution. In a stack graph query, I would have thought that there would usually only be a few interleaved matches at a time. I'd be curious about what query patterns and tree structures are resulting in the match limit being exceeded, in case there are other bugs in the TSQueryCursor that we could fix. Do you think the high number of pending matches is expected?

dcreager · 2021-05-24T21:02:03Z

I'd be curious about what query patterns and tree structures are resulting in the match limit being exceeded, in case there are other bugs in the TSQueryCursor that we could fix. Do you think the high number of pending matches is expected?

The issue we keep running into is how the number of matches depends not just on the query patterns, but on the depth of the syntax tree that we're matching against. There always ends up being one pattern that reserves a match slot, and then remains pending while we process all of its children. It seems to happen most often when trying to match against JavaScript object literals. Simplifying the queries only changes the depth at which we start to exceed the match limit.
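The depth dependence described above can be illustrated with a toy model. This is only a sketch under a strong simplifying assumption: every node opens exactly one match that stays pending until all of its children have been visited. `ToyNode`, `toy_walk`, and `toy_peak_pending` are hypothetical names and have nothing to do with the real TSQueryCursor internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical syntax tree node: just an array of children.
typedef struct ToyNode {
  const struct ToyNode *children;
  size_t child_count;
} ToyNode;

typedef struct {
  uint32_t pending;  // matches currently open
  uint32_t peak;     // highest number of simultaneously open matches
} PendingStats;

// Assumption: each node reserves one match slot on entry and only gives it
// back after all of its children have been processed.
void toy_walk(const ToyNode *node, PendingStats *stats) {
  stats->pending++;
  if (stats->pending > stats->peak) stats->peak = stats->pending;
  for (size_t i = 0; i < node->child_count; i++) {
    toy_walk(&node->children[i], stats);
  }
  stats->pending--;
}

uint32_t toy_peak_pending(const ToyNode *root) {
  PendingStats stats = {0, 0};
  toy_walk(root, &stats);
  return stats.peak;
}
```

In this model the peak equals the nesting depth rather than the number of nodes, which is consistent with the observation that simplifying the queries only moves the depth at which a fixed limit is hit.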

This is also one of the reasons for trying to separate the graph DSL stanzas into separate TSQuery instances, in the hope that our queries are less likely to conflict with themselves, even with deep syntax trees.

I'll see if we can pull out a minimal failing example.

maxbrunsfeld

I think that regardless of why the match limit is being exceeded, we should make this change. I think though that the QueryCursor should still have the concept of a match limit, but it should be configurable, and default to being unlimited.

That way, for use cases like syntax highlighting (which uses ts_query_cursor_next_capture and is therefore more susceptible to excessive match buffering), we could still assign a fixed match limit, and get the current behavior where the performance is relatively constant.

Maybe a pair of APIs like this?

uint32_t ts_query_cursor_match_limit(const TSQueryCursor *);
void ts_query_cursor_set_match_limit(TSQueryCursor *, uint32_t);

Then, in those commented-out tests, we could add a call to QueryCursor::set_match_limit, and re-enable the tests. We could even compare the output of two .matches calls, before and after setting the match limit.
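As a rough illustration of how a configurable limit with an unlimited default could behave, here is a self-contained toy model. It does not use the tree-sitter library: `ToyCursor` and every `toy_cursor_*` function are hypothetical, and the policy shown (refuse a new match once the limit is reached) is an assumption; the real cursor may decide differently which match to give up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical cursor: tracks only the limit, the number of pending matches,
// and whether the limit was ever hit.
typedef struct {
  uint32_t match_limit;
  uint32_t pending;
  bool did_exceed_match_limit;
} ToyCursor;

ToyCursor toy_cursor_new(void) {
  // Assumption: UINT32_MAX stands for "effectively unlimited" by default.
  ToyCursor cursor = {UINT32_MAX, 0, false};
  return cursor;
}

uint32_t toy_cursor_match_limit(const ToyCursor *self) {
  return self->match_limit;
}

void toy_cursor_set_match_limit(ToyCursor *self, uint32_t limit) {
  // A limit of zero is accepted; it simply means no match can start.
  self->match_limit = limit;
}

// Tries to start a new pending match; records the overflow if it cannot.
bool toy_cursor_begin_match(ToyCursor *self) {
  if (self->pending >= self->match_limit) {
    self->did_exceed_match_limit = true;
    return false;
  }
  self->pending++;
  return true;
}

void toy_cursor_finish_match(ToyCursor *self) {
  if (self->pending > 0) self->pending--;
}
```

A test in this style can run the same workload twice, once with the default and once after calling the setter with a small limit, and compare both the results and the overflow flag.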

dcreager · 2021-05-26T18:57:26Z

I like that suggestion, I'll take a stab at that in the next day or two!

BekaValentine · 2021-05-26T22:00:03Z

Here's a minimized form of some code that was causing trouble. I'll extract out the relevant rules asap.

This comes from jquery's css.js file, after the error-causing parts are pulled out.

a(b => {
  {
    c: {
      if (d) {
        e({});
      }
    }
  };
});

BekaValentine · 2021-05-27T00:31:31Z

Here's the DSL rules that are relevant. Some might be unnecessary, but this pared down set produces the issue, because of overlapping partial matches.

https://gist.github.com/BekaValentine/534b2346ca09378b345d63008781e3d3

The default is now a whopping 64K matches, which "should be enough for everyone". You can use the new `ts_query_cursor_set_match_limit` function to set this to a lower limit, such as the previous default of 32.

dcreager · 2021-06-02T15:32:07Z

Maybe a pair of APIs like this?

Ready for another round of review

maxbrunsfeld

This looks excellent. A couple of thoughts:

WASM test failures - Are you up for adding bindings to the new match limit APIs in the WASM library? In that library, we simplify the query API by using a single global TSQueryCursor. I think that we could still expose the match limit API though, via a new "options" argument to Query.matches and Query.captures:

const query = JavaScript.query("(identifier) @foo");
const captures = query.captures(tree.rootNode, {matchLimit: 32});

I think we could just always call ts_query_cursor_set_match_limit before calling ts_query_cursor_exec.

Match limit integer size - Would it remove some confusion from the API (and make it easier to document) if we allowed the match limit to range from 0 to UINT32_MAX? I'm not married to using a uint16_t to store the capture list id. I just checked in LLDB, and it appears that we can enlarge it to a uint32_t without changing the size of the QueryState struct:

--- a/lib/src/query.c
+++ b/lib/src/query.c
@@ -157,10 +157,10 @@ typedef struct {
  */
 typedef struct {
   uint32_t id;
+  uint32_t capture_list_id;
   uint16_t start_depth;
   uint16_t step_index;
   uint16_t pattern_index;
-  uint16_t capture_list_id;
   uint16_t consumed_capture_count: 12;
   bool seeking_immediate_match: 1;
   bool has_in_progress_alternatives: 1;

(lldb) p sizeof(QueryState)
(unsigned long) $0 = 16
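The reason the size stays at 16 bytes can be seen from a simplified layout. The sketch below is not the real QueryState: the struct names are hypothetical and the bit-fields from the diff are collapsed into one 16-bit `flags` word, so the result does not depend on compiler-specific bit-field packing.

```c
#include <assert.h>
#include <stdint.h>

// Layout with a 16-bit capture list id: 4 + 5 * 2 = 14 bytes of fields,
// rounded up to 16 because the struct is aligned to its uint32_t member.
typedef struct {
  uint32_t id;
  uint16_t start_depth;
  uint16_t step_index;
  uint16_t pattern_index;
  uint16_t capture_list_id;
  uint16_t flags;  // stand-in for the 12-bit counter plus boolean bit-fields
} StateWithNarrowId;

// Layout with a 32-bit capture list id: 2 * 4 + 4 * 2 = 16 bytes, so the
// wider id just uses the two bytes that were previously padding.
typedef struct {
  uint32_t id;
  uint32_t capture_list_id;
  uint16_t start_depth;
  uint16_t step_index;
  uint16_t pattern_index;
  uint16_t flags;  // stand-in for the 12-bit counter plus boolean bit-fields
} StateWithWideId;

// Compile-time check that widening the id does not grow the struct.
_Static_assert(sizeof(StateWithNarrowId) == sizeof(StateWithWideId),
               "widening capture_list_id must not change the struct size");
```

On platforms with the usual natural alignment rules both structs come out at 16 bytes, matching the LLDB output above.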

maxbrunsfeld · 2021-06-02T16:06:12Z

Here's the DSL rules that are relevant. Some might be unnecessary, but this pared down set produces the issue, because of overlapping partial matches.

Thanks @BekaValentine! I didn't take the time to extract out the pure query and run it, but just from looking at it, I can kinda see why the match limit is getting exceeded now. There are many distinct patterns that all capture the outer object, along with different children, so there's fundamentally a fairly large number of interleaved capture lists.

I think this PR is (hopefully) a great fix for the limitations you've been running into.

maxbrunsfeld · 2021-06-02T16:35:34Z

I'm also excited about this PR because I think it will make it much easier to add randomized test coverage for the query engine. I recently started work on that, but the finite match limit makes it trickier than it would otherwise be.

This exposes the new configurable match limits for query cursors.

dcreager · 2021-06-02T17:54:40Z

@maxbrunsfeld Done!

The only wrinkle is that the Query.matches and Query.captures methods already had additional optional parameters, to control the range of the syntax tree to apply the query against. For simplicity, I just added the new options map as a third parameter — if you want to provide a matchLimit, you must explicitly pass in nulls as the start and end positions:

tree-sitter/lib/binding_web/test/query-test.js

Line 259 in ad3907c

const captures = query.captures(tree.rootNode, null, null, {matchLimit: 32});

maxbrunsfeld · 2021-06-02T18:00:16Z

if you want to provide a matchLimit, you must explicitly pass in nulls as the start and end positions

Oh yeah, I forgot about that. If we were doing this from scratch, I probably would have had them all be in one options object. With that structure, you could even support passing in the range restriction in terms of points and of character offsets:

query.captures(tree.rootNode, {
  startPosition: {row: 5, column: 0},
  matchLimit: 32
}); 

query.captures(tree.rootNode, {
  startIndex: 20,
  endIndex: 70,
  matchLimit: 64
});

But for now, I think what you did is good. We can maybe make a breaking API change later.

maxbrunsfeld

Sweet! I left a few more minor notes.

maxbrunsfeld · 2021-06-02T18:03:51Z

 } QueryState;

 typedef Array(TSQueryCapture) CaptureList;
+typedef Array(CaptureList) CaptureListPoolEntry;


What do you think of eliminating this typedef? It seems like it's only used in one place; also the name suffix "Entry" makes me think that it's going to represent one element of some collection, which isn't the case here.

maxbrunsfeld · 2021-06-02T18:05:08Z

+}
+
+void ts_query_cursor_set_match_limit(TSQueryCursor *self, uint32_t limit) {
+  assert(limit > 0);


I'm kind of inclined to just allow zeros here. Even though it would cause a behavior that's not very useful (basically no matches allowed), I don't think it entails an unrecoverable invariant violation that merits a hard abort. Is that true, or would something bad happen if we allowed it?

That should work — I had put it in because it seemed daft, not because I thought it would break anything.

maxbrunsfeld · 2021-06-02T18:06:50Z

-  if (id >= MAX_CAPTURE_LIST_COUNT) return NONE;
-  self->usage_map &= ~bitmask_for_index(id);
-  array_clear(&self->list[id]);
-  return id;


Since we're not using the bitmask anymore, I think there's a header file called bits.h that we can 🔥.

dcreager · 2021-06-02T18:18:30Z

But for now, I think what you did is good. We can maybe make a breaking API change later

I had started working on trying to detect the old API dynamically, based on whether the parameters contained a row field or not, but it got convoluted pretty quickly.

query: Allow unlimited pending matches

7801072

Well, not completely unlimited: we're still using a 16-bit counter to keep track of them. But we no longer have a static maximum of 32 pending matches when executing a query.

dcreager commented May 24, 2021

maxbrunsfeld reviewed May 26, 2021

query: Allow configurable match limit

cd96552

The default is now a whopping 64K matches, which "should be enough for everyone". You can use the new `ts_query_cursor_set_match_limit` function to set this to a lower limit, such as the previous default of 32.

dcreager force-pushed the query-mempool branch from 25dca62 to cd96552 Compare June 2, 2021 15:31

maxbrunsfeld reviewed Jun 2, 2021

dcreager added 2 commits June 2, 2021 13:19

query: Use uint32_t for capture list IDs

1f6eac5

wasm: Add matchLimit option to query methods

ad3907c

This exposes the new configurable match limits for query cursors.

maxbrunsfeld reviewed Jun 2, 2021

dcreager added 2 commits June 2, 2021 14:14

query: Remove bits.h

47f1af8

query: Minor cleanups

cc20708

maxbrunsfeld approved these changes Jun 2, 2021

maxbrunsfeld merged commit 82f3d32 into tree-sitter:master Jun 2, 2021

dcreager deleted the query-mempool branch June 2, 2021 18:58

stsewd mentioned this pull request Jun 26, 2021

Treesitter 0.19.5 unlimited pending matches makes Neovim unusable when active neovim/neovim#14897

Closed

clason mentioned this pull request May 25, 2024

feat(treesitter): add tsmatchlimit options neovim/neovim#26563

Closed

clason mentioned this pull request Apr 26, 2026

fix(query): overflow in capture pool ids + some perf #5548

Merged
