Skip to content

query: Allow unlimited pending matches#1127

Merged
maxbrunsfeld merged 6 commits into
tree-sitter:masterfrom
dcreager:query-mempool
Jun 2, 2021
Merged

query: Allow unlimited pending matches#1127
maxbrunsfeld merged 6 commits into
tree-sitter:masterfrom
dcreager:query-mempool

Conversation

@dcreager

Copy link
Copy Markdown
Contributor

@maxbrunsfeld: I'm not sure whether this edit is a good or bad idea, but I can definitely say that it eliminates "match limit exceeded" errors for our JavaScript stack graph queries.

Did you choose a static limit on the capture list pool in an attempt to detect and prevent runaway queries? Or was it just an implementation optimization? If the latter, we've found that this dynamically allocated pool has had no effect on query performance. If it's the former, then this might remove an important guardrail and might not be a great idea.

There's also really two separate, orthogonal changes in this patch:

  • Using a dynamic instead of fixed array of capture pools.
  • Being less clever when finding an unused capture pool: this patch just does a linear scan through the pool, which is fast enough since (a) the pool isn't that big and (b) it's a cache-friendly search. The bitmask approach artificially limits us to a 32-element capture pool (64 if you update the CLZ implementation to work on uint64_t instead of uint32_t).

Even if we decide that the dynamic array isn't a good idea, I'd suggest we merge in the linear scan "find unused" search, since it means that library users can more easily play around with different fixed pool sizes.

/cc @BekaValentine

Well, not completely unlimited — we're still using a 16-bit counter to
keep track of them.  But we longer have a static maximum of 32 pending
matches when executing a query.
Comment thread cli/src/tests/query_test.rs Outdated
assert_eq!(cursor.did_exceed_match_limit(), true);
});
}
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only commented out temporarily; if we decide that we really want to go with this approach I'll clean up the test suite better.

@maxbrunsfeld

Copy link
Copy Markdown
Contributor

This seems like a good change, if it doesn't slow things down.

I do worry that if we're ever getting more than 32 concurrent matches, there may be something else going wrong inside the query execution. In a stack graph query, I would have thought that there would usually only be a few interleaved matches at a time. I'd be curious about what query patterns and tree structures are resulting in the match limit being exceeded, in case there are other bugs in the TSQueryCursor that we could fix. Do you think the high number of pending matches is expected?

@dcreager

Copy link
Copy Markdown
Contributor Author

I'd be curious about what query patterns and tree structures are resulting in the match limit being exceeded, in case there are other bugs in the TSQueryCursor that we could fix. Do you think the high number of pending matches is expected?

The issue we keep running into is how the number of matches depends not just on the query patterns, but on the depth of the syntax tree that we're matching against. There always ends up being one pattern that reserves a match slot, and then remains pending while we process all of its children. It seems to happen most often when trying to match against JavaScript object literals. Simplifying the queries only changes the depth at which we start to exceed the match limit.

This is also one of the reasons for trying to separate the graph DSL stanzas into separate TSQuery instances — on the hope that our queries are less likely to conflict with themselves, even with deep syntax trees.

I'll see if we can pull out a minimal failing example.

@maxbrunsfeld maxbrunsfeld left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that regardless of why the match limit is being exceeded, we should make this change. I think though that the QueryCursor should still have the concept of a match limit, but it should be configurable, and default to being unlimited.

That way, for use cases like syntax highlighting (which uses ts_query_cursor_next_capture and is therefore more susceptible to excessive match buffering), we could still assign a fixed match limit, and get the current behavior where the performance is relatively constant.

Maybe a pair of APIs like like this?

uint32_t ts_query_cursor_match_limit(const TSQueryCursor *);
void ts_query_cursor_set_match_limit(TSQueryCursor *, uint32_t);

Then, in those commented-out tests, we could add a call to QueryCursor::set_match_limit, and re-enable the tests. We could even compare the output of two .matches calls, before and after setting the match limit.

@dcreager

Copy link
Copy Markdown
Contributor Author

I like that suggestion, I'll take a stab at that in the next day or two!

@BekaValentine

BekaValentine commented May 26, 2021

Copy link
Copy Markdown
Contributor

Here's a minimized form of some code that was causing trouble. I'll extract out the relevant rules asap.

This comes from jquery's css.js file, after the error-causing parts are pulled out.

a(b => {
  {
    c: {
      if (d) {
        e({});
      }
    }
  };
});

@BekaValentine

Copy link
Copy Markdown
Contributor

Here's the DSL rules that are relevant. Some might be unnecessary, but this pared down set produces the issue, because of overlapping partial matches.

https://gist.github.com/BekaValentine/534b2346ca09378b345d63008781e3d3

The default is now a whopping 64K matches, which "should be enough for
everyone".  You can use the new `ts_query_cursor_set_match_limit`
function to set this to a lower limit, such as the previous default of
32.
@dcreager

dcreager commented Jun 2, 2021

Copy link
Copy Markdown
Contributor Author

Maybe a pair of APIs like like this?

Ready for another round of review

@maxbrunsfeld maxbrunsfeld left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look excellent. A couple thoughts:

WASM test failures - Are you up for adding bindings to the new match limit APIs in the WASM library? In that library, we simplify the query API by using a single global TSQueryCursor. I think that we could still expose the match limit API though, via a new "options" argument to Query.matches and Query.captures:

const query = JavaScript.query("(identifier) @foo");
const captures = query.captures(tree.rootNode, {matchLimit: 32});

I think we could just always call ts_query_cursor_set_match_limit before calling ts_query_cursor_exec.

Match limit integer size - Would it remove some confusion from the API (and make it easier to document) if we allowed the match limit to range from 0 to UINT32_MAX? I'm not married to using a uint16_t to store the capture list id. I just checked in LLDB, and it appears that we can enlarge it to a uint32_t without changing the size of the QueryState struct:

--- a/lib/src/query.c
+++ b/lib/src/query.c
@@ -157,10 +157,10 @@ typedef struct {
  */
 typedef struct {
   uint32_t id;
+  uint32_t capture_list_id;
   uint16_t start_depth;
   uint16_t step_index;
   uint16_t pattern_index;
-  uint16_t capture_list_id;
   uint16_t consumed_capture_count: 12;
   bool seeking_immediate_match: 1;
   bool has_in_progress_alternatives: 1;
(lldb) p sizeof(QueryState)
(unsigned long) $0 = 16

@maxbrunsfeld

Copy link
Copy Markdown
Contributor

Here's the DSL rules that are relevant. Some might be unnecessary, but this pared down set produces the issue, because of overlapping partial matches.

Thanks @BekaValentine! I didn't take the time to extract out the pure query and run it, but just from looking at it, I can kinda see why the match limit is getting exceeded now. There are many distinct patterns that all capture the outer object, along with different children, so there's fundamentally a fairly large number of interleaved capture lists.

I think this PR is (hopefully) a great fix for the limitations you've been running into.

@maxbrunsfeld

Copy link
Copy Markdown
Contributor

I'm also excited about this PR because I think it will make it much easier to add randomized test coverage for the query engine. I recently started work on that, but the finite match limit makes it tricker than it would otherwise be.

dcreager added 2 commits June 2, 2021 13:19
This exposes the new configurable match limits for query cursors.
@dcreager

dcreager commented Jun 2, 2021

Copy link
Copy Markdown
Contributor Author

@maxbrunsfeld Done!

The only wrinkle is that the Query.matches and Query.captures methods already had additional optional parameters, to control the range of the syntax tree to apply the query against. For simplicity, I just added the new options map as a third parameter — if you want to provide a matchLimit, you must explicitly pass in nulls as the start and end positions:

const captures = query.captures(tree.rootNode, null, null, {matchLimit: 32});

@maxbrunsfeld

Copy link
Copy Markdown
Contributor

if you want to provide a matchLimit, you must explicitly pass in nulls as the start and end positions

Oh yeah, I forgot about that. If we were doing this from scratch, I probably would have had them all be in one options object. With that structure, you could even support passing in the range restriction in terms of points and of character offsets:

query.captures(tree.rootNode, {
  startPosition: {row: 5, column: 0},
  matchLimit: 32
}); 

query.captures(tree.rootNode, {
  startIndex: 20,
  endIndex: 70,
  matchLimit: 64
}); 

But for now, I think what you did is good. We can maybe make a breaking API change later.

@maxbrunsfeld maxbrunsfeld left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! I left a few more minor notes.

Comment thread lib/src/query.c Outdated
} QueryState;

typedef Array(TSQueryCapture) CaptureList;
typedef Array(CaptureList) CaptureListPoolEntry;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of eliminating this typedef? It seems like it's only used in one place; also the name suffix "Entry" makes me think that it's going to be represent one element of some collection, which isn't the case here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread lib/src/query.c Outdated
}

void ts_query_cursor_set_match_limit(TSQueryCursor *self, uint32_t limit) {
assert(limit > 0);

@maxbrunsfeld maxbrunsfeld Jun 2, 2021

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of inclined to just allow zeros here. Even though it would cause a behavior that's not very useful (basically no matches allowed), I don't think it entails an unrecoverable invariant violation that merits a hard abort. Is that true, or would something bad happen if we allowed it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work — I had put it in because it seemed daft, not because I thought it would break anything.

Comment thread lib/src/query.c
if (id >= MAX_CAPTURE_LIST_COUNT) return NONE;
self->usage_map &= ~bitmask_for_index(id);
array_clear(&self->list[id]);
return id;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're not using the bitmask anymore, I think there's a header file called bits.h that we can 🔥.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dcreager

dcreager commented Jun 2, 2021

Copy link
Copy Markdown
Contributor Author

But for now, I think what you did is good. We can maybe make a breaking API change later

I had started working on trying to detect the old API dynamically, based on whether the parameters contained a row field or not, but it got convoluted pretty quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants