query: Allow unlimited pending matches#1127
Conversation
Well, not completely unlimited — we're still using a 16-bit counter to keep track of them. But we longer have a static maximum of 32 pending matches when executing a query.
| assert_eq!(cursor.did_exceed_match_limit(), true); | ||
| }); | ||
| } | ||
| */ |
There was a problem hiding this comment.
This is only commented out temporarily; if we decide that we really want to go with this approach I'll clean up the test suite better.
|
This seems like a good change, if it doesn't slow things down. I do worry that if we're ever getting more than 32 concurrent matches, there may be something else going wrong inside the query execution. In a stack graph query, I would have thought that there would usually only be a few interleaved matches at a time. I'd be curious about what query patterns and tree structures are resulting in the match limit being exceeded, in case there are other bugs in the |
The issue we keep running into is how the number of matches depends not just on the query patterns, but on the depth of the syntax tree that we're matching against. There always ends up being one pattern that reserves a match slot, and then remains pending while we process all of its children. It seems to happen most often when trying to match against JavaScript object literals. Simplifying the queries only changes the depth at which we start to exceed the match limit. This is also one of the reasons for trying to separate the graph DSL stanzas into separate I'll see if we can pull out a minimal failing example. |
maxbrunsfeld
left a comment
There was a problem hiding this comment.
I think that regardless of why the match limit is being exceeded, we should make this change. I think though that the QueryCursor should still have the concept of a match limit, but it should be configurable, and default to being unlimited.
That way, for use cases like syntax highlighting (which uses ts_query_cursor_next_capture and is therefore more susceptible to excessive match buffering), we could still assign a fixed match limit, and get the current behavior where the performance is relatively constant.
Maybe a pair of APIs like like this?
uint32_t ts_query_cursor_match_limit(const TSQueryCursor *);
void ts_query_cursor_set_match_limit(TSQueryCursor *, uint32_t);Then, in those commented-out tests, we could add a call to QueryCursor::set_match_limit, and re-enable the tests. We could even compare the output of two .matches calls, before and after setting the match limit.
|
I like that suggestion, I'll take a stab at that in the next day or two! |
|
Here's a minimized form of some code that was causing trouble. I'll extract out the relevant rules asap. This comes from jquery's a(b => {
{
c: {
if (d) {
e({});
}
}
};
}); |
|
Here's the DSL rules that are relevant. Some might be unnecessary, but this pared down set produces the issue, because of overlapping partial matches. https://gist.github.com/BekaValentine/534b2346ca09378b345d63008781e3d3 |
The default is now a whopping 64K matches, which "should be enough for everyone". You can use the new `ts_query_cursor_set_match_limit` function to set this to a lower limit, such as the previous default of 32.
Ready for another round of review |
There was a problem hiding this comment.
This look excellent. A couple thoughts:
WASM test failures - Are you up for adding bindings to the new match limit APIs in the WASM library? In that library, we simplify the query API by using a single global TSQueryCursor. I think that we could still expose the match limit API though, via a new "options" argument to Query.matches and Query.captures:
const query = JavaScript.query("(identifier) @foo");
const captures = query.captures(tree.rootNode, {matchLimit: 32});
I think we could just always call ts_query_cursor_set_match_limit before calling ts_query_cursor_exec.
Match limit integer size - Would it remove some confusion from the API (and make it easier to document) if we allowed the match limit to range from 0 to UINT32_MAX? I'm not married to using a uint16_t to store the capture list id. I just checked in LLDB, and it appears that we can enlarge it to a uint32_t without changing the size of the QueryState struct:
--- a/lib/src/query.c
+++ b/lib/src/query.c
@@ -157,10 +157,10 @@ typedef struct {
*/
typedef struct {
uint32_t id;
+ uint32_t capture_list_id;
uint16_t start_depth;
uint16_t step_index;
uint16_t pattern_index;
- uint16_t capture_list_id;
uint16_t consumed_capture_count: 12;
bool seeking_immediate_match: 1;
bool has_in_progress_alternatives: 1;(lldb) p sizeof(QueryState)
(unsigned long) $0 = 16
Thanks @BekaValentine! I didn't take the time to extract out the pure query and run it, but just from looking at it, I can kinda see why the match limit is getting exceeded now. There are many distinct patterns that all capture the outer I think this PR is (hopefully) a great fix for the limitations you've been running into. |
|
I'm also excited about this PR because I think it will make it much easier to add randomized test coverage for the query engine. I recently started work on that, but the finite match limit makes it tricker than it would otherwise be. |
This exposes the new configurable match limits for query cursors.
|
@maxbrunsfeld Done! The only wrinkle is that the |
Oh yeah, I forgot about that. If we were doing this from scratch, I probably would have had them all be in one options object. With that structure, you could even support passing in the range restriction in terms of points and of character offsets: query.captures(tree.rootNode, {
startPosition: {row: 5, column: 0},
matchLimit: 32
});
query.captures(tree.rootNode, {
startIndex: 20,
endIndex: 70,
matchLimit: 64
}); But for now, I think what you did is good. We can maybe make a breaking API change later. |
maxbrunsfeld
left a comment
There was a problem hiding this comment.
Sweet! I left a few more minor notes.
| } QueryState; | ||
|
|
||
| typedef Array(TSQueryCapture) CaptureList; | ||
| typedef Array(CaptureList) CaptureListPoolEntry; |
There was a problem hiding this comment.
What do you think of eliminating this typedef? It seems like it's only used in one place; also the name suffix "Entry" makes me think that it's going to be represent one element of some collection, which isn't the case here.
| } | ||
|
|
||
| void ts_query_cursor_set_match_limit(TSQueryCursor *self, uint32_t limit) { | ||
| assert(limit > 0); |
There was a problem hiding this comment.
I'm kind of inclined to just allow zeros here. Even though it would cause a behavior that's not very useful (basically no matches allowed), I don't think it entails an unrecoverable invariant violation that merits a hard abort. Is that true, or would something bad happen if we allowed it?
There was a problem hiding this comment.
That should work — I had put it in because it seemed daft, not because I thought it would break anything.
| if (id >= MAX_CAPTURE_LIST_COUNT) return NONE; | ||
| self->usage_map &= ~bitmask_for_index(id); | ||
| array_clear(&self->list[id]); | ||
| return id; |
There was a problem hiding this comment.
Since we're not using the bitmask anymore, I think there's a header file called bits.h that we can 🔥.
I had started working on trying to detect the old API dynamically, based on whether the parameters contained a |
@maxbrunsfeld: I'm not sure whether this edit is a good or bad idea, but I can definitely say that it eliminates "match limit exceeded" errors for our JavaScript stack graph queries.
Did you choose a static limit on the capture list pool in an attempt to detect and prevent runaway queries? Or was it just an implementation optimization? If the latter, we've found that this dynamically allocated pool has had no effect on query performance. If it's the former, then this might remove an important guardrail and might not be a great idea.
There's also really two separate, orthogonal changes in this patch:
uint64_tinstead ofuint32_t).Even if we decide that the dynamic array isn't a good idea, I'd suggest we merge in the linear scan "find unused" search, since it means that library users can more easily play around with different fixed pool sizes.
/cc @BekaValentine