fix(query): overflow in capture pool ids + some perf #5548

Merged: WillLillis merged 3 commits into tree-sitter:master from WillLillis:highlight_overflow on Apr 26, 2026

Conversation

@WillLillis

Member

#5011 describes a crash when highlighting a large Zig file. Amaan and I looked at this issue briefly when it was first opened and suspected the cause was the highlight crate's unsafe usage of various query structures (grep for `std::mem::transmute` and you'll see it). That unsafe code is necessary to allow for the highlight crate's peeking iterator behavior, which the updated streaming iterator provided by tree-sitter's Rust bindings does not allow. The worry was that fixing this bug would require a significant or complete rewrite of the highlight crate to use the new streaming iterators.

However, after digging into this issue more, the problem is actually an integer overflow in query.c. The capture pool uses 16-bit ids, but it really should allow for 32-bit ids. That change alone is enough to prevent the crash. If you re-run the repro with this fix, though, you may notice that it takes forever. This is because `ts_query_cursor_next_capture` does a linear scan over all finished query states to find the next capture. This is the perfect use case for a min-heap instead!
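To make the overflow concrete, here is a minimal sketch of how a 32-bit id gets silently narrowed when it passes through a function that still takes a 16-bit parameter. The function names `pool_slot_narrow`, `pool_slot_wide`, and `resolve_via_narrow_signature` are hypothetical and only stand in for the pool accessors; this is not the code from query.c.

```c
#include <stdint.h>

// Hypothetical stand-in for a pool accessor whose signature still takes a
// 16-bit id. It simply reports which slot it would look at.
uint32_t pool_slot_narrow(uint16_t id) { return id; }

// The same hypothetical accessor with the parameter widened to 32 bits.
uint32_t pool_slot_wide(uint32_t id) { return id; }

// The caller holds a 32-bit id. Passing it to the narrow accessor converts it
// implicitly (modulo 65536), typically without any compiler diagnostic at
// default warning levels.
uint32_t resolve_via_narrow_signature(uint32_t capture_list_id) {
  return pool_slot_narrow(capture_list_id);
}
```

Under these assumptions, an id of 65536 resolves to slot 0 through the narrow path, so two different states can end up pointing at the same capture list, while the wide path keeps them distinct.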

I've done some spot checking and this doesn't appear to introduce any regressions with small files, and it greatly helps when working with captures within a large file. For example, on my machine, highlighting the repro from #5011:

On master:

Segfaults after ~22 seconds.

With the first commit from this PR:

Finishes successfully in ~377 seconds. Preventing the segfault exposes this performance cliff.

With both commits from this PR:

Finishes successfully in ~5 seconds.

Downstream Considerations

I'm not super sure how/where to test the effects of this boost downstream. I assume most consumers have limits in place to prevent such a large number of finished states from accumulating, but if anyone has any ideas (CC @clason @ribru17) I'd love to get some more numbers here.

@WillLillis added the c-library, query, highlight, and perf labels on Apr 25, 2026
@amaanq force-pushed the highlight_overflow branch from 04c6876 to a5d4dd8 on April 26, 2026 08:30
@clason

clason commented Apr 26, 2026

Member

I think nvim-treesitter's indents function iterates over all captures in a file (since you may want to reindent the whole buffer); Nvim has a built-in (configurable) match limit though -- whose need this PR may obviate?

@clason

clason commented Apr 26, 2026

Member

But even for functions that restrict the cursor to a section, that happens line-wise so things like minified JSON would still be affected (modulo match limit, which can be disabled).

@clason

clason commented Apr 26, 2026

Member

Some history: #1127

WillLillis and others added 3 commits April 26, 2026 17:12
Commit 1f6eac5 ("query: Use uint32_t for capture list IDs") widened
QueryState.capture_list_id to uint32_t and removed the 65536 pool cap,
but left the pool function signatures as uint16_t. This caused silent
truncation when the pool exceeded 65535 entries, leading to a segfault.

`ts_query_cursor_next_capture` linearly scanned all finished states to
find the one with the earliest next capture byte offset. With deeply
nested code, this O(n) scan per capture caused the highlight crate to
hang for minutes on large files.

Replace the linear scan with a min-heap over the finished_states array,
keyed by (next_capture_byte_offset, pattern_index, id). The heap is
maintained lazily: ts_query_cursor__advance uses plain array_push
(preserving FIFO insertion order), and next_capture sifts new elements
into the heap on entry via a tracked heap_size boundary. This preserves
the documented "order found" guarantee for next_match while giving
next_capture O(log n) per call.
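The lazy heap maintenance described in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the code from query.c: the `State` and `FinishedStates` types, the fixed capacity of 64, and the functions `finished_push` and `finished_pop_min` are all hypothetical, and the consumer here simply pops the minimum rather than returning a capture and advancing the state.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, simplified stand-in for a finished query state.
typedef struct {
  uint32_t next_capture_byte_offset;
  uint16_t pattern_index;
  uint32_t id;
} State;

typedef struct {
  State items[64];     // fixed capacity keeps the sketch short
  uint32_t size;       // total elements: heap prefix plus FIFO tail
  uint32_t heap_size;  // items[0..heap_size) satisfy the heap property
} FinishedStates;

// Orders by (next_capture_byte_offset, pattern_index, id).
static bool state_less(const State *a, const State *b) {
  if (a->next_capture_byte_offset != b->next_capture_byte_offset)
    return a->next_capture_byte_offset < b->next_capture_byte_offset;
  if (a->pattern_index != b->pattern_index)
    return a->pattern_index < b->pattern_index;
  return a->id < b->id;
}

static void swap_states(State *a, State *b) { State t = *a; *a = *b; *b = t; }

static void sift_up(State *items, uint32_t i) {
  while (i > 0) {
    uint32_t parent = (i - 1) / 2;
    if (!state_less(&items[i], &items[parent])) break;
    swap_states(&items[i], &items[parent]);
    i = parent;
  }
}

static void sift_down(State *items, uint32_t size, uint32_t i) {
  for (;;) {
    uint32_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
    if (l < size && state_less(&items[l], &items[smallest])) smallest = l;
    if (r < size && state_less(&items[r], &items[smallest])) smallest = r;
    if (smallest == i) break;
    swap_states(&items[i], &items[smallest]);
    i = smallest;
  }
}

// Producer side: plain append, so the unabsorbed tail stays in insertion order.
void finished_push(FinishedStates *self, State s) {
  if (self->size < 64) self->items[self->size++] = s;
}

// Consumer side: absorb the tail into the heap, then remove the minimum.
bool finished_pop_min(FinishedStates *self, State *out) {
  while (self->heap_size < self->size) sift_up(self->items, self->heap_size++);
  if (self->size == 0) return false;
  *out = self->items[0];
  self->size--;
  self->heap_size--;
  self->items[0] = self->items[self->size];
  sift_down(self->items, self->size, 0);
  return true;
}
```

In this sketch the array is only reordered once the consumer asks for the minimum, so a caller that never does so sees elements in the order they were pushed. Each newly pushed element costs one O(log n) sift on the next pop, instead of every pop rescanning all n elements.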
@WillLillis merged commit 43dc8ea into tree-sitter:master on Apr 26, 2026
27 of 28 checks passed
@github-actions (bot) removed the request for review from maxbrunsfeld on April 26, 2026 22:04
@WillLillis deleted the highlight_overflow branch on April 26, 2026 22:50
@WillLillis added the ci:backport release-0.26 label on Apr 27, 2026
@tree-sitter-ci-bot

Successfully created backport PR for release-0.26:

Development

Successfully merging this pull request may close these issues.

Segfault in tree-sitter highlight on long Zig file

3 participants