fix(linter/plugins): fix memory leak in tokens and comments #20477
Merged
graphite-app[bot] merged 1 commit into main on Mar 21, 2026
Conversation
Pull request overview
Reduces peak memory usage during linting by ensuring token/comment value slices (and regex pattern/flags) don’t keep entire source-text strings alive across files when pooled objects are reused.
Changes:
- Track indices of lazily deserialized tokens/comments and clear their `value` fields during `reset*` to release source-text slices.
- Clear regex descriptor `pattern`/`flags` (and `token.regex`) on reset to avoid retaining sliced strings from regex literals.
- Make `deserialize*IfNeeded` internal and route call sites through `getToken`/`getComment` so tracking is consistent.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| apps/oxlint/src-js/plugins/tokens.ts | Track deserialized token indices; clear token values + regex fields on reset; make deserializeTokenIfNeeded private. |
| apps/oxlint/src-js/plugins/tokens_methods.ts | Stop importing/using deserializeTokenIfNeeded; use getToken for range deserialization. |
| apps/oxlint/src-js/plugins/comments.ts | Track deserialized comment indices; clear comment values on reset; make deserializeCommentIfNeeded private. |
| apps/oxlint/src-js/plugins/comments_methods.ts | Stop importing/using deserializeCommentIfNeeded; use getComment for range deserialization. |
Fix a JS-side memory leak from tokens and comments.

## The problem

V8 uses a "SlicedString" optimization to avoid copying string data. `sourceText.slice(10, 20)` does not copy bytes 10..20 to a new string; it references the original string data of `sourceText` (much like `&str` in Rust).

#19978 and #20362 introduced caching and recycling of `Token` and `Comment` objects. This was a large performance gain. However, it contained a flaw: `Token` objects now live forever, and they contain little slices of the source text in their `value` property. This prevents the source text they reference from being garbage-collected, so it lives forever too (or at least until another file with an equal or greater number of tokens overwrites the `value` property of all the `Token` objects).

## Solution in this PR

Set the `value` property of all `Token`s and `Comment`s to `null` after linting a file, in `resetTokens` and `resetComments`.

## Implementation details

This requires some additional book-keeping to track which tokens have been deserialized and which haven't, so that `resetTokens` / `resetComments` do the minimum amount of work necessary: only resetting tokens/comments which actually had their `value` property set in the first place.

The indexes of deserialized tokens/comments are stored in two `Uint32Array`s, `deserializedTokenIndexes` and `deserializedCommentIndexes`. A simpler solution would have been to store the `Token` / `Comment` objects themselves in a plain `Array`, but `Uint32Array` has 3 advantages:

* 4 bytes per token/comment to store an index (`u32`), instead of 8 bytes to store a pointer to the `Token` / `Comment` itself.
* Faster to write to a `Uint32Array` - no write barrier checks.
* On GC runs, a `Uint32Array` is known not to contain references to any other heap objects, so GC tracing doesn't need to walk it, which it would with an `Array`.

`Uint32Array` does have one disadvantage over `Array`:

* 1 level of indirection in `resetTokens` / `resetComments`. You have to read the index, then index into the `cachedTokens` array to get the `Token` object, in order to reset it.

The latter only has a significant effect when there's a large number of tokens/comments to reset. We avoid it in the worst case, where *all* tokens have been deserialized (e.g. by accessing `ast.tokens`): when `allTokensDeserialized === true`, `resetTokens` just loops through `cachedTokens` itself and ignores the `Uint32Array` entirely.

## Future potential optimizations

It's probably not necessary to reset all tokens quite so aggressively when all tokens have been deserialized. If a rule has done that on one file, chances are it'll do it again on the next file, at which point most `Token`s get their `value` property overwritten with slices of *that* source text anyway. It's acceptable for source texts to hang around in memory for a bit, just not forever. We could choose to do token resetting only e.g. once every 10 files, to reduce the cost.

Clean-up operations could possibly also run in the next tick, so they happen in parallel while the Rust side is prepping the next file to send over to JS, rather than blocking.
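The book-keeping described above can be sketched roughly as follows. Names like `cachedTokens`, `deserializedTokenIndexes` and `allTokensDeserialized` come from the PR description; the exact shapes, the growth strategy, and the `getToken` signature are assumptions for illustration, not the real oxlint code:

```javascript
// Sketch of the index-tracking scheme, under assumed shapes.
const cachedTokens = []; // pooled Token objects, reused across files
let deserializedTokenIndexes = new Uint32Array(256);
let deserializedTokenCount = 0;
let allTokensDeserialized = false;

function trackDeserialized(index) {
  // Grow the Uint32Array when full (amortized doubling, like an Array push)
  if (deserializedTokenCount === deserializedTokenIndexes.length) {
    const bigger = new Uint32Array(deserializedTokenIndexes.length * 2);
    bigger.set(deserializedTokenIndexes);
    deserializedTokenIndexes = bigger;
  }
  deserializedTokenIndexes[deserializedTokenCount++] = index;
}

function getToken(index, sourceText, start, end) {
  let token = cachedTokens[index];
  if (token === undefined) {
    token = { value: null };
    cachedTokens[index] = token;
  }
  if (token.value === null) {
    // SlicedString: this keeps all of `sourceText` alive until cleared
    token.value = sourceText.slice(start, end);
    trackDeserialized(index);
  }
  return token;
}

function resetTokens() {
  if (allTokensDeserialized) {
    // Fast path: skip the indirection and walk the pool directly
    for (const token of cachedTokens) {
      if (token !== undefined) token.value = null;
    }
  } else {
    // Only touch tokens that actually had `value` set for this file
    for (let i = 0; i < deserializedTokenCount; i++) {
      cachedTokens[deserializedTokenIndexes[i]].value = null;
    }
  }
  deserializedTokenCount = 0;
  allTokensDeserialized = false;
}
```

Clearing `value` in `resetTokens` releases the last reference the pooled objects hold into the file's source text, so the whole string becomes collectible even though the `Token` objects themselves persist.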
graphite-app bot pushed a commit that referenced this pull request on Mar 21, 2026
Optimize storing regex tokens so they can have their `regex` property set back to `undefined` at the end of linting a file.

Previously we pushed the `Token`s into an `Array` and then reset the array with `tokensWithRegex.length = 0;` at the end. This had a problem: when you set an array's length to 0, V8 releases the whole backing allocation, which means that on the next file it has to allocate again. Worse, that new allocation is made in "new space", and if there's a lot of object creation in rules while linting the file, it will be copied during minor GC, and may even get promoted to "old space". All of that is expensive.

Instead, use a `Uint32Array` to store indexes, for the same reasons as in #20477.
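The difference can be sketched like this. The names (`cachedTokens`, `trackRegexToken`, `resetRegexTokens`) and growth strategy are hypothetical, chosen to illustrate the technique rather than mirror the real code:

```javascript
// Before (assumed shape): a plain Array, whose backing store V8 releases
// on `tokensWithRegex.length = 0`, forcing a fresh allocation per file.
//
// After: store indexes into the token pool in a Uint32Array whose backing
// allocation survives across files; only a counter is reset.
const cachedTokens = []; // pooled Token objects
let regexTokenIndexes = new Uint32Array(64);
let regexTokenCount = 0;

function trackRegexToken(index) {
  // Grow by doubling if full; the common case reuses the same buffer
  if (regexTokenCount === regexTokenIndexes.length) {
    const bigger = new Uint32Array(regexTokenIndexes.length * 2);
    bigger.set(regexTokenIndexes);
    regexTokenIndexes = bigger;
  }
  regexTokenIndexes[regexTokenCount++] = index;
}

function resetRegexTokens() {
  for (let i = 0; i < regexTokenCount; i++) {
    // Drop the regex descriptor so its pattern/flags string slices
    // no longer pin the file's source text
    cachedTokens[regexTokenIndexes[i]].regex = undefined;
  }
  // Reset only the counter; the Uint32Array allocation is kept for reuse
  regexTokenCount = 0;
}
```

Resetting just the counter avoids the new-space churn described above: nothing is freed, nothing is reallocated, and the typed array never needs to be traced by the GC.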
