fix(linter/plugins): fix memory leak in tokens and comments#20477

Merged
graphite-app[bot] merged 1 commit into main from om/03-17-fix_linter_plugins_fix_memory_leak_in_tokens_and_comments
Mar 21, 2026

Conversation

@overlookmotel (Member) commented Mar 17, 2026

Fix a JS-side memory leak from tokens and comments.

The problem

V8 uses a "SlicedString" optimization to avoid copying string data. `sourceText.slice(10, 20)` does not copy bytes 10..20 to a new string, but references the original string data of `sourceText` (much like `&str` in Rust).
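As an illustration of the hazard this creates (a minimal sketch with hypothetical names; SlicedString is a V8 internal representation, so the sharing itself is not observable from JS, only its memory effect):

```typescript
// `sourceText` stands in for a large linted file's source.
const sourceText = "x".repeat(1_000_000) + "identifier";

// V8 may represent this 10-char slice as a SlicedString: a (parent, offset,
// length) view over `sourceText`'s character data, with no copy made.
const tokenValue = sourceText.slice(sourceText.length - 10);

// A long-lived pooled object holding the slice keeps the full ~1MB parent
// string's data reachable, even after `sourceText` itself goes out of scope.
const pooledToken: { value: string | null } = { value: tokenValue };

// Nulling the reference lets GC reclaim the parent string's data.
pooledToken.value = null;
```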

#19978 and #20362 introduced caching and recycling of Token and Comment objects. This was a large performance gain.

However, it contained a flaw: `Token` objects now live forever, and they contain little slices of the source text in their `value` property. This means they prevent the source text which they reference from being garbage-collected, so it lives forever too (or at least until another file with an equal or greater number of tokens overwrites the `value` property of all the `Token` objects).

Solution in this PR

Fix this by setting the `value` property of all `Token`s and `Comment`s to `null` after linting a file, in `resetTokens` and `resetComments`.

Implementation details

This requires some additional book-keeping to track which tokens have been deserialized and which haven't, so that `resetTokens` / `resetComments` do the minimum amount of work necessary: only resetting tokens / comments which actually had their `value` property set in the first place.

The indexes of tokens are stored in two `Uint32Array`s, `deserializedTokenIndexes` and `deserializedCommentIndexes`. A simpler solution would have been to store the `Token` / `Comment` objects themselves in a plain `Array`, but `Uint32Array` has three advantages:

  • 4 bytes to store an index (`u32`) per token/comment, instead of 8 bytes to store a pointer to the `Token` / `Comment` itself.
  • Faster to write to a `Uint32Array`: no write barrier checks.
  • On GC runs, a `Uint32Array` is known not to contain references to any other heap objects, so GC tracing doesn't need to walk it, which it would with an `Array`.

`Uint32Array` does have one disadvantage over `Array`:

  • 1 level of indirection in `resetTokens` / `resetComments`: you have to read the index, then index into the `cachedTokens` array to get the `Token` object, in order to reset it.

The latter will only have a significant effect when there's a large number of tokens/comments to reset. We avoid this in the worst case, where all tokens have been deserialized (e.g. by accessing `ast.tokens`). When `allTokensDeserialized === true`, `resetTokens` just loops through `cachedTokens` itself and ignores the `Uint32Array` entirely.

Future potential optimizations

It's probably not necessary to reset all tokens quite so aggressively when all tokens have been deserialized. If a rule has done that on one file, chances are it'll do it again on the next file, at which point most `Token`s get their `value` property overwritten with slices of that source text anyway.

It's acceptable for source texts to hang around in memory for a bit, just not forever. We could choose to do token resetting only once every 10 files, say, to reduce the cost.
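A minimal sketch of that every-N-files idea (`maybeResetTokens`, `RESET_INTERVAL`, and `filesSinceReset` are hypothetical names, not from the PR; stale slices would survive a few files, but remain bounded rather than living forever):

```typescript
const RESET_INTERVAL = 10; // hypothetical: reset once every 10 files
let filesSinceReset = 0;

// Called after linting each file; only performs the (relatively expensive)
// token reset on every RESET_INTERVAL-th call.
function maybeResetTokens(reset: () => void): void {
  if (++filesSinceReset >= RESET_INTERVAL) {
    reset();
    filesSinceReset = 0;
  }
}
```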

Clean-up operations could possibly also run on the next tick, so they happen in parallel while the Rust side is preparing the next file to send over to JS, rather than blocking.

@overlookmotel (Member, Author) commented Mar 17, 2026


How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent changes, fast-track this PR to the front of the merge queue

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copilot AI (Contributor) left a comment

Pull request overview

Reduces peak memory usage during linting by ensuring token/comment value slices (and regex pattern/flags) don’t keep entire source-text strings alive across files when pooled objects are reused.

Changes:

  • Track indices of lazily deserialized tokens/comments and clear their `value` fields during `reset*` to release source-text slices.
  • Clear regex descriptor `pattern`/`flags` (and `token.regex`) on reset to avoid retaining sliced strings from regex literals.
  • Make `deserialize*IfNeeded` internal and route call sites through `getToken`/`getComment` so tracking is consistent.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Files changed:

  • apps/oxlint/src-js/plugins/tokens.ts: Track deserialized token indices; clear token values + regex fields on reset; make `deserializeTokenIfNeeded` private.
  • apps/oxlint/src-js/plugins/tokens_methods.ts: Stop importing/using `deserializeTokenIfNeeded`; use `getToken` for range deserialization.
  • apps/oxlint/src-js/plugins/comments.ts: Track deserialized comment indices; clear comment values on reset; make `deserializeCommentIfNeeded` private.
  • apps/oxlint/src-js/plugins/comments_methods.ts: Stop importing/using `deserializeCommentIfNeeded`; use `getComment` for range deserialization.

@graphite-app graphite-app bot added the 0-merge Merge with Graphite Merge Queue label Mar 20, 2026
@overlookmotel overlookmotel removed the 0-merge Merge with Graphite Merge Queue label Mar 20, 2026
@overlookmotel overlookmotel added the 0-merge Merge with Graphite Merge Queue label Mar 20, 2026 — with Graphite App
@overlookmotel overlookmotel force-pushed the om/03-17-refactor_linter_plugins_simplify_token_types branch from e097ce7 to 4224e9e on March 21, 2026 12:20
@overlookmotel overlookmotel force-pushed the om/03-17-fix_linter_plugins_fix_memory_leak_in_tokens_and_comments branch from f7bf3dc to 1d06d2c on March 21, 2026 12:20
@graphite-app graphite-app bot (Contributor) commented Mar 21, 2026

Merge activity

@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label Mar 21, 2026
@overlookmotel overlookmotel force-pushed the om/03-17-fix_linter_plugins_fix_memory_leak_in_tokens_and_comments branch from 1d06d2c to c00f3e6 on March 21, 2026 12:32
@overlookmotel overlookmotel force-pushed the om/03-17-refactor_linter_plugins_simplify_token_types branch from 4224e9e to 8b4b4c0 on March 21, 2026 12:32
@overlookmotel overlookmotel added the 0-merge Merge with Graphite Merge Queue label Mar 21, 2026
@graphite-app graphite-app bot changed the base branch from om/03-17-refactor_linter_plugins_simplify_token_types to graphite-base/20477 March 21, 2026 12:42
@graphite-app graphite-app bot changed the base branch from graphite-base/20477 to main March 21, 2026 12:45
@graphite-app graphite-app bot force-pushed the om/03-17-fix_linter_plugins_fix_memory_leak_in_tokens_and_comments branch from c00f3e6 to c3d9e91 on March 21, 2026 12:46
graphite-app bot pushed a commit that referenced this pull request Mar 21, 2026
Optimize storing regex tokens so they can have their `regex` property set back to `undefined` at the end of linting a file. Previously we pushed the `Token`s into an `Array` and then reset the array with `tokensWithRegex.length = 0;` at the end.

This had a problem: When you set an array's length to 0, V8 releases the whole backing allocation, which means that on the next file, it has to allocate again. Worse, that new allocation is made in "new space", and if there's a lot of object creation in rules while linting the file, it will be copied during minor GC, and may even get graduated to "old space". All of that is expensive.

Instead, use a `Uint32Array` to store indexes, for the same reasons as in #20477.
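That pattern could be sketched roughly as follows (hypothetical names and pool size; the actual implementation in `tokens.ts` differs in detail):

```typescript
interface RegexToken { regex: { pattern: string; flags: string } | undefined }

// Pool of recycled token objects, reused across files.
const cachedTokens: RegexToken[] = Array.from({ length: 256 }, () => ({ regex: undefined }));

// A preallocated Uint32Array of indexes, instead of `tokensWithRegex: Token[]`.
// Setting an Array's length to 0 makes V8 drop the backing store, forcing a
// fresh "new space" allocation (and possible GC copying) on the next file;
// this buffer is written in place and never reallocated.
const regexTokenIndexes = new Uint32Array(256);
let regexTokenCount = 0;

function recordRegexToken(index: number, pattern: string, flags: string): void {
  cachedTokens[index].regex = { pattern, flags };
  regexTokenIndexes[regexTokenCount++] = index;
}

// At the end of linting a file, set `regex` back to `undefined` on only
// the tokens that were recorded, then reuse the same index buffer.
function resetRegexTokens(): void {
  for (let i = 0; i < regexTokenCount; i++) {
    cachedTokens[regexTokenIndexes[i]].regex = undefined;
  }
  regexTokenCount = 0;
}
```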
@graphite-app graphite-app bot merged commit c3d9e91 into main Mar 21, 2026
23 checks passed
@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label Mar 21, 2026
@graphite-app graphite-app bot deleted the om/03-17-fix_linter_plugins_fix_memory_leak_in_tokens_and_comments branch March 21, 2026 12:50

Labels

A-cli Area - CLI A-linter Area - Linter A-linter-plugins Area - Linter JS plugins C-bug Category - Bug
