Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size by MichaReiser · Pull Request #25354 · astral-sh/ruff

MichaReiser · 2026-05-23T11:35:43Z

Summary

This PR plugs the existing allocate_tokens_vec(contents) heuristic into our parser. The idea of allocate_tokens_vec is to approximate the size of the allocations Vec based on the source text size instead of starting from an empty Vec, to reduce the number of regrows during lexing. Reducing the allocations helps with performance. However, this isn't a zero-cost optimization because it increases memory usage when we over-approximate. This PR adds a shrink_to_fit call once parsing's finished to mitigate the memory usage increase to peak-memory usage only. However, shrink_to_fit also isn't free; it can require a reallocation, depending on the amount of excess capacity. However, calling shrink_to_fit shows that it does reduce ty's memory usage.

The parser memory usage benchmarks look very concerning. I think they're a bit overdramatic. What they show is that peak memory usage increases, which is accurate. However, for most benchmarks, the number of allocations reduces. Which includes many ty full-project runs, which I consider more meaningful.

For most benchmarks, performance increases by 1-2%. There are very few for which performance decreases by 1-2%. There are two parser benchmarks that stand out: One for which performance increases by 10% and one for which performance decreases by 5%.

Overall, I'm leaning towards landing this change, because it allows us to reduce memory usage in ty, while still suggesting a small perf improvement over all.

astral-sh-bot · 2026-05-23T11:44:39Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

codspeed-hq · 2026-05-23T11:45:04Z

Merging this PR will not alter performance

✅ 123 untouched benchmarks

_{Comparing micha/view-benchmark-token-capacity (1e4aafa) with main (258ca11)}

astral-sh-bot · 2026-05-24T07:37:06Z

`ecosystem-analyzer` results

No diagnostic changes detected ✅

Full report with detailed diff (timing results)

astral-sh-bot · 2026-05-24T15:27:15Z

Memory usage report

Summary

Project	Old	New	Diff	Outcome
flake8	46.91MB	46.12MB	-1.69% (809.61kB)	⬇️
trio	112.89MB	111.64MB	-1.11% (1.25MB)	⬇️
sphinx	260.93MB	259.39MB	-0.59% (1.54MB)	⬇️
prefect	696.97MB	695.33MB	-0.24% (1.64MB)	⬇️

Significant changes

Click to expand detailed breakdown

flake8

Name	Old	New	Diff	Outcome
`parsed_module`	16.34MB	15.55MB	-4.84% (809.61kB)	⬇️

trio

Name	Old	New	Diff	Outcome
`parsed_module`	24.95MB	23.70MB	-5.01% (1.25MB)	⬇️

sphinx

Name	Old	New	Diff	Outcome
`parsed_module`	30.37MB	28.84MB	-5.06% (1.54MB)	⬇️

prefect

Name	Old	New	Diff	Outcome
`parsed_module`	31.98MB	30.35MB	-5.13% (1.64MB)	⬇️

dhruvmanila · 2026-05-25T10:24:11Z

I think another issue here is that the source that's using the approximate the capacity is always the complete source text, even when the start_offset is a non-zero value. This means that the linter will always over-allocation for multiple cases like stringified annotations, specific rule based lexing, etc. Could we try to take start_offset into account when computing the approximate capacity and see if that reduces / fixes the regression? Or, do you think this shouldn't affect it?

Edit: tried it in #25382, it seems to at least fix the linter benchmark

charliermarsh · 2026-05-25T15:19:28Z

(Oh nice, that's a good catch.)

MichaReiser · 2026-05-26T17:05:55Z

I updated the heuristic to reduce the instances where we allocate too much storage. I really like codex explanation here, which is why I copy paste some of it:

The optimization preallocates the parser's token buffer using only the remaining source length, without scanning source contents: This is intentionally an underestimate. The goal is not to predict the final token count precisely; it is to skip several small early Vec growth steps while avoiding the memory regressions caused by over-reserving.

Across 28,758 Python files sampled from Ruff ecosystem projects, a len / 8 estimate is at or below the actual stored-token count for approximately 77.5% of files. That makes it a useful conservative estimate: for most files, the buffer still grows if necessary, but starts far enough ahead to avoid several early reallocations.

Why underestimate?
Overestimating can make peak memory worse. A token-sparse file containing long strings, comments, or docstrings may have many source bytes but few tokens. If an arbitrary estimate reserves too much capacity, the parser can allocate a larger buffer than ordinary Vec growth would ever have reached.

Underestimating is cheaper: the buffer simply performs its normal later growth steps. We still receive most of the allocation reduction without turning uncommon sparse files into systematic memory regressions.

Why the minimum capacity cutoff?

For small files, preallocation has limited value. Reserving fewer than 128 tokens only skips small, cheap growth steps, while increasing the number of small files whose initial allocation exceeds their ordinary final capacity.

Using a 128-token cutoff keeps small files on normal Vec growth and concentrates the optimization on files where it saves meaningful work.

Why round down with ilog2()?
The power-of-two rounding aligns the initial capacity with the capacity buckets produced by geometric Vec growth.
For example, a file with 748 tokens normally grows to a capacity of 1024. An arbitrary initial capacity of 660 grows to 1320, increasing peak memory. Rounding the estimate down to 512 instead produces:
normal growth:       ... -> 512 -> 1024
preallocated growth: 512 -> 1024
The allocation starts later, but the final capacity remains in the same bucket as normal growth.

…formance by approximating the initial tokens vec size (astral-sh#25354) Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>

MichaReiser closed this May 23, 2026

MichaReiser reopened this May 23, 2026

MichaReiser changed the title ~~View benchmark results~~ Approximate token vec size to reduce regrows May 24, 2026

MichaReiser closed this May 24, 2026

MichaReiser reopened this May 24, 2026

MichaReiser added 5 commits May 24, 2026 17:24

Use estimated capacity for parser token source

5b636fa

Shrink token buffer after parsing

744cf83

Use conservative token capacity estimate

7aed0ac

Change to 10%

18746e8

Go up to 15% again

5879173

MichaReiser force-pushed the micha/view-benchmark-token-capacity branch from e02394f to 5879173 Compare May 24, 2026 15:24

MichaReiser added the parser Related to the parser label May 24, 2026

MichaReiser marked this pull request as ready for review May 24, 2026 16:21

MichaReiser requested a review from dhruvmanila as a code owner May 24, 2026 16:21

charliermarsh approved these changes May 25, 2026

View reviewed changes

Use remaining source for token capacity estimates (#25382)

2edc8c7

MichaReiser closed this May 26, 2026

MichaReiser reopened this May 26, 2026

dhruvmanila approved these changes May 26, 2026

View reviewed changes

Comment thread crates/ruff_python_parser/src/token_source.rs Outdated

MichaReiser commented May 26, 2026

View reviewed changes

Comment thread crates/ruff_python_parser/src/token_source.rs Outdated

MichaReiser added 2 commits May 26, 2026 14:21

Apply suggestion from @MichaReiser

68d03f8

Use token growth buckets for initial capacity

1e4aafa

MichaReiser changed the title ~~Approximate token vec size to reduce regrows~~ Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size May 26, 2026

MichaReiser merged commit 3856288 into main May 26, 2026
58 checks passed

MichaReiser deleted the micha/view-benchmark-token-capacity branch May 26, 2026 17:08

BrewTestBot mentioned this pull request May 28, 2026

ruff 0.15.15 Homebrew/homebrew-core#285097

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size#25354

Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size#25354
MichaReiser merged 8 commits into
mainfrom
micha/view-benchmark-token-capacity

MichaReiser commented May 23, 2026 •

edited

Loading

Uh oh!

astral-sh-bot Bot commented May 23, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented May 23, 2026 •

edited

Loading

Uh oh!

astral-sh-bot Bot commented May 24, 2026 •

edited

Loading

Uh oh!

astral-sh-bot Bot commented May 24, 2026 •

edited

Loading

flake8

trio

sphinx

prefect

Uh oh!

dhruvmanila commented May 25, 2026 •

edited

Loading

Uh oh!

charliermarsh commented May 25, 2026

Uh oh!

Uh oh!

Uh oh!

MichaReiser commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

MichaReiser commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

astral-sh-bot Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

Uh oh!

codspeed-hq Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

astral-sh-bot Bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ecosystem-analyzer results

Uh oh!

astral-sh-bot Bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Memory usage report

Summary

Significant changes

flake8

trio

sphinx

prefect

Uh oh!

dhruvmanila commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

charliermarsh commented May 25, 2026

Uh oh!

Uh oh!

Uh oh!

MichaReiser commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MichaReiser commented May 23, 2026 •

edited

Loading

astral-sh-bot Bot commented May 23, 2026 •

edited

Loading

`ruff-ecosystem` results

codspeed-hq Bot commented May 23, 2026 •

edited

Loading

astral-sh-bot Bot commented May 24, 2026 •

edited

Loading

`ecosystem-analyzer` results

astral-sh-bot Bot commented May 24, 2026 •

edited

Loading

dhruvmanila commented May 25, 2026 •

edited

Loading