Skip to content

Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size#25354

Merged
MichaReiser merged 8 commits into
mainfrom
micha/view-benchmark-token-capacity
May 26, 2026
Merged

Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size#25354
MichaReiser merged 8 commits into
mainfrom
micha/view-benchmark-token-capacity

Conversation

@MichaReiser

@MichaReiser MichaReiser commented May 23, 2026

Copy link
Copy Markdown
Member

Summary

This PR plugs the existing allocate_tokens_vec(contents) heuristic into our parser. The idea of allocate_tokens_vec is to approximate the size of the allocations Vec based on the source text size instead of starting from an empty Vec, to reduce the number of regrows during lexing. Reducing the allocations helps with performance. However, this isn't a zero-cost optimization because it increases memory usage when we over-approximate. This PR adds a shrink_to_fit call once parsing's finished to mitigate the memory usage increase to peak-memory usage only. However, shrink_to_fit also isn't free; it can require a reallocation, depending on the amount of excess capacity. However, calling shrink_to_fit shows that it does reduce ty's memory usage.

The parser memory usage benchmarks look very concerning. I think they're a bit overdramatic. What they show is that peak memory usage increases, which is accurate. However, for most benchmarks, the number of allocations reduces. Which includes many ty full-project runs, which I consider more meaningful.

For most benchmarks, performance increases by 1-2%. There are very few for which performance decreases by 1-2%. There are two parser benchmarks that stand out: One for which performance increases by 10% and one for which performance decreases by 5%.

Overall, I'm leaning towards landing this change, because it allows us to reduce memory usage in ty, while still suggesting a small perf improvement over all.

@astral-sh-bot

astral-sh-bot Bot commented May 23, 2026

Copy link
Copy Markdown

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@codspeed-hq

codspeed-hq Bot commented May 23, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 123 untouched benchmarks


Comparing micha/view-benchmark-token-capacity (1e4aafa) with main (258ca11)

Open in CodSpeed

@MichaReiser MichaReiser reopened this May 23, 2026
@MichaReiser MichaReiser changed the title View benchmark results Approximate token vec size to reduce regrows May 24, 2026
@MichaReiser MichaReiser reopened this May 24, 2026
@astral-sh-bot

astral-sh-bot Bot commented May 24, 2026

Copy link
Copy Markdown

ecosystem-analyzer results

No diagnostic changes detected ✅

Full report with detailed diff (timing results)

@MichaReiser MichaReiser force-pushed the micha/view-benchmark-token-capacity branch from e02394f to 5879173 Compare May 24, 2026 15:24
@astral-sh-bot

astral-sh-bot Bot commented May 24, 2026

Copy link
Copy Markdown

Memory usage report

Summary

Project Old New Diff Outcome
flake8 46.91MB 46.12MB -1.69% (809.61kB) ⬇️
trio 112.89MB 111.64MB -1.11% (1.25MB) ⬇️
sphinx 260.93MB 259.39MB -0.59% (1.54MB) ⬇️
prefect 696.97MB 695.33MB -0.24% (1.64MB) ⬇️

Significant changes

Click to expand detailed breakdown

flake8

Name Old New Diff Outcome
parsed_module 16.34MB 15.55MB -4.84% (809.61kB) ⬇️

trio

Name Old New Diff Outcome
parsed_module 24.95MB 23.70MB -5.01% (1.25MB) ⬇️

sphinx

Name Old New Diff Outcome
parsed_module 30.37MB 28.84MB -5.06% (1.54MB) ⬇️

prefect

Name Old New Diff Outcome
parsed_module 31.98MB 30.35MB -5.13% (1.64MB) ⬇️

@MichaReiser MichaReiser added the parser Related to the parser label May 24, 2026
@MichaReiser MichaReiser marked this pull request as ready for review May 24, 2026 16:21
@MichaReiser MichaReiser requested a review from dhruvmanila as a code owner May 24, 2026 16:21
@dhruvmanila

dhruvmanila commented May 25, 2026

Copy link
Copy Markdown
Member

I think another issue here is that the source that's using the approximate the capacity is always the complete source text, even when the start_offset is a non-zero value. This means that the linter will always over-allocation for multiple cases like stringified annotations, specific rule based lexing, etc. Could we try to take start_offset into account when computing the approximate capacity and see if that reduces / fixes the regression? Or, do you think this shouldn't affect it?

Edit: tried it in #25382, it seems to at least fix the linter benchmark

@charliermarsh

Copy link
Copy Markdown
Member

(Oh nice, that's a good catch.)

@MichaReiser MichaReiser reopened this May 26, 2026
Comment thread crates/ruff_python_parser/src/token_source.rs Outdated
Comment thread crates/ruff_python_parser/src/token_source.rs Outdated
@MichaReiser

Copy link
Copy Markdown
Member Author

I updated the heuristic to reduce the instances where we allocate too much storage. I really like codex explanation here, which is why I copy paste some of it:

The optimization preallocates the parser's token buffer using only the remaining source length, without scanning source contents: This is intentionally an underestimate. The goal is not to predict the final token count precisely; it is to skip several small early Vec growth steps while avoiding the memory regressions caused by over-reserving.

Across 28,758 Python files sampled from Ruff ecosystem projects, a len / 8 estimate is at or below the actual stored-token count for approximately 77.5% of files. That makes it a useful conservative estimate: for most files, the buffer still grows if necessary, but starts far enough ahead to avoid several early reallocations.

Why underestimate?
Overestimating can make peak memory worse. A token-sparse file containing long strings, comments, or docstrings may have many source bytes but few tokens. If an arbitrary estimate reserves too much capacity, the parser can allocate a larger buffer than ordinary Vec growth would ever have reached.

Underestimating is cheaper: the buffer simply performs its normal later growth steps. We still receive most of the allocation reduction without turning uncommon sparse files into systematic memory regressions.

Why the minimum capacity cutoff?

For small files, preallocation has limited value. Reserving fewer than 128 tokens only skips small, cheap growth steps, while increasing the number of small files whose initial allocation exceeds their ordinary final capacity.

Using a 128-token cutoff keeps small files on normal Vec growth and concentrates the optimization on files where it saves meaningful work.

Why round down with ilog2()?
The power-of-two rounding aligns the initial capacity with the capacity buckets produced by geometric Vec growth.
For example, a file with 748 tokens normally grows to a capacity of 1024. An arbitrary initial capacity of 660 grows to 1320, increasing peak memory. Rounding the estimate down to 512 instead produces:

normal growth:       ... -> 512 -> 1024
preallocated growth: 512 -> 1024

The allocation starts later, but the final capacity remains in the same bucket as normal growth.

@MichaReiser MichaReiser changed the title Approximate token vec size to reduce regrows Reduce memory usage by dropping token-excess capacity and improve performance by approximating the initial tokens vec size May 26, 2026
@MichaReiser MichaReiser merged commit 3856288 into main May 26, 2026
58 checks passed
@MichaReiser MichaReiser deleted the micha/view-benchmark-token-capacity branch May 26, 2026 17:08
anishgirianish pushed a commit to anishgirianish/ruff that referenced this pull request May 28, 2026
…formance by approximating the initial tokens vec size (astral-sh#25354)

Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parser Related to the parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants