perf(generate): drop ast_table after instantiate_chunks#9554

Closed
Boshen wants to merge 1 commit into main from perf/drop-ast-table-after-instantiate-chunks


@Boshen Boshen commented May 26, 2026

Member

Summary

Per-module bumpalo arenas live in LinkStageOutput.ast_table and dominate rolldown's peak heap during bundling (~62 % on the rome benchmark, 1,193 modules). instantiate_chunks is the last reader of those ASTs — nothing between it and end-of-bundle needs the per-module ASTs, but they currently survive past minify_chunks (which re-parses chunk output into fresh arenas) and finalize_assets (which allocates large output buffers).

This PR drops ast_table immediately after instantiate_chunks returns, freeing the bumpalo arenas before the post-codegen stages allocate.
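
A minimal sketch of the early-release pattern described above, not rolldown's actual code. `ModuleAst`, `LinkOutput`, and the bodies of `instantiate_chunks` and `generate` are hypothetical stand-ins; only the idea of emptying `ast_table` right after its last reader comes from this PR.

```rust
use std::mem;

/// Hypothetical stand-in for a per-module arena-backed AST.
struct ModuleAst {
    arena: Vec<u8>,
}

/// Hypothetical stand-in for `LinkStageOutput`; only the relevant field is modelled.
struct LinkOutput {
    ast_table: Vec<ModuleAst>,
}

/// Stand-in for `instantiate_chunks`: the last stage that reads the ASTs.
fn instantiate_chunks(link: &LinkOutput) -> Vec<String> {
    link.ast_table
        .iter()
        .map(|ast| format!("chunk({} bytes)", ast.arena.len()))
        .collect()
}

/// Runs codegen, then releases the ASTs before later stages allocate.
/// Returns the chunks and the number of ASTs still held afterwards.
fn generate(link: &mut LinkOutput) -> (Vec<String>, usize) {
    let chunks = instantiate_chunks(link);
    // Swap in an empty table and drop the old one immediately,
    // instead of letting it live until `link` goes out of scope.
    drop(mem::take(&mut link.ast_table));
    // Stand-ins for the minify and finalize stages would run here,
    // with the per-module arenas already freed.
    (chunks, link.ast_table.len())
}
```

Calling `generate` on a `LinkOutput` with two modules returns two chunk strings and a remaining count of zero. `mem::take` is used here because the table sits behind a `&mut` borrow; whether the real change uses `mem::take`, an explicit `drop`, or a scope is not stated in this PR text.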

Measured impact (3 runs each, verified locally with dhat + getrusage + /usr/bin/time -ahl)

| Config | main | This PR | Δ |
|---|---:|---:|---:|
| **dhat At t-gmax (peak heap)** | | | |
| rome | 140.51 MiB | 110.15 MiB | −30.4 MiB (−21.6 %) |
| rome --minify | 159.78 MiB | 113.08 MiB | −46.7 MiB (−29.2 %) |
| **ru_maxrss (getrusage, 3-run avg)** | | | |
| rome | 172.05 MiB | 169.39 MiB | −2.66 MiB |
| rome --minify | 183.56 MiB | 181.41 MiB | −2.15 MiB |
| **time -ahl peak RSS (3-run avg)** | | | |
| rome | 180.99 MiB | 178.16 MiB | −2.83 MiB |
| rome --minify | 193.37 MiB | 191.11 MiB | −2.26 MiB |

Larger synthetic and minify workloads from the issue (#9516) scale similarly: ~−24 % on synth 2000 --minify, ~−26 % on synth 5000 --minify. threejs r108 sees no change because its peak is at link, not post-codegen.

ru_maxrss deltas are smaller because both the system allocator and mimalloc retain freed pages, but the heap-level reduction is what matters for memory-pressure scheduling and steady-state RSS across multiple builds (watch mode, dev server).

Alternative considered

A type-system-enforced version (#9555) threads IndexEcmaAst by value from LinkStage::link() through generate → finalize_modules → render_chunk_to_assets → instantiate_chunks → create_chunk_to_codegen_ret_map, removing ast_table from LinkStageOutput entirely. It touches 5 files and saves ~2 MiB more on rome (no minify); identical results on rome --minify. The compile-time guarantee guards against a future reader being reintroduced after codegen — but the practical risk is essentially zero, so this minimal version is preferred.

Refs #9516.

…k heap

Per-module bumpalo arenas live in `LinkStageOutput.ast_table` and are
the largest single component of rolldown's peak heap (~62% on rome,
1,193 modules). `instantiate_chunks` is the last reader; nothing
between it and end-of-bundle needs the per-module ASTs, but they
currently survive past `minify_chunks` (re-parses chunk output into
fresh arenas) and `finalize_assets` (allocates large output buffers).

Releasing the arenas right after `instantiate_chunks` cuts peak heap
~21.6% on rome and ~29.2% on rome --minify, per dhat measurements in
issue #9516.

Refs #9516.

netlify Bot commented May 26, 2026


Deploy Preview for rolldown-rs canceled.

Name Link
🔨 Latest commit b22efc0
🔍 Latest deploy log https://app.netlify.com/projects/rolldown-rs/deploys/6a151676e06e12000839550f


codspeed-hq Bot commented May 26, 2026


Merging this PR will not alter performance

✅ 4 untouched benchmarks
⏩ 10 skipped benchmarks [1]


Comparing perf/drop-ast-table-after-instantiate-chunks (b22efc0) with main (df616cb) [2]

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

  2. No successful run was found on main (b291797) during the generation of this report, so df616cb was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@Boshen Boshen closed this May 26, 2026
@Boshen Boshen deleted the perf/drop-ast-table-after-instantiate-chunks branch May 26, 2026 12:00
graphite-app Bot pushed a commit that referenced this pull request May 26, 2026
## Summary

Alternative to #9554. Same memory win on the realistic case, enforced by the type system instead of by a comment.

Remove `ast_table` from `LinkStageOutput` and change `LinkStage::link()` to return `(LinkStageOutput, IndexEcmaAst)`. The AST table then flows by value through `bundle_up → GenerateStage::new → generate → finalize_modules → render_chunk_to_assets → instantiate_chunks → create_chunk_to_codegen_ret_map`, where the compiler drops it at scope exit (`create_chunk_to_codegen_ret_map` is the last reader). Adding a new post-codegen `ast_table` reader becomes a compile error — there is no field to read.
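
A minimal sketch of the by-value flow described above, under stated assumptions: `AstTable`, `LinkOutput`, `link`, `codegen`, and `generate` are hypothetical stand-ins for `IndexEcmaAst`, `LinkStageOutput`, `LinkStage::link()`, the last reader, and the generate stage. It illustrates the ownership idea only and is not the code in this commit.

```rust
/// Hypothetical stand-in for `IndexEcmaAst`.
struct AstTable(Vec<String>);

/// Hypothetical stand-in for `LinkStageOutput`, now without an `ast_table` field.
struct LinkOutput {
    module_count: usize,
}

/// Stand-in for `LinkStage::link()`: returns the AST table next to the output.
fn link(sources: &[&str]) -> (LinkOutput, AstTable) {
    let asts: Vec<String> = sources.iter().map(|s| s.to_uppercase()).collect();
    (LinkOutput { module_count: sources.len() }, AstTable(asts))
}

/// Stand-in for the last reader. It takes the table by value,
/// so the table is dropped when this function returns.
fn codegen(asts: AstTable) -> Vec<String> {
    asts.0.iter().map(|ast| format!("// {ast}")).collect()
}

/// Stand-in for the generate stage.
fn generate(output: &LinkOutput, asts: AstTable) -> Vec<String> {
    let rendered = codegen(asts);
    // `asts` has been moved into `codegen`; reading it here would be rejected
    // by the compiler (E0382, use of moved value), and `output` has no AST
    // field to fall back on.
    assert_eq!(rendered.len(), output.module_count);
    rendered
}
```

With `let (output, asts) = link(&["a", "b"]);`, calling `generate(&output, asts)` yields `["// A", "// B"]`. Any later stage that tried to touch the table would need it passed in explicitly, which is the compile-time guard this commit describes.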

## Measured impact (3 runs each, verified locally with dhat + getrusage + `/usr/bin/time -ahl`)

| Config | main | This PR | Δ vs main | Δ vs #9554 |
|---|---:|---:|---:|---:|
| **dhat At t-gmax (peak heap)** | | | | |
| rome | 140.51 MiB | 108.07 MiB | **−32.4 MiB (−23.1 %)** | −2.08 MiB |
| rome --minify | 159.78 MiB | 113.09 MiB | **−46.7 MiB (−29.2 %)** | +0.01 MiB |
| **ru_maxrss (getrusage, 3-run avg)** | | | | |
| rome | 172.05 MiB | 170.56 MiB | −1.49 MiB | +1.17 MiB |
| rome --minify | 183.56 MiB | 175.71 MiB | −7.85 MiB | −5.70 MiB |
| **time -ahl peak RSS (3-run avg)** | | | | |
| rome | 180.99 MiB | 179.31 MiB | −1.68 MiB | +1.15 MiB |
| rome --minify | 193.37 MiB | 184.69 MiB | −8.68 MiB | −6.42 MiB |

The ~2 MiB additional saving on rome no-minify vs #9554 comes from dropping one frame earlier (inside `instantiate_chunks` rather than after it returns) — which matters when peak heap is reached during `instantiate_chunks`'s own `try_join_all`. On rome --minify, peak is firmly in `finalize_assets`, far past both drop points, so the two PRs are identical at the dhat level (the OS-RSS gap there is run-to-run noise — system malloc's page caching is sticky).
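
The difference in drop timing can be sketched with a value that logs its own destruction. Everything here is hypothetical scaffolding: `AstTable`, `join_all` (a stand-in for the `try_join_all` work), and the two `instantiate_*` functions are illustrative, and the exact position of the drop relative to `try_join_all` in the real code is assumed from the paragraph above.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical AST table that records when it is dropped.
struct AstTable(Log);

impl Drop for AstTable {
    fn drop(&mut self) {
        self.0.borrow_mut().push("ast_table dropped");
    }
}

/// Stand-in for the concurrent rendering work inside `instantiate_chunks`.
fn join_all(log: &Log) {
    log.borrow_mut().push("join work");
}

/// #9554-style: the callee borrows the table, the caller drops it after the call returns.
fn drop_after_return(log: &Log) {
    let asts = AstTable(log.clone());
    instantiate_borrowed(&asts, log);
    drop(asts);
}

fn instantiate_borrowed(_asts: &AstTable, log: &Log) {
    join_all(log);
}

/// #9555-style: the callee owns the table and its last reader consumes it first.
fn drop_inside(log: &Log) {
    let asts = AstTable(log.clone());
    instantiate_owned(asts, log);
}

fn instantiate_owned(asts: AstTable, log: &Log) {
    last_reader(asts); // the table is dropped when this call returns
    join_all(log);
}

/// Stand-in for `create_chunk_to_codegen_ret_map`; owns the table until scope exit.
fn last_reader(_asts: AstTable) {}
```

Running `drop_after_return` logs the join work before the drop, while `drop_inside` logs the drop first. In the second ordering the arenas are already gone while the join work allocates, which is the case where the earlier drop point lowers the peak.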

## Trade-offs vs #9554

- Pros: Compile-time guarantee that no post-codegen reader can exist.
- Cons: 5 files instead of 1; signature changes through the codegen pipeline.

Pick one to land — they are mutually exclusive. The minimal version (#9554) is simpler; this one is what you'd reach for if the team values type-system enforcement over diff size.

Refs #9516.
IWANABETHATGUY pushed a commit that referenced this pull request May 26, 2026