Summary
Investigation into where rolldown's peak heap lives during bundling, with one small fix already shipped and a prioritized list of larger opportunities. This issue is a reference for future memory work — concrete numbers, real call-stack attribution, and where the next wins are.
Methodology
- Workload: rome bundling fixture (1,193 modules — the standard rolldown bench preset).
- Tool: `dhat` profiler wrapping the system allocator, run via a small `mem_profile` binary in `crates/bench/` that drives `BundleFactory` against a `MemoryFileSystem`. Built with `--profile=release-debug` + `strip = false` so dhat traces resolve to symbol names.
- Comparisons done by toggling a patch with `git apply` / `git checkout` and rebuilding clean between runs.
- All numbers are dhat `max_bytes` (peak heap demand) unless labeled `ru_maxrss` (OS-level high-water).
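As a rough illustration of what a `max_bytes`-style number measures, here is a minimal std-only sketch of a wrapper around the system allocator that tracks live bytes and their high-water mark. This is not the `mem_profile` harness and not how dhat is implemented; `PeakAlloc`, `live_bytes`, and `peak_bytes` are hypothetical names used only for this example.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts bytes.
pub struct PeakAlloc;

static LIVE: AtomicUsize = AtomicUsize::new(0); // bytes currently allocated
static PEAK: AtomicUsize = AtomicUsize::new(0); // highest value LIVE has reached

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Bytes live right now.
pub fn live_bytes() -> usize {
    LIVE.load(Ordering::Relaxed)
}

/// Peak heap demand so far (the analogue of a `max_bytes` figure).
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

The peak only ever grows, so freeing memory lowers `live_bytes()` but not `peak_bytes()`. That is why the timing of a drop relative to later allocations matters more than the drop itself.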
Where peak heap lives (rome, 140.5 MiB peak)
By code component, attributed to the deepest rolldown/oxc frame in each dhat call stack:
| Component | Peak | % | Notes |
| --- | --- | --- | --- |
| oxc bumpalo arena (AST) | 87.4 MiB | 62 % | One arena per module, average ~75 KiB. The initial 16 KiB chunk is triggered by `oxc_parser::lexer::trivia_builder`; 97 % of modules also need a second 16 KiB chunk from `oxc_semantic::scoping::Scoping::reserve`. |
| `MemoryFileSystem` (bench-only) | 17.9 MiB | 13 % | Not present in real disk-FS runs. |
| oxc_semantic | 13.7 MiB | 10 % | `Scoping::reserve` (~19.6 MB of capacity reserves), `AstNodes::reserve` (just 4 huge blocks ~2 MB each), `add_binding`, `create_symbol`. Lives on the heap, not in the bumpalo arena. |
| rolldown_common | 5.0 MiB | 3.6 % | `EcmaView` / `NormalModule` and per-module structs |
| ArcStr | 3.3 MiB | 2.3 % | Reference-counted source strings |
| link/scan/ast_scanner | ~7 MiB | ~5 % | Linking metadata, scan-time scratch |
| Sourcemap join | 1.8 MiB | 1.3 % | `SourceJoiner::join` |
| SymbolRefDbForModule | 1.8 MiB | 1.3 % | Per-module symbol DB |
| StmtInfo | 1.3 MiB | 0.9 % | 80 B × statements |
By stage where peak is reached:
| Stage | Peak contribution |
| --- | --- |
| parse (oxc_parser + AST arena chunks) | 41.6 % |
| pre_process (oxc_semantic Scoping build) | 16.2 % |
| ast_scanner | 5.7 % |
| generate (codegen, finalize, sourcemap) | 11 % |
| link | 1.5 % |
| Unattributed (low-level alloc / IndexVec / hashbrown) | 22 % |
Time at gmax: ~99 % of runtime (during finalize_assets / sourcemap join), but the largest contributors are allocated during parse/pre_process and never freed before peak.
For comparison, three.js r108 (370 modules, larger per-module average) hits peak at ~60 % of runtime (during link), with ~760 MiB dominated by AST arenas at ~2 MiB per module. Time-at-gmax for threejs is parse/link-bound, not generate-bound.
Block-size distribution (rome)
| Size class | % of peak | Description |
| --- | --- | --- |
| 8-16 KB | 22 % | bumpalo 16 KiB default initial chunks (one or two per module) |
| 1-2 KB | 18 % | bumpalo growth chunks + small `Vec`/`IndexVec` content |
| ≥1 MB | 19 % | ~15 huge buffers: source-code strings, sourcemap output, large `IndexVec` resizes |
| 16-256 KB | 16 % | Mid bumpalo growth chunks, chunk output |
| 256 KB-1 MB | 11 % | Big buffers |
| <512 B | 11 % | Tens of thousands of small structs: `SymbolRef`, hashmap nodes |
What's already shipped
A single-line `drop(std::mem::take(&mut self.link_output.ast_table))` after `instantiate_chunks` in `render_chunk_to_assets`. `instantiate_chunks` is the last reader of the per-module ASTs; releasing the bumpalo arenas there, before `minify_chunks` re-parses chunk output into fresh arenas, avoids the worst stacking.
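For readers unfamiliar with the `drop(mem::take(..))` idiom, here is a minimal sketch of how it frees one field early while the rest of the struct stays usable. `AstTable`, `LinkOutput`, and `release_ast_table` are simplified stand-ins invented for this example, not rolldown's actual types.

```rust
use std::mem;

/// Stand-in for the per-module AST table (the real one owns bumpalo arenas).
#[derive(Default)]
struct AstTable {
    arenas: Vec<Vec<u8>>,
}

/// Stand-in for the link-stage output that later stages keep borrowing.
#[derive(Default)]
struct LinkOutput {
    ast_table: AstTable,
    chunk_names: Vec<String>,
}

/// Releases the AST table early and returns how many bytes it stored.
fn release_ast_table(out: &mut LinkOutput) -> usize {
    // `mem::take` swaps in `AstTable::default()` (an empty table) and hands
    // the old value back by ownership, so only `&mut` access is needed.
    let old = mem::take(&mut out.ast_table);
    let freed: usize = old.arenas.iter().map(Vec::len).sum();
    drop(old); // the arenas are deallocated here, not when the stage ends
    freed
}
```

The field is left in a valid, empty state, which is why no signature changes are needed: later code that still holds `&mut LinkOutput` keeps compiling and simply sees an empty table.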
Measured impact (rome, dhat max_bytes):
| Config | BEFORE | AFTER | Δ |
| --- | --- | --- | --- |
| rome | 140.51 MiB | 110.11 MiB | −30.4 MiB (−21.6 %) |
| rome --minify | 159.78 MiB | 113.04 MiB | −46.7 MiB (−29.3 %) |
| rome --sourcemap --minify | 171.45 MiB | ~125 MiB | ~ −46 MiB |
| synth 2000 --minify | 121.6 MiB | 92.0 MiB | −29.6 MiB (−24 %) |
| synth 5000 --minify | 309.8 MiB | 229.2 MiB | −80.6 MiB (−26 %) |
| threejs r108 | 760.0 MiB | 760.0 MiB | 0 (see below) |
Tests: 1,714 / 1,714 affected tests pass (same 5 pre-existing failures as on main).
The ru_maxrss change is smaller (~ −2 to −14 MiB depending on workload) because both the system allocator and mimalloc retain freed pages rather than returning them to the OS. The heap-level reduction is what matters for memory-pressure scheduling and for steady-state RSS across multiple builds (watch mode, dev server).
Why threejs sees no improvement
threejs r108 has only 370 modules but each has a much larger bumpalo arena (~2 MiB avg vs rome's ~75 KiB). Peak heap is reached at link/pre-codegen, not after codegen, so dropping the AST table after codegen has nothing to free that was contributing to a post-codegen peak. The drop helps whenever post-codegen stages (minify, sourcemap collapse, finalize_assets) push heap above the link-time peak. Many small chunks + minify = big win; one giant arena = no win.
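The difference between the two workloads can be sketched with a toy allocation timeline. The sizes below are arbitrary illustrative units, not measurements, and `Event`, `peak`, and `timeline` are names invented for this example.

```rust
/// One step in a simplified allocation timeline (sizes in arbitrary units).
#[derive(Clone, Copy)]
enum Event {
    Alloc(u64),
    Free(u64),
}

/// Returns the high-water mark of live units over the timeline.
fn peak(events: &[Event]) -> u64 {
    let mut live = 0u64;
    let mut max = 0u64;
    for e in events {
        match *e {
            Event::Alloc(n) => live += n,
            Event::Free(n) => live = live.saturating_sub(n),
        }
        max = max.max(live);
    }
    max
}

/// ASTs are allocated at parse, link scratch comes and goes, then post-codegen
/// work (minify, sourcemaps) allocates. `early_drop` frees the ASTs before it.
fn timeline(ast: u64, link_scratch: u64, post_codegen: u64, early_drop: bool) -> Vec<Event> {
    let mut t = vec![
        Event::Alloc(ast),
        Event::Alloc(link_scratch),
        Event::Free(link_scratch),
    ];
    if early_drop {
        t.push(Event::Free(ast));
    }
    t.push(Event::Alloc(post_codegen));
    if !early_drop {
        t.push(Event::Free(ast));
    }
    t.push(Event::Free(post_codegen));
    t
}
```

With a post-codegen-heavy shape such as `timeline(60, 10, 40, _)`, the early drop lowers the peak from 100 to 70 units. With a link-heavy shape such as `timeline(60, 30, 10, _)`, the peak is 90 either way, because it is reached while the ASTs are still needed.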
What was explored but not shipped
Box-pattern shrinking of rare EcmaView fields
Eight `FxHashMap`/`FxHashSet`/`HmrInfo` fields on `EcmaView` that are empty for most modules were rewritten as `Option<Box<...>>` (HMR info, enum_member_value_map, new_url_references, this_expr_replace_map, dummy_record_set, self_referenced_class_decl_symbol_ids, constant_export_map, import_attribute_map).
Result: `NormalModule` shrank 992 → 768 bytes, `EcmaView` shrank 872 → 648 bytes. Locked in by `const_assert!`. For rome's 1,193 modules: 260 KB peak heap saved (~0.18 %). Real but tiny next to the AST-drop win. Touches 15 files. Not worth bundling with the AST-drop fix; could be its own follow-up but the impact is marginal compared to oxc-side opportunities (see below).
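The box pattern can be sketched as follows, assuming std's `HashMap` in place of `FxHashMap` and two made-up structs (`InlineView`, `BoxedView`) in place of `EcmaView`. The exact byte counts will differ from the ones reported above.

```rust
use std::collections::HashMap;
use std::mem::size_of;

type RareMap = HashMap<u32, u32>;

/// Rarely-used maps stored inline: each field costs its full size even when empty.
#[derive(Default)]
struct InlineView {
    a: RareMap,
    b: RareMap,
    c: RareMap,
}

/// Same fields behind `Option<Box<_>>`: one pointer each while unused.
#[derive(Default)]
struct BoxedView {
    a: Option<Box<RareMap>>,
    b: Option<Box<RareMap>>,
    c: Option<Box<RareMap>>,
}

// `Option<Box<T>>` is guaranteed to be pointer-sized (null-pointer optimization).
const _: () = assert!(size_of::<Option<Box<RareMap>>>() == size_of::<usize>());
// Compile-time size guard, similar in spirit to a `const_assert!`.
const _: () = assert!(size_of::<BoxedView>() < size_of::<InlineView>());

impl BoxedView {
    /// Allocates the map only on first use.
    fn a_mut(&mut self) -> &mut RareMap {
        self.a.get_or_insert_with(Box::default)
    }
}
```

The trade-off is one extra heap allocation and pointer hop for the modules that do populate a field, which is acceptable only because those fields are empty for most modules.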
Full type-system-enforced AST lifetime refactor
`LinkStage::link()` returning `(LinkStageOutput, IndexEcmaAst)` with the AST table flowing by value through `generate → finalize_modules → render_chunk_to_assets → instantiate_chunks → create_chunk_to_codegen_ret_map`, where the compiler drops it at the consumer's exit.
Result: ~2 MiB additional savings on rome (no minify) vs. the simple `mem::take` drop. Identical results on rome --minify (the realistic case). Touches 5 files with signature changes through the entire codegen pipeline. The type-system enforcement is nice in theory but the practical risk it guards against (someone reading `link_output.ast_table` after codegen) is essentially zero. Skipped in favor of the 6-line fix.
`Allocator::with_capacity(source.len())` for parser
Tried sizing each module's bumpalo arena initial chunk based on source length instead of bumpalo's 16 KiB default.
Result: regressed by ~3 MiB on rome. bumpalo's chunk-doubling growth means starting too small forces several extra growth chunks, and the doubling overshoots more than the default does for typical modules. A smarter heuristic (e.g. `max(source.len() * 4, 16384)`) might work but needs cross-corpus validation — naïve sizing is worse.
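A minimal sketch of the suggested heuristic, together with a deliberately simplified growth model for reasoning about overshoot. The model assumes each new chunk is twice the size of the previous one; that is an assumption for illustration, not bumpalo's exact policy, and both function names are hypothetical.

```rust
/// Heuristic from the text: four times the source length, floored at 16 KiB.
fn initial_chunk_size(source_len: usize) -> usize {
    source_len.saturating_mul(4).max(16 * 1024)
}

/// Simplified growth model (an assumption, not bumpalo's exact policy):
/// each new chunk is twice the previous one until the total covers `needed`.
/// Returns the total bytes reserved across all chunks.
fn reserved_bytes(initial: usize, needed: usize) -> usize {
    let mut chunk = initial.max(1);
    let mut total = chunk;
    while total < needed {
        chunk = chunk.saturating_mul(2);
        total = total.saturating_add(chunk);
    }
    total
}
```

Under this model, an arena that ends up needing 75 KiB reserves 112 KiB when started at 16 KiB (16 + 32 + 64) but 140 KiB when started at 20 KiB (20 + 40 + 80). A slightly different starting size can therefore land on a worse doubling boundary, which is why any heuristic needs validation across corpora.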
Opportunities (prioritized)
Tier A — oxc-side (largest potential)
- Reduce oxc bumpalo arena per-module footprint (~87 MiB / 62 % of rome peak). Each module's arena averages ~75 KiB; 97 % of modules need a second 16 KiB chunk because `Scoping::reserve` runs early and fills past the lexer's initial chunk. Two angles:
  - Right-size the initial chunk per module; needs cross-corpus tuning (see "explored" above for what doesn't work).
  - Provide an arena-shrink API in oxc that compacts after parse (~10-20 MiB recoverable; many arenas have substantial trailing free space after parse completes).
- Cap or shrink `oxc_semantic::Scoping` reservations (~14 MiB / 10 %). Top oxc_semantic allocator is `Scoping::reserve` at 19.6 MiB across 2,365 blocks (~8 KiB per block, roughly two blocks per module). Just 4 `AstNodes::reserve` blocks contribute 8.6 MiB. Likely over-reserving for worst-case capacity; `shrink_to_fit` after build would help.
- Smaller AST node representations — purely an oxc footprint reduction; the bulk of every module's arena is AST nodes + spans.
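The reserve-then-trim idea can be shown with a plain `Vec`. This is only a sketch using std; whether oxc's `Scoping` and `AstNodes` storage can be trimmed the same way is not confirmed here, and `build_then_shrink` is a hypothetical helper.

```rust
/// Builds a table with a worst-case reservation, then trims it once the
/// final length is known. Returns (capacity before, capacity after).
fn build_then_shrink(reserve: usize, actual: usize) -> (usize, usize) {
    let mut symbols: Vec<u32> = Vec::with_capacity(reserve);
    symbols.extend(0..actual as u32);
    let before = symbols.capacity();
    // Gives excess capacity back to the allocator; contents are unchanged.
    symbols.shrink_to_fit();
    (before, symbols.capacity())
}
```

Trimming is a reallocation, so it costs a copy per table. It pays off only when the structure lives long enough after the build to still be resident at the heap peak, which is the case for the reservations described above.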
Tier B — rolldown-side
- Stream output instead of buffering. ~33 MiB of peak heap (rome, 19 % of total) is in 15 huge ≥1 MiB single allocations: large source-code strings, sourcemap join output, output-buffer assembly. Writing chunks/sourcemaps as they're produced rather than buffering up to `finalize_assets` would cut into this.
- Drop AST per-module during codegen, not after. This issue's fix drops the whole `ast_table` after codegen. A more granular version would take each module out of the table when its chunk is rendered, freeing arenas progressively. Estimated additional ~5-15 MiB on rome. threejs would not benefit from this either: its peak is reached at link time, before codegen starts, so nothing freed during codegen lowers it.
- Shrink per-module struct overhead. Box-pattern on rare `EcmaView` fields (explored above) — about 220 B/module = ~260 KB on rome, scales linearly. Net positive but tiny.
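The per-module drop could look roughly like the sketch below, assuming the table can be modeled as a `Vec<Option<_>>` whose slots are taken one at a time. `ModuleAst` and `render_progressively` are hypothetical, and the "render" step is a placeholder for real codegen.

```rust
/// Stand-in for one module's AST arena.
struct ModuleAst {
    bytes: Vec<u8>,
}

/// Renders every module, taking each AST out of the table as it is consumed so
/// its memory is released before the next module is rendered. Returns the
/// rendered sizes and the AST bytes still live after each step.
fn render_progressively(table: &mut Vec<Option<ModuleAst>>) -> (Vec<usize>, Vec<usize>) {
    let mut rendered = Vec::new();
    let mut live_after = Vec::new();
    for i in 0..table.len() {
        if let Some(ast) = table[i].take() {
            rendered.push(ast.bytes.len()); // placeholder for real codegen
            drop(ast); // this module's arena is freed here
        }
        let live: usize = table.iter().flatten().map(|m| m.bytes.len()).sum();
        live_after.push(live);
    }
    (rendered, live_after)
}
```

Compared with dropping the whole table at the end, live AST memory falls step by step while output buffers grow, so the two do not stack at their maximum sizes. This only works if nothing later in codegen needs an already-rendered module's AST.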
Tier C — bench-only / measurement infra
- `MemoryFileSystem` in bench preloads every file's bytes (~18 MiB on rome). Not present in real production runs. Worth noting only because it inflates bench numbers — production peak is ~18 MiB lower than what the bench reports.
Measurement scaffolding (not in the PR)
If anyone wants to reproduce or extend these numbers:
- `crates/bench/src/bin/mem_profile.rs`: small harness that drives `BundleFactory` against a preloaded `MemoryFileSystem` and prints `dhat` `max_bytes` + `getrusage` `ru_maxrss`. Built with `--features dhat` (workspace dep `dhat = "0.3.3"`).
- Run: `cargo run -p bench --release --features dhat --bin mem_profile -- rome [--minify] [--sourcemap]`.
- For resolved stack traces in `dhat-heap.json`, build with `--profile=release-debug` and `strip = false` in the profile.
Happy to upstream the harness as a separate PR if useful.
Related PR
The 6-line fix:
```diff
+ // `instantiate_chunks` is the last reader of `ast_table`. Release the
+ // per-module bumpalo arenas now so the heap dip is in place before
+ // `minify_chunks` re-parses chunk output into fresh arenas (~30 MiB peak
+ // heap reduction on rome).
+ drop(std::mem::take(&mut self.link_output.ast_table));
```
In `crates/rolldown/src/stages/generate_stage/render_chunk_to_assets.rs`, immediately after `instantiate_chunks` returns.