Reduce max RSS during bundling: memory map + opportunities #9516

## Summary

Investigation into where rolldown's peak heap lives during bundling, with one small fix already shipped and a prioritized list of larger opportunities. This issue is a reference for future memory work — concrete numbers, real call-stack attribution, and where the next wins are.

## Methodology

- Workload: rome bundling fixture (1,193 modules — the standard rolldown bench preset).
- Tool: `dhat` profiler wrapping the system allocator, run via a small `mem_profile` binary in `crates/bench/` that drives `BundleFactory` against a `MemoryFileSystem`. Built with `--profile=release-debug` + `strip = false` so dhat traces resolve to symbol names.
- Comparisons done by toggling a patch with `git apply` / `git checkout` and rebuilding clean between runs.
- All numbers are dhat `max_bytes` (peak heap demand) unless labeled `ru_maxrss` (OS-level high-water).

## Where peak heap lives (rome, 140.5 MiB peak)

By code component, attributed to the deepest rolldown/oxc frame in each dhat call stack:

| Component | Peak | % | Notes |
|---|---:|---:|---|
| **oxc bumpalo arena (AST)** | **87.4 MiB** | **62 %** | One arena per module, averaging ~75 KiB. The initial 16 KiB chunk is first requested from `oxc_parser::lexer::trivia_builder`; 97 % of modules also need a second 16 KiB chunk, requested from `oxc_semantic::scoping::Scoping::reserve`. |
| MemoryFileSystem (bench-only) | 17.9 MiB | 13 % | Not present in real disk-FS runs. |
| **oxc_semantic** | **13.7 MiB** | **10 %** | `Scoping::reserve` (~19.6 MiB of capacity reserves), `AstNodes::reserve` (just 4 huge blocks, ~2 MiB each), `add_binding`, `create_symbol`. **Lives on the heap, not in the bumpalo arena.** |
| rolldown_common | 5.0 MiB | 3.6 % | `EcmaView` / `NormalModule` and per-module structs |
| ArcStr | 3.3 MiB | 2.3 % | Reference-counted source strings |
| link/scan/ast_scanner | ~7 MiB | ~5 % | Linking metadata, scan-time scratch |
| Sourcemap join | 1.8 MiB | 1.3 % | `SourceJoiner::join` |
| SymbolRefDbForModule | 1.8 MiB | 1.3 % | Per-module symbol DB |
| StmtInfo | 1.3 MiB | 0.9 % | 80 B × statements |

By stage where peak is reached:

| Stage | Peak contribution |
|---|---:|
| parse (oxc_parser + AST arena chunks) | 41.6 % |
| pre_process (oxc_semantic Scoping build) | 16.2 % |
| ast_scanner | 5.7 % |
| generate (codegen, finalize, sourcemap) | 11 % |
| link | 1.5 % |
| Unattributed (low-level alloc / IndexVec / hashbrown) | 22 % |

**Time at gmax**: ~99 % of runtime (during `finalize_assets` / sourcemap join), but the largest contributors are allocated during parse/pre_process and never freed before peak.

For comparison, three.js r108 (370 modules, larger per-module average) hits peak at **~60 %** of runtime (during link), with **~760 MiB** dominated by AST arenas at ~2 MiB per module. Time-at-gmax for threejs is parse/link-bound, not generate-bound.

## Block-size distribution (rome)

| Size class | % of peak | Description |
|---|---:|---|
| 8-16 KB | 22 % | bumpalo 16 KiB default initial chunks (one or two per module) |
| 1-2 KB | 18 % | bumpalo growth chunks + small `Vec`/`IndexVec` content |
| ≥1 MB | 19 % | ~15 huge buffers: source-code strings, sourcemap output, large `IndexVec` resizes |
| 16-256 KB | 16 % | Mid bumpalo growth chunks, chunk output |
| 256 KB-1 MB | 11 % | Big buffers |
| <512 B | 11 % | Tens of thousands of small structs — `SymbolRef`, hashmap nodes |

## What's already shipped

A single-line `drop(std::mem::take(&mut self.link_output.ast_table))` after `instantiate_chunks` in `render_chunk_to_assets`. `instantiate_chunks` is the last reader of the per-module ASTs; releasing the bumpalo arenas there before `minify_chunks` re-parses chunk output into fresh arenas avoids the worst stacking.
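
The mechanics of that one statement can be sketched in isolation. This is a minimal illustration, not rolldown's code: `Stage`, `LinkOutput`, and the `Vec<Vec<u8>>` table are hypothetical stand-ins for the real types. `std::mem::take` swaps the field for its empty `Default` value and hands back the old value, which is then dropped on the spot rather than when the owning struct goes out of scope.

```rust
/// Hypothetical stand-in for the link-stage output; only the shape matters.
struct LinkOutput {
    /// Stand-in for `ast_table`: one heap buffer per module.
    ast_table: Vec<Vec<u8>>,
}

/// Hypothetical stand-in for the stage that owns the link output.
struct Stage {
    link_output: LinkOutput,
}

impl Stage {
    /// Replace the table with an empty default and drop the old value now,
    /// instead of waiting for `self` to go out of scope.
    /// Returns how many bytes of module buffers were released.
    fn release_asts(&mut self) -> usize {
        let table = std::mem::take(&mut self.link_output.ast_table);
        let released: usize = table.iter().map(Vec::len).sum();
        drop(table); // every per-module buffer is freed here
        released
    }
}

/// Build a three-module table, release it, and report
/// (bytes released, modules left in the table).
fn demo_release() -> (usize, usize) {
    let mut stage = Stage {
        link_output: LinkOutput {
            ast_table: vec![vec![0u8; 1024]; 3],
        },
    };
    let released = stage.release_asts();
    (released, stage.link_output.ast_table.len())
}
```

After the call the field is still valid (an empty table), so later code that touches it compiles and runs unchanged; it simply sees no modules.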

**Measured impact** (rome, dhat `max_bytes`):

| Config | BEFORE | AFTER | Δ |
|---|---:|---:|---:|
| rome | 140.51 MiB | 110.11 MiB | **−30.4 MiB (−21.6 %)** |
| rome --minify | 159.78 MiB | 113.04 MiB | **−46.7 MiB (−29.3 %)** |
| rome --sourcemap --minify | 171.45 MiB | ~125 MiB | ~ −46 MiB |
| synth 2000 --minify | 121.6 MiB | 92.0 MiB | −29.6 MiB (−24 %) |
| synth 5000 --minify | 309.8 MiB | 229.2 MiB | −80.6 MiB (−26 %) |
| threejs r108 | 760.0 MiB | 760.0 MiB | 0 (see below) |

**Tests**: 1,714 / 1,714 affected tests pass; the 5 remaining failures are pre-existing and also fail on `main`.

The `ru_maxrss` change is smaller (~ −2 to −14 MiB depending on workload) because both the system allocator and mimalloc retain freed pages rather than returning them to the OS. The heap-level reduction is what matters for memory-pressure scheduling and for steady-state RSS across multiple builds (watch mode, dev server).

## Why threejs sees no improvement

threejs r108 has only 370 modules, but each has a much larger bumpalo arena (~2 MiB on average vs. rome's ~75 KiB). Its peak heap is reached at link/pre-codegen, before the drop runs, so releasing the AST table after codegen frees nothing that was live at the peak. **The drop helps whenever post-codegen stages (minify, sourcemap collapse, finalize_assets) push heap above the link-time peak.** Many small arenas plus minify is a big win; a few huge arenas that peak at link time see none.
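
The rule in bold can be checked with a toy heap timeline. This is a simplified model with made-up sizes (in MiB), not measured data: ASTs are allocated, a link-time scratch spike comes and goes, and then post-codegen work allocates on top. The `Event`, `peak`, and `timeline` names are illustrative only.

```rust
/// One step in a simplified heap timeline (sizes in MiB).
enum Event {
    Alloc(u64),
    Free(u64),
}

/// Highest live total over the timeline, the quantity dhat calls `max_bytes`.
fn peak(events: &[Event]) -> u64 {
    let mut live = 0u64;
    let mut max = 0u64;
    for e in events {
        match e {
            Event::Alloc(n) => live += *n,
            Event::Free(n) => live = live.saturating_sub(*n),
        }
        max = max.max(live);
    }
    max
}

/// ASTs stay live through link; `link_scratch` is a temporary link-time
/// spike; `post_codegen` is minify/sourcemap/finalize memory.
/// `drop_ast` models releasing the AST table before the post-codegen work.
fn timeline(ast: u64, link_scratch: u64, post_codegen: u64, drop_ast: bool) -> Vec<Event> {
    let mut events = vec![
        Event::Alloc(ast),
        Event::Alloc(link_scratch),
        Event::Free(link_scratch),
    ];
    if drop_ast {
        events.push(Event::Free(ast));
    }
    events.push(Event::Alloc(post_codegen));
    events
}
```

With a rome-like shape, `timeline(80, 10, 60, _)`, the peak falls from 140 to 90 once the drop is enabled, because the post-codegen stage no longer stacks on the ASTs. With a threejs-like shape, `timeline(700, 60, 40, _)`, the peak is 760 either way: it is set at link time, before the drop can run.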

## What was explored but not shipped

### Box-pattern shrinking of rare `EcmaView` fields

Eight `FxHashMap`/`FxHashSet`/`HmrInfo` fields on `EcmaView` that are empty for most modules were rewritten as `Option<Box<...>>` (HMR info, enum_member_value_map, new_url_references, this_expr_replace_map, dummy_record_set, self_referenced_class_decl_symbol_ids, constant_export_map, import_attribute_map).

**Result**: `NormalModule` shrank 992 → 768 bytes, `EcmaView` shrank 872 → 648 bytes. Locked in by `const_assert!`. For rome's 1,193 modules: **260 KB peak heap saved (~0.18 %)**. Real but tiny next to the AST-drop win. Touches 15 files. Not worth bundling with the AST-drop fix; could be its own follow-up but the impact is marginal compared to oxc-side opportunities (see below).
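
For readers unfamiliar with the pattern, here is a minimal sketch of it. The struct and field names are hypothetical, and std's `HashMap`/`HashSet` stand in for `FxHashMap`/`FxHashSet`, so the absolute sizes differ from the ones reported above. An inline map costs its full struct size in every module even when empty, while an `Option<Box<_>>` costs one pointer until the first write.

```rust
use std::collections::{HashMap, HashSet};
use std::mem::size_of;

/// Rarely used maps stored inline: every module pays for every field.
struct Inline {
    enum_values: HashMap<u32, u32>,
    url_refs: HashSet<u32>,
}

/// Same fields boxed behind an Option: an empty field costs one pointer.
struct Boxed {
    enum_values: Option<Box<HashMap<u32, u32>>>,
    url_refs: Option<Box<HashSet<u32>>>,
}

impl Boxed {
    /// Allocate the map lazily, on first write only.
    fn enum_values_mut(&mut self) -> &mut HashMap<u32, u32> {
        let boxed = self.enum_values.get_or_insert_with(Box::default);
        &mut **boxed
    }
}

// Compile-time guard in the spirit of `const_assert!`: the build fails
// if the boxed layout ever stops being smaller than the inline one.
const _: () = assert!(size_of::<Boxed>() < size_of::<Inline>());
```

The trade-off is an extra allocation and pointer hop for the minority of modules that do populate a field, which is why it only pays off for fields that are empty most of the time.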

### Full type-system-enforced AST lifetime refactor

`LinkStage::link()` returning `(LinkStageOutput, IndexEcmaAst)` with the AST table flowing by value through `generate → finalize_modules → render_chunk_to_assets → instantiate_chunks → create_chunk_to_codegen_ret_map`, where the compiler drops it at the consumer's exit.

**Result**: ~2 MiB additional savings on rome (no minify) vs. the simple `mem::take` drop. **Identical** results on rome --minify (the realistic case). Touches 5 files with signature changes through the entire codegen pipeline. The type-system enforcement is nice in theory but the practical risk it guards against (someone reading `link_output.ast_table` after codegen) is essentially zero. **Skipped in favor of the 6-line fix.**

### `Allocator::with_capacity(source.len())` for parser

Tried sizing each module's bumpalo arena initial chunk based on source length instead of bumpalo's 16 KiB default.

**Result**: **regressed by ~3 MiB** on rome. bumpalo's chunk-doubling growth means starting too small forces several extra growth chunks, and the doubling overshoots more than the default does for typical modules. A smarter heuristic (e.g. `max(source.len() * 4, 16384)`) might work but needs cross-corpus validation; naïve sizing is worse.
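
To make the two sizing rules concrete, here is a small sketch of the naïve rule that was tried next to the floored heuristic suggested above. Both function names are hypothetical, the 16 KiB floor is the default figure reported in this issue, and the 4x factor is the untested guess from the text rather than a validated constant.

```rust
/// First-chunk size reported in this issue for the default arena (16 KiB).
const DEFAULT_CHUNK: usize = 16 * 1024;

/// The rule that regressed: first chunk sized exactly to the source length.
/// Small modules start below the default and need extra growth chunks.
fn naive_capacity(source_len: usize) -> usize {
    source_len
}

/// Hypothetical alternative: a multiple of the source length, never below
/// the default chunk size. Unvalidated; needs cross-corpus measurement.
fn floored_capacity(source_len: usize) -> usize {
    source_len.saturating_mul(4).max(DEFAULT_CHUNK)
}
```

For a 2 KiB source the naïve rule starts the arena at 2,048 bytes, an eighth of the default, while the floored rule keeps the 16,384-byte start; only sources above 4 KiB get a larger first chunk.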

## Opportunities (prioritized)

### Tier A — oxc-side (largest potential)

1. **Reduce oxc bumpalo arena per-module footprint (~87 MiB / 62 % of rome peak).** Each module's arena averages ~75 KiB; 97 % of modules need a second 16 KiB chunk because `Scoping::reserve` runs early and fills past the lexer's initial chunk. Two angles:
   - Right-size the initial chunk per module — needs cross-corpus tuning (see "explored" above for what doesn't work).
   - Provide an arena-shrink API in oxc that compacts after parse (~10-20 MiB recoverable; many arenas have substantial trailing free space after parse completes).
2. **Cap or shrink `oxc_semantic::Scoping` reservations (~14 MiB / 10 %).** Top oxc_semantic allocator is `Scoping::reserve` at 19.6 MiB across 2,365 blocks (~8 KiB per block, roughly two blocks per module). Just 4 `AstNodes::reserve` blocks contribute 8.6 MiB. Likely over-reserving for worst-case capacity; `shrink_to_fit` after build would help.
3. **Smaller AST node representations** — purely an oxc footprint reduction; the bulk of every module's arena is AST nodes + spans.
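
The `shrink_to_fit` idea in item 2 can be illustrated with a plain `Vec`. This is only a sketch of the reserve-then-shrink pattern under the assumption that the reserved storage behaves like a standard vector; it says nothing about how `Scoping` is actually laid out, and `build_then_shrink` is a hypothetical name.

```rust
/// Reserve generously up front (worst case), fill only what the module
/// needs, then hand the unused tail back to the allocator.
/// Returns (capacity before shrinking, capacity after shrinking).
fn build_then_shrink(reserved: usize, used: usize) -> (usize, usize) {
    let mut symbols: Vec<u32> = Vec::with_capacity(reserved);
    symbols.extend(0..used as u32);
    let before = symbols.capacity();
    symbols.shrink_to_fit(); // capacity drops toward `len`
    (before, symbols.capacity())
}
```

Reserving 2,048 `u32` slots is an 8 KiB block, the same order as the per-block figure above; if a module only uses 100 of them, most of that block is recoverable after the build finishes. The cost is one reallocation per shrunk buffer, so this fits best at a point where the structure is no longer growing.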

### Tier B — rolldown-side

4. **Stream output instead of buffering.** ~33 MiB of peak heap (rome, 19 % of total) is in 15 huge ≥1 MiB single allocations: large source-code strings, sourcemap join output, output-buffer assembly. Writing chunks/sourcemaps as they're produced rather than buffering up to `finalize_assets` would cut into this.
5. **Drop AST per-module during codegen, not after.** This issue's fix drops the whole `ast_table` after codegen. A more granular version would take each module out of the table when its chunk is rendered, freeing arenas progressively. Estimated additional ~5-15 MiB on rome. threejs would still see no benefit: its peak is at link time, before codegen starts, so no drop during codegen can lower it.
6. **Shrink per-module struct overhead.** Box-pattern on rare `EcmaView` fields (explored above): about 220 B/module = ~260 KB on rome, scales linearly. Net positive but tiny.
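
A rough sketch of the per-module drop in item 5, under two assumptions that would need checking against the real pipeline: each module's AST is read by exactly one chunk, and the table can hold `Option` slots. `ModuleAst` and `render_progressively` are hypothetical names, and the `Vec<u8>` stands in for an arena.

```rust
/// Hypothetical stand-in for a per-module AST and its arena.
struct ModuleAst {
    arena: Vec<u8>,
}

/// Render chunk by chunk, taking each module's AST out of the table as soon
/// as it has been consumed so its arena is freed before the next chunk.
/// Returns the arena bytes still live after each chunk.
fn render_progressively(
    table: &mut [Option<ModuleAst>],
    chunks: &[Vec<usize>],
) -> Vec<usize> {
    let mut live_after_chunk = Vec::new();
    for chunk in chunks {
        for &idx in chunk {
            if let Some(ast) = table[idx].take() {
                // Codegen would read `ast` here; it is freed right after.
                drop(ast);
            }
        }
        let live: usize = table.iter().flatten().map(|m| m.arena.len()).sum();
        live_after_chunk.push(live);
    }
    live_after_chunk
}
```

With four 100-byte modules split into chunks `[0, 1]`, `[2]`, `[3]`, live arena bytes step down 200, 100, 0 as chunks are rendered, instead of staying at 400 until the whole table is dropped. If chunks are rendered in parallel the table would need per-slot ownership or synchronization, which this sketch does not address.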

### Tier C — bench-only / measurement infra

7. `MemoryFileSystem` in bench preloads every file's bytes (~18 MiB on rome). Not present in real production runs. Worth noting only because it inflates bench numbers: production peak is ~18 MiB lower than what the bench reports.

## Measurement scaffolding (not in the PR)

If anyone wants to reproduce or extend these numbers:

- `crates/bench/src/bin/mem_profile.rs`: small harness that drives `BundleFactory` against a preloaded `MemoryFileSystem` and prints `dhat` `max_bytes` + `getrusage` `ru_maxrss`. Built with `--features dhat` (workspace dep `dhat = "0.3.3"`).
- Run: `cargo run -p bench --release --features dhat --bin mem_profile -- rome [--minify] [--sourcemap]`.
- For resolved stack traces in `dhat-heap.json`, build with `--profile=release-debug` and `strip = false` in the profile.

Happy to upstream the harness as a separate PR if useful.

## Related PR

The 6-line fix:

```diff
+    // `instantiate_chunks` is the last reader of `ast_table`. Release the
+    // per-module bumpalo arenas now so the heap dip is in place before
+    // `minify_chunks` re-parses chunk output into fresh arenas (~30 MiB peak
+    // heap reduction on rome).
+    drop(std::mem::take(&mut self.link_output.ast_table));
```

In `crates/rolldown/src/stages/generate_stage/render_chunk_to_assets.rs`, immediately after `instantiate_chunks` returns.
