perf(pacquet): avoid per-entry tarball allocations#12131
Conversation
📝 WalkthroughWalkthroughRefactored tarball extraction to borrow file payloads directly from decompressed bytes using computed offsets, eliminating per-entry allocations. Updated function signature, offset validation, CAS writes, manifest parsing, call sites, and tests to consume borrowed slices instead of allocated buffers. ChangesTarball extraction memory optimization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Review Summary by QodoOptimize tarball extraction with zero-copy slice borrowing WalkthroughsDescription• Eliminate per-entry tarball allocations by borrowing file payloads as buffer slices • Use entries_with_seek + raw_file_position to access tar entry data directly • Add explicit validation for entry offsets, sizes, and bounds checking • Update function signature to accept decompressed tar bytes instead of Archive reference Diagramflowchart LR
A["Decompressed tar buffer"] -->|"entries_with_seek"| B["Archive iterator"]
B -->|"raw_file_position"| C["Calculate offset & size"]
C -->|"bounds check"| D["Borrow slice from buffer"]
D -->|"write_cas_file"| E["CAFS storage"]
F["Old: per-entry Vec allocation"] -.->|"eliminated"| E
File Changes1. pacquet/crates/tarball/src/lib.rs
|
There was a problem hiding this comment.
Pull request overview
Refactors pacquet's tarball extraction to borrow each regular-file entry's bytes directly from the already-decompressed buffer (via entries_with_seek + raw_file_position) rather than allocating a fresh Vec<u8> and calling read_to_end per entry. The download pipeline already holds the full decompressed tar in memory, so this removes per-entry allocations and copies on the hot extraction path.
Changes:
extract_tarball_entriesnow takes&[u8]and slices the entry payload out of the tar buffer, with explicit error mapping for offset/size overflow and short payloads.- Removed the
MAX_ENTRY_PREALLOC_BYTES/try_reserve/read_to_endmachinery;file_sizeis now derived from the header once and reused for the CAFS attrs. - The call site in
fetch_and_extract_onceand the three tests intests.rswere updated to pass the decompressedVec<u8>/&[u8]directly instead of wrapping inArchive::new(Cursor::new(..)).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pacquet/crates/tarball/src/lib.rs | Switches extraction to slice-based payload borrowing and rewrites surrounding doc comments and error mapping. |
| pacquet/crates/tarball/src/tests.rs | Updates the three extract_tarball_entries tests to the new &[u8] signature and drops the now-unused tar::Archive import. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pacquet/crates/tarball/src/lib.rs (1)
1665-1666: ⚡ Quick winDrop the compressed tarball buffer before walking the archive.
bufferstays alive until the closure returns, so this path still holds both the compressed and decompressed tarball during extraction. Releasing it right afterdecompress_gzip(...)lowers peak RSS on the exact large-package path this PR is optimizing.Proposed change
- let (cas_paths, pkg_files_idx) = { - let tar_data = decompress_gzip(&buffer, package_unpacked_size)?; - extract_tarball_entries(&tar_data, store_dir, ignore_file_pattern.as_deref())? - }; + let (cas_paths, pkg_files_idx) = { + let tar_data = decompress_gzip(&buffer, package_unpacked_size)?; + drop(buffer); + extract_tarball_entries(&tar_data, store_dir, ignore_file_pattern.as_deref())? + };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pacquet/crates/tarball/src/lib.rs` around lines 1665 - 1666, The compressed buffer remains live during extraction; after calling decompress_gzip(...) you should release the original compressed buffer before invoking extract_tarball_entries(...) to reduce peak memory. Specifically, after let tar_data = decompress_gzip(&buffer, package_unpacked_size)?; drop or move out the buffer (e.g. call drop(buffer) or limit its scope) so that buffer is freed prior to calling extract_tarball_entries(&tar_data, store_dir, ignore_file_pattern.as_deref())?. Ensure you reference the existing symbols buffer, tar_data, decompress_gzip, and extract_tarball_entries when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pacquet/crates/tarball/src/lib.rs`:
- Around line 1665-1666: The compressed buffer remains live during extraction;
after calling decompress_gzip(...) you should release the original compressed
buffer before invoking extract_tarball_entries(...) to reduce peak memory.
Specifically, after let tar_data = decompress_gzip(&buffer,
package_unpacked_size)?; drop or move out the buffer (e.g. call drop(buffer) or
limit its scope) so that buffer is freed prior to calling
extract_tarball_entries(&tar_data, store_dir, ignore_file_pattern.as_deref())?.
Ensure you reference the existing symbols buffer, tar_data, decompress_gzip, and
extract_tarball_entries when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: af1b296a-7a65-4cbf-9a07-d0a14274e367
📒 Files selected for processing (2)
pacquet/crates/tarball/src/lib.rspacquet/crates/tarball/src/tests.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Agent
- GitHub Check: Compile & Lint
- GitHub Check: Run benchmark on ubuntu-latest
- GitHub Check: Lint and Test (macos-latest)
- GitHub Check: Lint and Test (ubuntu-latest)
- GitHub Check: Lint and Test (windows-latest)
- GitHub Check: Run benchmark on ubuntu-latest
- GitHub Check: Code Coverage
- GitHub Check: Analyze (javascript)
🧰 Additional context used
📓 Path-based instructions (1)
pacquet/**/*.rs
📄 CodeRabbit inference engine (pacquet/AGENTS.md)
pacquet/**/*.rs: Log emissions are part of matching pnpm — when porting a function that firespnpm:<channel>events throughglobalLogger,logger.debug(...), orstreamParser.write(...), mirror the call site, payload, and ordering so@pnpm/cli.default-reporterparses pacquet's NDJSON the same way
Declare a newtype wrapper for branded string types instead of collapsing the brand into a plainStringor&strin Rust
If upstream TypeScript always validates before construction of a branded string, validate in the Rust wrapper too viaTryFrom<String>and/orFromStrand do not provide an infallible public constructor
If upstream TypeScript never validates a branded string, just brand for type-safety in Rust by exposing an infallibleFrom<String>constructor
If upstream TypeScript occasionally constructs a branded string without validation, exposefrom_str_uncheckedin Rust as an escape hatch alongside the validating constructor
Match upstream serde behavior for branded strings crossing JSON, YAML, or INI boundaries by using#[serde(try_from = "String")]for deserialization and#[serde(into = "String")]for serialization
Derive simple conversions for branded strings using#[derive(derive_more::From)]and#[derive(derive_more::Into)]instead of handwritingimplblocks; use manualimplonly when conversion needs custom logic
Model TypeScript string literal unions (like'auto' | 'always' | 'never') as Rustenums instead of newtype wrappers, since the set of valid values is closed
Treat TypeScript string template literal types (like`${string}@${string}`) the same as branded string types in Rust, using a newtype wrapper with validation
Follow the code style guide inCODE_STYLE_GUIDE.md— imports, modules, naming, ownership and borrowing, parameter type selection, trait bounds, pattern matching,pipe-trait, error handling, test layout, and cloning ofArcandRc
Choose owned vs. borrowed parameters to minimize copies; widen to t...
Files:
pacquet/crates/tarball/src/tests.rspacquet/crates/tarball/src/lib.rs
🧠 Learnings (3)
📚 Learning: 2026-05-20T19:40:55.051Z
Learnt from: zkochan
Repo: pnpm/pnpm PR: 11774
File: pacquet/crates/resolving-deps-resolver/src/resolve_peers.rs:0-0
Timestamp: 2026-05-20T19:40:55.051Z
Learning: In the pacquet Rust code, ensure the semver implementation uses the `node-semver` crate (not `nodejs-semver`). `node-semver`’s public API does not include a `satisfies_with_prerelease`-style method; prerelease-tolerant matching should be implemented inline by first calling `Range::satisfies`, and when it rejects a prerelease version, retry matching against a stripped `MAJOR.MINOR.PATCH` base of the prerelease version.
Applied to files:
pacquet/crates/tarball/src/tests.rspacquet/crates/tarball/src/lib.rs
📚 Learning: 2026-05-22T00:08:44.646Z
Learnt from: zkochan
Repo: pnpm/pnpm PR: 11837
File: pacquet/crates/resolving-npm-resolver/src/pick_package.rs:33-51
Timestamp: 2026-05-22T00:08:44.646Z
Learning: In the pnpm/pnpm repo’s pacquet Rust crates, do not flag Unicode ellipsis characters (U+2026, `…`) in Rust doc comments (`///` / `/** */`) as a lint violation. The pacquet crate’s `dylint.toml` only enables `perfectionist::derive_ordering`, and the Dylint `unicode-ellipsis` rule is not enabled for this project—so `…` in doc comments is an intentional, repo-consistent style.
Applied to files:
pacquet/crates/tarball/src/tests.rspacquet/crates/tarball/src/lib.rs
📚 Learning: 2026-05-20T23:07:58.444Z
Learnt from: zkochan
Repo: pnpm/pnpm PR: 11784
File: pacquet/crates/resolving-deps-resolver/src/hoist_peers.rs:120-133
Timestamp: 2026-05-20T23:07:58.444Z
Learning: When reviewing code in this pacquet Rust port, follow the upstream pnpm compatibility rule: only match pnpm’s behavior exactly. Do not propose review changes that intentionally deviate from pnpm’s documented/observed behavior, even if pnpm appears buggy. If you identify a real bug in pnpm behavior, the review should prioritize fixing it upstream in pnpm first, and avoid implementing a pnpm-behavior workaround here unless the same fix has already landed upstream.
Applied to files:
pacquet/crates/tarball/src/tests.rspacquet/crates/tarball/src/lib.rs
🔇 Additional comments (2)
pacquet/crates/tarball/src/lib.rs (1)
487-666: LGTM!pacquet/crates/tarball/src/tests.rs (1)
833-834: LGTM!Also applies to: 883-884, 930-932
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #12131 +/- ##
==========================================
+ Coverage 87.45% 87.46% +0.01%
==========================================
Files 260 261 +1
Lines 29376 29654 +278
==========================================
+ Hits 25690 25938 +248
- Misses 3686 3716 +30 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Integrated-Benchmark Report (Linux)Scenario: Isolated linker: fresh restore, cold cache + cold store
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 1.70374853402,
"stddev": 0.11053958535518738,
"median": 1.66679458822,
"user": 2.7833623,
"system": 1.93560454,
"min": 1.58600987822,
"max": 1.9415957512200002,
"times": [
1.7371711672199999,
1.6483899042199999,
1.6327107142200001,
1.84892311422,
1.6261256912200002,
1.67155824822,
1.6620309282199999,
1.68296994322,
1.58600987822,
1.9415957512200002
]
},
{
"command": "pacquet@main",
"mean": 1.68652493342,
"stddev": 0.06018480750545312,
"median": 1.6787335362200002,
"user": 2.8432492,
"system": 1.9404551399999999,
"min": 1.62694359122,
"max": 1.8206428002200001,
"times": [
1.7098751892200001,
1.63461128422,
1.68353713122,
1.62802364822,
1.6464015162199999,
1.8206428002200001,
1.70851187622,
1.6739299412200002,
1.7327723562200001,
1.62694359122
]
}
]
}Scenario: Isolated linker: fresh restore, hot cache + hot store
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 0.47040111525999995,
"stddev": 0.016017131252914995,
"median": 0.46892206346,
"user": 0.3552162199999999,
"system": 0.7776565200000001,
"min": 0.45547453746,
"max": 0.5132427054600001,
"times": [
0.5132427054600001,
0.46562346246,
0.47248992346,
0.45849829146000004,
0.46169519446,
0.46913104346,
0.46871308346,
0.46991299946000004,
0.46922991146000004,
0.45547453746
]
},
{
"command": "pacquet@main",
"mean": 0.47258122116,
"stddev": 0.013634593834208092,
"median": 0.46817996146,
"user": 0.35305612,
"system": 0.7821651199999999,
"min": 0.46150947646,
"max": 0.5077127614600001,
"times": [
0.5077127614600001,
0.46464292146,
0.46797773846,
0.46318564546,
0.47754071246,
0.46838218446,
0.46481052046000004,
0.47885054046000003,
0.46150947646,
0.47119971046000003
]
}
]
}Scenario: Isolated linker: fresh install, cold cache + cold store
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 2.04779925864,
"stddev": 0.02606799484525774,
"median": 2.05726639284,
"user": 4.0383606599999995,
"system": 1.9188167399999998,
"min": 2.01511766934,
"max": 2.08384384034,
"times": [
2.08384384034,
2.01589657034,
2.01964782234,
2.06281990134,
2.07017943534,
2.06626753734,
2.02563047334,
2.01511766934,
2.05171288434,
2.06687645234
]
},
{
"command": "pacquet@main",
"mean": 2.09280454324,
"stddev": 0.026068304985572152,
"median": 2.0944044808399997,
"user": 4.111063760000001,
"system": 1.93388454,
"min": 2.06077315434,
"max": 2.13335989234,
"times": [
2.08439739634,
2.11759298934,
2.1104969103399998,
2.1130125403399997,
2.07188174234,
2.06183682434,
2.10441156534,
2.13335989234,
2.06077315434,
2.07028241734
]
}
]
}Scenario: Isolated linker: fresh install, hot cache + hot store
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 1.29409351636,
"stddev": 0.018481476259209262,
"median": 1.29370179206,
"user": 1.7088795600000002,
"system": 1.1125332,
"min": 1.26618463856,
"max": 1.31909688156,
"times": [
1.27904699456,
1.2818701375600001,
1.31909688156,
1.29723588356,
1.3157422565599999,
1.3013145315599999,
1.26618463856,
1.29016770056,
1.27582586556,
1.31445027356
]
},
{
"command": "pacquet@main",
"mean": 1.31570199056,
"stddev": 0.06367138817679749,
"median": 1.2977993745599998,
"user": 1.71469666,
"system": 1.1250215,
"min": 1.27543666056,
"max": 1.49282293156,
"times": [
1.28606975856,
1.29669927256,
1.27701566256,
1.31142200256,
1.27543666056,
1.31400223856,
1.30925824056,
1.49282293156,
1.29889947656,
1.2953936615600001
]
}
]
} |
|
| Branch | pr/12131 |
| Testbed | pacquet |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| isolated-linker.fresh-install.cold-cache.cold-store | 📈 view plot 🚷 view threshold | 2,047.80 ms(-13.05%)Baseline: 2,355.21 ms | 2,826.25 ms (72.46%) |
| isolated-linker.fresh-install.hot-cache.hot-store | 📈 view plot 🚷 view threshold | 1,294.09 ms(-15.39%)Baseline: 1,529.39 ms | 1,835.27 ms (70.51%) |
| isolated-linker.fresh-restore.cold-cache.cold-store | 📈 view plot 🚷 view threshold | 1,703.75 ms(-17.70%)Baseline: 2,070.09 ms | 2,484.10 ms (68.59%) |
| isolated-linker.fresh-restore.hot-cache.hot-store | 📈 view plot 🚷 view threshold | 470.40 ms(-28.42%)Baseline: 657.19 ms | 788.63 ms (59.65%) |
This updates pacquet's tarball extraction path to avoid allocating and filling a fresh
Vec<u8>for every regular-file entry in a tar archive.The download pipeline already buffers and decompresses the full tarball in memory. Instead of calling
read_to_endper entry, extraction now usesentries_with_seek+raw_file_positionto borrow each file payload as a slice of the decompressed tar buffer, then writes that slice directly into CAFS.The change keeps the existing safety checks around archive parsing and path validation, and adds explicit handling for invalid entry offsets/sizes.
Summary by CodeRabbit
Refactor
Tests