perf: port pnpm v11 HTTP/fetch topology and worker-pool sizing

Follow-up to [`investigations/pacquet-macos-perf.md`](https://github.com/pnpm/investigations) after #276, #277, and #279 landed. Fresh comparison against `pnpm/pnpm v11`'s `deps-restorer` + `network/fetch` layer surfaced several concrete deltas where pnpm's current code is doing something deliberate that pacquet isn't. Collected here so we can pick them off as individual PRs.

Pacquet is now ~2× faster than pnpm on the Linux cold-install FrozenLockfile benchmark (2.9 s vs 6.4 s), but [the macOS investigation doc](https://github.com/pnpm/investigations/blob/main/pacquet-macos-perf.md) originally flagged pacquet as ~2× *slower* than pnpm on macOS. Most items below should help narrow the macOS gap specifically — all of them are either network-topology choices pnpm has explicit benchmark evidence for, or CPU/IO cap tunings where pacquet's defaults diverge from upstream.

## Findings

### 1. HTTP/2 is silently enabled in pacquet; pnpm deliberately disables it

**Upstream**: `network/fetch/src/dispatcher.ts:17-22` (pnpm v11):
> Note: we intentionally do NOT enable HTTP/2 (allowH2) or HTTP/1.1 pipelining here. With HTTP/2, undici multiplexes many streams over 1-2 TCP connections sharing a single congestion window. In benchmarks this was slower than opening ~50 independent HTTP/1.1 connections that each get their own congestion window and can saturate bandwidth in parallel.

**Pacquet**: `crates/network/src/lib.rs:50-56` builds a default `reqwest::Client` which negotiates HTTP/2 via ALPN whenever the registry advertises it (registry.npmjs.org does). We're getting the exact topology pnpm measured as slower.

**Fix**: call `.http1_only()` on the `Client::builder()` chain (in reqwest this method takes no argument).

### 2. Concurrent connection cap is too low vs pnpm

**Upstream**: `network/fetch/src/dispatcher.ts:12, 23-24`:
```ts
const DEFAULT_MAX_SOCKETS = 50
setGlobalDispatcher(new Agent({ connections: DEFAULT_MAX_SOCKETS, ... }))
```

**Pacquet**: `crates/network/src/lib.rs:66-68`:
```rust
const MIN_PERMITS: usize = 16;
let semaphore = num_cpus::get().max(MIN_PERMITS).pipe(Semaphore::new);
```

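On a 4-core GHA runner pacquet gets 16 permits, about 1/3 of pnpm's concurrent-fetch budget; a 10-core M3 also gets 16, because the `MIN_PERMITS` floor dominates until a machine has more than 16 cores. Cold installs are network-bound, so under-subscription directly stretches wall time.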

**Fix**: raise the floor to match pnpm's `DEFAULT_MAX_SOCKETS = 50`. Keep a small `num_cpus` influence as a ceiling if we want to stay gentle on very-small machines, but the common case should sit at 50.
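One way the cap could be shaped, as a std-only sketch rather than the actual patch. `PERMITS_PER_CPU`, the reuse of 16 as the small-machine floor, and both function names are assumptions made for illustration; only the value 50 and the current `max(16)` formula come from the code quoted above.

```rust
/// pnpm's `DEFAULT_MAX_SOCKETS`, as quoted from `dispatcher.ts` above.
const MAX_PERMITS: usize = 50;
/// Assumed floor for very small machines; 16 is today's `MIN_PERMITS`.
const MIN_PERMITS: usize = 16;
/// Assumed scaling factor, picked so that 4 cores already reach the cap.
const PERMITS_PER_CPU: usize = 16;

/// Today's behaviour: `num_cpus::get().max(MIN_PERMITS)`.
fn current_fetch_permits(cpus: usize) -> usize {
    cpus.max(MIN_PERMITS)
}

/// Hypothetical replacement: scale with core count, then clamp to [16, 50].
fn proposed_fetch_permits(cpus: usize) -> usize {
    cpus.saturating_mul(PERMITS_PER_CPU)
        .clamp(MIN_PERMITS, MAX_PERMITS)
}
```

With these assumed constants a 4-core runner and a 10-core laptop both land on 50 permits, while a 1-core machine stays at 16. In pacquet the result would feed `Semaphore::new` in place of the current expression.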

### 3. Tarball buffer is grown via doubling instead of pre-allocated from `Content-Length`

**Upstream**: `fetching/tarball-fetcher/src/remoteTarballFetcher.ts:148-164`:
```ts
if (size !== null) {
  // Known size: pre-allocate and copy directly (avoids intermediate array + second copy pass)
  data = Buffer.from(new SharedArrayBuffer(size))
  for await (const chunk of res.body!) {
    data.set(chunk, downloaded)
    downloaded += chunk.byteLength
  }
}
```

Pnpm v11 CHANGELOG note: *"Tarball downloads with known size now pre-allocate memory to avoid double-copy overhead."*

**Pacquet**: `crates/tarball/src/lib.rs:509` does `response_head.bytes().await`. reqwest/hyper internally grows a `BytesMut` by doubling when `Content-Length` isn't used to pre-size — multiple reallocs + copies per tarball × 1352 tarballs.

**Fix**: switch to `bytes_stream()`, check `Content-Length` on the response head, pre-allocate a `BytesMut::with_capacity(len)` when known, and copy chunks in sequentially. Also catches size-mismatch errors (pnpm's `BadTarballError`) that pacquet currently doesn't catch. Note: #278 tried `async-compression` streaming here and reverted because the forced `flate2`/`miniz_oxide` backend was slower than `zune-inflate`. This change is orthogonal — the decompressor still runs synchronously inside `spawn_blocking` on the buffered bytes.
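The buffering logic in that fix can be sketched without any HTTP dependency. The version below is an illustration under assumptions: it uses a plain `Vec<u8>` where the real change would use `BytesMut`, takes chunks from any iterator instead of `bytes_stream()`, and the names `collect_body` and `SizeMismatch` are hypothetical rather than existing pacquet items.

```rust
/// Hypothetical error type standing in for pnpm's `BadTarballError`.
#[derive(Debug, PartialEq)]
struct SizeMismatch {
    expected: usize,
    received: usize,
}

/// Copy `chunks` into one buffer, pre-sized from `content_length` when known.
/// Returns an error if the body is longer or shorter than advertised.
fn collect_body<I, C>(content_length: Option<usize>, chunks: I) -> Result<Vec<u8>, SizeMismatch>
where
    I: IntoIterator<Item = C>,
    C: AsRef<[u8]>,
{
    // Known size: a single allocation up front. Unknown size: start empty
    // and let the vector grow as chunks arrive.
    let mut buf = Vec::with_capacity(content_length.unwrap_or(0));

    for chunk in chunks {
        let chunk = chunk.as_ref();
        if let Some(expected) = content_length {
            let received = buf.len() + chunk.len();
            if received > expected {
                // Body overran the advertised length; stop before copying.
                return Err(SizeMismatch { expected, received });
            }
        }
        buf.extend_from_slice(chunk);
    }

    match content_length {
        // Body ended early.
        Some(expected) if buf.len() != expected => Err(SizeMismatch {
            expected,
            received: buf.len(),
        }),
        _ => Ok(buf),
    }
}
```

Because the overrun check runs before each copy, the pre-sized buffer never has to grow when `Content-Length` is present, and a short or long body surfaces as an error instead of being handed to the decompressor.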

### 4. Post-download concurrency cap is too high for Apple Silicon

**Upstream**: `worker/src/index.ts:71`:
```ts
return Math.max(1, availableParallelism() - 1)
```

**Pacquet**: `crates/tarball/src/lib.rs:38`:
```rust
SEM.get_or_init(|| Semaphore::new(num_cpus::get().saturating_mul(2).max(4)))
```

On a 10P-core M3 that's **20** concurrent post-download bodies; pnpm runs **9**. Each body is CPU-bound (SHA-512 over compressed tarball + gzip inflate + per-file SHA-512) with interleaved FS writes. Over-subscribing on macOS costs more than on Linux — context switches are slower, and P+E core mixing means some tasks land on efficiency cores and stretch the tail.

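The current value was chosen to keep a 2-CPU GHA runner from wedging mid-decompress (#269). `num_cpus::get().saturating_sub(1).max(2)` matches pnpm on machines with three or more cores and keeps a floor of 2 permits (pnpm's own floor is 1), which should still clear the #269 case.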

**Fix**: change the formula, measure on Apple Silicon before/after, confirm the 2-CPU floor still holds.
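To make the comparison concrete, the three formulas can be written as pure functions of the core count. This is a sketch for reasoning about the numbers only; the function names are hypothetical and `cpus` stands in for `num_cpus::get()` or Node's `availableParallelism()`.

```rust
/// Current pacquet cap: `num_cpus::get().saturating_mul(2).max(4)`.
fn current_post_download_permits(cpus: usize) -> usize {
    cpus.saturating_mul(2).max(4)
}

/// Proposed pacquet cap: one fewer than the core count, never below 2.
fn proposed_post_download_permits(cpus: usize) -> usize {
    cpus.saturating_sub(1).max(2)
}

/// pnpm's worker count: `Math.max(1, availableParallelism() - 1)`.
fn pnpm_worker_count(cpus: usize) -> usize {
    cpus.saturating_sub(1).max(1)
}
```

For 10 cores these give 20, 9, and 9 respectively; for 2 cores they give 4, 2, and 1, so the proposed formula differs from pnpm only at the small-machine floor.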

### 5. (Lower-confidence) Software SHA-512 on Apple Silicon

Pnpm's SHA-512 goes through Node's `crypto.hash` → OpenSSL → ARMv8 FEAT_SHA512 hardware instructions.

Pacquet uses `sha2 = "0.10.9"` with no features. The `asm` feature pulls in `sha2-asm`, which historically targeted x86/x86_64 only; aarch64 SHA-512 hardware support in `sha2` 0.10 is inconsistent.

**Fix**: either (a) enable the `asm` feature and verify it activates hardware SHA-512 on aarch64, or (b) swap the per-file / per-tarball hashing to the `ring` crate, which is BoringSSL-derived and definitively exposes ARMv8 FEAT_SHA512. Blocked on a macOS profile run confirming SHA-512 is actually in the hot path; if it isn't, skip.

## Sequencing

Items 1–3 are small, directly translate pnpm's own benchmark-driven decisions, and compose cleanly as a single PR. Item 4 is a one-line change but should be measured before/after on Apple Silicon — the current value was chosen for a Linux CI failure, not a perf decision. Item 5 should wait for a macOS profile to confirm SHA-512 is worth the crate-swap work.
