This repository was archived by the owner on May 14, 2026. It is now read-only.

Tarball-extract path opens a fresh SQLite connection per snapshot; 500+ blocking threads thrash on macOS #263

@zkochan


Summary

On a 1352-snapshot frozen-lockfile install, pacquet is ~12× slower on macOS than on Linux CI even though the same machine runs pnpm ~3× faster than a GH runner does. The gap isn't CPU work; it's a storm of per-tarball StoreIndex::open calls on the write side of the index, paired with an unbounded tokio blocking pool.

Same root shape as #260 (which was the read side, fixed for reads by #261). The write path kept the old pattern.

Data

Integrated benchmark, frozen-lockfile, 1352-snapshot fixture (the one introduced by #262), same harness as CI:

metric               CI (Ubuntu)        Local (M1 Pro, APFS)
pacquet@HEAD wall    1.095 ± 0.009 s    13.235 ± 0.979 s
pacquet@HEAD user    0.40 s             0.62 s
pacquet@HEAD sys     2.59 s             20.87 s
pnpm wall            2.550 ± 0.026 s    0.930 ± 0.048 s
pnpm sys             2.25 s             0.33 s

User time is similar; sys time is ~8× higher on macOS. pnpm doing the same install on the same disk has 0.33 s sys, so the filesystem is fine — this is pacquet-specific.

Forcing each package-import-method doesn't move the number, ruling out the CAS→node_modules linker as the culprit:

method      wall       user      sys
auto        12.40 s    0.61 s    18.05 s
hardlink    13.03 s    0.63 s    18.28 s
copy        14.70 s    0.65 s    19.69 s
clone       12.27 s    0.61 s    20.42 s

ps -M shows pacquet holds 534 threads from t+0.8 s through the whole install, most parked in kevent. That's tokio's blocking pool at its default cap of 512 + workers.

A 5 s sample confirms every tokio-rt-worker stack bottoms out in kevent — threads are spawned, do some work, and park.

Root cause

crates/tarball/src/lib.rs:427 spawns a blocking task per tarball to record the per-package row in the store index:

tokio::task::spawn_blocking(move || -> Result<(), StoreIndexError> {
    let store_index = StoreIndex::open(&v11_dir)?;
    store_index.set(&index_key, &pkg_files_idx)?;
    ...
})

For 1352 tarballs that's 1352 × StoreIndex::open (crates/store-dir/src/store_index.rs:100-130), where each call does:

  1. std::fs::create_dir_all(store_dir)
  2. Connection::open("…/index.db") (sqlite open: stat, open, fstat, pread, fcntl, mmap, plus WAL/SHM sidecar handling for the first writer)
  3. execute_batch of 7 PRAGMAs (busy_timeout=5000, journal_mode=WAL, synchronous=NORMAL, mmap_size=…, cache_size=-32000, temp_store=MEMORY, wal_autocheckpoint=10000) + CREATE TABLE IF NOT EXISTS

Then store_index.set inserts one row. The actual inserts serialize on SQLite's write lock, with contending connections retrying under busy_timeout (the callsite comment acknowledges this), so the per-open setup cost is paid concurrently but the inserts run mostly one at a time. The callsite also explicitly notes "One StoreIndex per spawned task keeps the code lock-free"; that's the pattern this issue is asking to replace.

Why Linux CI doesn't see it

  • ext4 open/fstat/fcntl calls cost a fraction of what the same syscalls cost on APFS.
  • Linux pthreads + epoll handle 500 threads cheaply; XNU mach ports + kqueue charge more.
  • SQLite open on ext4 ≈ 0.5 ms; on this APFS ≈ 5–15 ms. 1352 × (5–15 ms) ≈ 7–20 s, a range that brackets the observed delta (≈ 12 s wall, ≈ 18 s sys).
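
The estimate in the last bullet can be reproduced with a few lines of Rust. This is only a back-of-envelope sum: the per-open latencies are the rough figures quoted above, not new measurements, and `total_open_cost_s` is a helper name made up for this sketch. Because the opens run concurrently, the sum is closer to aggregate time spent in opens than to wall-clock time.

```rust
/// Total time spent in `open` calls, in seconds, for `opens` calls
/// that each take `per_open_ms` milliseconds.
fn total_open_cost_s(opens: u32, per_open_ms: f64) -> f64 {
    f64::from(opens) * per_open_ms / 1000.0
}

fn main() {
    let opens = 1352; // one StoreIndex::open per tarball

    // Rough per-open latencies taken from the bullets above (assumed, not measured here).
    for (label, ms) in [("ext4", 0.5), ("APFS low", 5.0), ("APFS high", 15.0)] {
        println!("{label:>9}: {:.2} s", total_open_cost_s(opens, ms));
    }
}
```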

Why #261 didn't help this path

#261 added StoreIndex::shared_readonly_in for the cache-lookup pass (1352 reads → 1 open). The write path still opens a fresh writable connection per tarball.

Proposed fixes (ordered by expected impact)

  1. Share a single writable StoreIndex across the install. Mirror the pattern from #261 (perf(store-dir): share one read-only StoreIndex across cache lookups) for writes: open once, wrap in Arc<Mutex<StoreIndex>> (or run a single writer task fed by an mpsc::channel), and thread it through download_and_extract_tarball. Collapses 1352 opens to 1. Biggest single win, smallest diff.
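  2. Batch inserts in a transaction. Each standalone insert is its own transaction, so 1352 inserts mean 1352 write-lock acquisitions and 1352 WAL commits. Wrapping the per-package inserts in a single BEGIN IMMEDIATE; … COMMIT; (or small batches) collapses those into one commit per batch. With journal_mode=WAL and synchronous=NORMAL, SQLite normally syncs at checkpoints rather than on every commit, so the saving is mostly per-commit overhead rather than 1352 fsyncs; it is still a plausible hidden APFS amplifier.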
  3. Cap tokio's blocking pool. Set tokio::runtime::Builder::max_blocking_threads(N) with N ≈ 2–4 × CPU count. 534 threads is pathological on macOS regardless of workload; this alone won't close the gap but it's cheap insurance.
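
A minimal sketch of the shape of fix (1), assuming a stand-in `StoreIndex` backed by a `HashMap` rather than SQLite, and plain `std::thread` in place of tokio's blocking pool. The `open(&AtomicUsize)` signature and `install_with_shared_index` are hypothetical names used only to make the open count observable; they are not pacquet's API.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the real StoreIndex: a HashMap instead of SQLite.
struct StoreIndex {
    rows: HashMap<String, String>,
}

impl StoreIndex {
    /// "Opens" the index and bumps `opens` so the caller can count opens.
    fn open(opens: &AtomicUsize) -> Self {
        opens.fetch_add(1, Ordering::SeqCst);
        StoreIndex { rows: HashMap::new() }
    }

    fn set(&mut self, key: &str, value: &str) {
        self.rows.insert(key.to_owned(), value.to_owned());
    }
}

/// Runs `tasks` concurrent writers against one shared index.
/// Returns (rows written, number of opens).
fn install_with_shared_index(tasks: usize) -> (usize, usize) {
    let opens = AtomicUsize::new(0);
    let index = Arc::new(Mutex::new(StoreIndex::open(&opens))); // opened once

    let handles: Vec<_> = (0..tasks)
        .map(|i| {
            let index = Arc::clone(&index);
            thread::spawn(move || {
                // Hold the lock only for the insert itself.
                index.lock().unwrap().set(&format!("pkg-{i}"), "files");
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    let rows = index.lock().unwrap().rows.len();
    (rows, opens.load(Ordering::SeqCst))
}

fn main() {
    let (rows, opens) = install_with_shared_index(64);
    println!("rows = {rows}, opens = {opens}");
}
```

The trade-off is that the code is no longer lock-free, but since the inserts already serialize inside SQLite, a mutex around a single connection mostly makes that serialization explicit.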

(1) alone should get local macOS down to a comparable factor over CI (≈ 2–3×, matching the pnpm ratio). (2) and (3) are additive polish.
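
The single-writer variant of (1) combines naturally with the batching in (2). The sketch below assumes a dedicated writer thread fed by `std::sync::mpsc` and a stand-in `BatchedIndex` that only counts commits; in the real code each `commit_batch` call would correspond to one BEGIN IMMEDIATE … COMMIT on the shared connection. All type and function names here are illustrative, not taken from pacquet.

```rust
use std::sync::mpsc;
use std::thread;

/// One index row: (index key, serialized file list). Illustrative only.
type Row = (String, String);

/// Hypothetical stand-in for a transactional index; counts commits.
#[derive(Default)]
struct BatchedIndex {
    rows: Vec<Row>,
    commits: usize,
}

impl BatchedIndex {
    /// Stores a whole batch under one simulated transaction and empties `batch`.
    fn commit_batch(&mut self, batch: &mut Vec<Row>) {
        if batch.is_empty() {
            return;
        }
        self.rows.append(batch);
        self.commits += 1;
    }
}

/// Writer loop: drains the channel, committing every `batch_size` rows.
fn run_writer(rx: mpsc::Receiver<Row>, batch_size: usize) -> BatchedIndex {
    let mut index = BatchedIndex::default();
    let mut batch = Vec::with_capacity(batch_size);
    for row in rx {
        // The loop ends once every sender has been dropped.
        batch.push(row);
        if batch.len() >= batch_size {
            index.commit_batch(&mut batch);
        }
    }
    index.commit_batch(&mut batch); // flush the final partial batch
    index
}

/// Spawns `tasks` producers and one writer. Returns (rows, commits).
fn install(tasks: usize, batch_size: usize) -> (usize, usize) {
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || run_writer(rx, batch_size));

    let producers: Vec<_> = (0..tasks)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send((format!("pkg-{i}"), String::from("files"))).unwrap();
            })
        })
        .collect();
    drop(tx); // the writer stops when the last producer clone is gone

    for producer in producers {
        producer.join().unwrap();
    }
    let index = writer.join().unwrap();
    (index.rows.len(), index.commits)
}

fn main() {
    let (rows, commits) = install(1352, 64);
    println!("rows = {rows}, commits = {commits}");
}
```

A count-based batch is the simplest policy; a real writer would probably also flush on a timer or at the end of the install so that rows are not held back indefinitely.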

Repro

# bench env (uses the 1352-snapshot fixture from #262)
just integrated-benchmark --show-output --scenario=frozen-lockfile \
    --verdaccio --with-pnpm HEAD main

Bench harness note: verify::ensure_git_repo (tasks/integrated-benchmark/src/verify.rs:17) asserts .git is a directory, which fails when running against a git worktree — pass -R <non-worktree-clone> or fix the harness to accept .git-file worktrees.
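
One possible shape for the harness fix, as a sketch: accept `.git` either as a directory (regular clone) or as a file starting with `gitdir:` (linked worktree). `looks_like_git_checkout` is a hypothetical helper, not the actual signature of verify::ensure_git_repo.

```rust
use std::fs;
use std::path::Path;

/// Returns true when `repo` looks like a git checkout: `.git` is either a
/// directory (regular clone) or a file whose content starts with `gitdir:`
/// (linked worktree).
fn looks_like_git_checkout(repo: &Path) -> bool {
    let dot_git = repo.join(".git");
    match fs::metadata(&dot_git) {
        Ok(meta) if meta.is_dir() => true,
        Ok(meta) if meta.is_file() => fs::read_to_string(&dot_git)
            .map(|content| content.trim_start().starts_with("gitdir:"))
            .unwrap_or(false),
        _ => false,
    }
}

fn main() {
    // Check the current directory; the result depends on where this is run.
    let here = std::env::current_dir().expect("current dir should be readable");
    println!("{}: {}", here.display(), looks_like_git_checkout(&here));
}
```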

Fixture note: pnpm-workspace.yaml's allowBuilds list silences core-js/es5-ext but not fsevents, so pnpm exits 1 on macOS with ERR_PNPM_IGNORED_BUILDS: fsevents@1.2.13. Worth adding fsevents to that list so local macOS runs don't wedge the bench.
