refactor(store-dir): batch store-index writes via a single writer task#265
Conversation
Port pnpm v11's `queueWrites` / `flush` / `setRawMany` pattern from `store/index/src/index.ts` so the per-tarball store-index INSERT stops opening a fresh SQLite connection and pulling its own blocking-pool thread. * Add `StoreIndex::set_many`, which wraps a batch in `BEGIN IMMEDIATE; INSERT OR REPLACE x N; COMMIT;`. One WAL fsync amortizes across the batch, with a rollback on per-batch error so a partial apply never leaves the index in a half-written state. Encoding happens on the caller's thread before the transaction begins, so the writer only pays SQLite work. * Introduce `StoreIndexWriter`, an `Arc`-cloneable handle around a tokio unbounded channel. Producers call `queue(key, value)` — no SQLite, no `spawn_blocking`, no per-row lock. A single blocking task drains the channel, collects each non-blocking burst into a batch (cap 256), and flushes via `set_many`. * `CreateVirtualStore` and `InstallWithoutLockfile` spawn the writer at the top of the run, thread the `Arc<StoreIndexWriter>` through `InstallPackageBySnapshot` / `InstallPackageFromRegistry` / `DownloadTarballToStore`, drop the orchestration's clone after `try_join_all`, then `await` the writer task so the final batch flushes before the install returns. * `DownloadTarballToStore::run_without_mem_cache` drops its per-tarball `spawn_blocking` for the index write entirely. Batch errors, `JoinError`s, and writer-open failures are all downgraded to `warn!` and "skip the row", matching the read side's graceful-degradation stance from #261. Tests pass against the existing `pacquet-registry-mock` harness. Context: opened while investigating #263. On macOS I haven't been able to demonstrate a wall-time win from this change yet — the measurable bottleneck in pacquet's cold-install sys time sits elsewhere (most likely tarball extraction / `write_cas_file` and sha512 per file), see the comment thread on #263 for the investigation notes and sample profile. The change still stands as structural cleanup and the pnpm-model alignment, and it will compound with whatever fix the real bottleneck gets.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #265 +/- ##
==========================================
- Coverage 89.75% 89.50% -0.25%
==========================================
Files 58 58
Lines 4527 4613 +86
==========================================
+ Hits 4063 4129 +66
- Misses 464 484 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Micro-Benchmark ResultsLinux |
Integrated-Benchmark Report (Linux)Scenario: Frozen Lockfile
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 1.1243672597800003,
"stddev": 0.010178035218097145,
"median": 1.12393394188,
"user": 0.41155511999999994,
"system": 2.67024266,
"min": 1.1078520893800001,
"max": 1.1393580863800001,
"times": [
1.1393580863800001,
1.11056244938,
1.12388425938,
1.13867615238,
1.12387326738,
1.12125082138,
1.1244615233800002,
1.1078520893800001,
1.1297703243800001,
1.12398362438
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pacquet@main",
"mean": 1.12036072658,
"stddev": 0.015341339975821107,
"median": 1.11680940788,
"user": 0.41388971999999996,
"system": 2.6484198599999997,
"min": 1.1057865553800001,
"max": 1.16124444138,
"times": [
1.12438208138,
1.11957954738,
1.11254007138,
1.1175660543800001,
1.11045507338,
1.1212212133800001,
1.16124444138,
1.1057865553800001,
1.1147794663800001,
1.11605276138
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pnpm",
"mean": 2.56456967258,
"stddev": 0.05098121894869429,
"median": 2.5550952903799997,
"user": 3.1824605199999993,
"system": 2.2935985600000004,
"min": 2.5016076903799997,
"max": 2.6600094723799996,
"times": [
2.60640866838,
2.5505876553799998,
2.55056304338,
2.6600094723799996,
2.5261525993799996,
2.55960292538,
2.5029310243799996,
2.5016076903799997,
2.5698935223799997,
2.61794012438
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
} |
There was a problem hiding this comment.
Pull request overview
Refactors pacquet’s store-index write path to avoid per-tarball SQLite opens and per-tarball spawn_blocking, by introducing a single writer task that batches index rows into transactional writes.
Changes:
- Add
StoreIndexWriter(single writer task + batching) andStoreIndex::set_many(transactional batch insert/replace). - Thread an optional
store_index_writerhandle through tarball download/extract and package install flows, and await the writer task at the end of installs. - Update benchmarks/tests callsites and add required dependencies (
tokio,tracing) topacquet-store-dir.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tasks/micro-benchmark/src/main.rs | Passes new store_index_writer: None field through benchmark setup. |
| crates/tarball/src/lib.rs | Queues index rows to the shared writer instead of per-tarball SQLite writes; wires new field through tests. |
| crates/store-dir/src/store_index.rs | Introduces StoreIndexWriter and StoreIndex::set_many for batched transactional writes. |
| crates/store-dir/Cargo.toml | Adds tokio (sync) and tracing dependencies to support the writer task + logging. |
| crates/package-manager/src/install_without_lockfile.rs | Spawns writer once per run, threads handle through recursion, awaits writer task at end. |
| crates/package-manager/src/install_package_from_registry.rs | Propagates writer handle into tarball download. |
| crates/package-manager/src/install_package_by_snapshot.rs | Propagates writer handle into tarball download for lockfile snapshot installs. |
| crates/package-manager/src/create_virtual_store.rs | Spawns writer once per run, threads handle to snapshot installs, awaits writer task at end. |
| Cargo.lock | Locks new transitive deps from adding tokio + tracing to pacquet-store-dir. |
Comments suppressed due to low confidence (1)
crates/tarball/src/lib.rs:454
tokio::task::JoinHandle::awaitcan return aJoinError(panic or cancellation), but this path usesexpect("no join error"), which will panic the whole install. SinceTarballErroralready has aTaskJoin(JoinError)variant, consider mapping the join error into that variant (or degrading to a cache miss) instead of panicking.
})
.await
.expect("no join error")
.map_err(|error| match error {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if let Err(error) = index.set_many(batch.drain(..)) { | ||
| // Drop the batch and keep going. One failed flush | ||
| // (e.g. a disk-full hiccup) shouldn't silently drop | ||
| // the rest of the install's entries; the next install | ||
| // will cache-miss those rows and re-populate them, | ||
| // matching the "best-effort index" stance the read | ||
| // path already takes. | ||
| tracing::warn!( | ||
| target: "pacquet::store_index", | ||
| ?error, | ||
| "batched store-index write failed; dropping this batch and continuing", | ||
| ); | ||
| batch.clear(); | ||
| } |
There was a problem hiding this comment.
On set_many failure, the writer drops the entire batch (up to MAX_BATCH_SIZE entries). The PR description mentions "skip the row", but this behavior can lose many otherwise-valid rows due to a single transient SQLite error. Consider retrying with smaller batches / per-row fallback (or binary-splitting the batch to isolate a bad row) so failures are localized while still keeping batching for the common case.
There was a problem hiding this comment.
Took a look and leaving this as-is for now. The "drop the whole batch" behaviour is the same graceful-degradation stance the read path already takes — a failed flush means those rows won't be in index.db, so the next install sees them as cache-miss, re-downloads, and re-writes on a fresh batch. User-visible consequence: "the store gets re-hydrated on the next install", which is also what happens after any interrupted install today.
Binary-splitting or per-row fallback would isolate a genuinely bad row at the cost of O(log N) extra transactions on every failure, plus book-keeping about which rows were already applied. Happy to add that if/when we see real errors in the wild that aren't "disk full" or "DB corrupted" (both of which won't be helped by smaller batches). For now the simpler fallback seems like the right tradeoff.
| let v11_dir = store_dir.v11(); | ||
| let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<(String, PackageFilesIndex)>(); | ||
| let handle = tokio::task::spawn_blocking(move || -> Result<(), StoreIndexError> { |
There was a problem hiding this comment.
StoreIndexWriter uses an unbounded mpsc channel, so if producers enqueue faster than SQLite can commit (e.g. slow/durable disk, contention), the queue can grow without bound and cause high memory usage. Consider using a bounded channel (with backpressure or try_send + drop policy) or adding an explicit queue-size cap/metrics to prevent unbounded buffering.
There was a problem hiding this comment.
Valid concern at scale — keeping it unbounded for now and noting the tradeoff in the design block. Rough numbers for today's workload: PackageFilesIndex is a small struct (a HashMap of CafsFileInfo per file, most entries a few hundred bytes), so worst case per entry is maybe 10 KB and realistic is closer to 1 KB. A 1352-snapshot install with zero drain capacity would sit at ~5 MB peak. The writer drains in bursts of 256 and SQLite transactions commit fast enough in practice that the buffer stays in the low hundreds even on slow disks.
If we ever see memory pressure in a large-monorepo install we can switch to mpsc::channel::<_>(N) and add try_send handling at the producer. That keeps the fast-path identical but gains backpressure on the bad-path.
`collect::<Result<_, _>>()?` turned a single `encode_package_files_index` failure into "drop the whole batch" — up to 256 otherwise-valid rows lost because of one malformed `PackageFilesIndex`. That's stricter than the "best-effort index" contract the writer task and the read-side already follow. Encode in a loop, log the failing key at `warn!`, and keep the rest of the batch alive for commit. SQLite errors inside the transaction still roll back the whole batch (that's how atomicity is supposed to work); only the per-row encoding step is skipped. Addresses Copilot review on #265.
…irst failure; fix private-doc-link Three follow-up nits from the Copilot review on #265: * The comment in `tarball/src/lib.rs` described #263 as "fixed" — it isn't, this PR is one step toward fixing it. Reworded to "the per-write blocking thread churn described in #263" so the citation survives the landing of both. * `StoreIndexWriter::queue` logged a `warn!` on every `send` failure. When the writer task exits early (open failure, panic) every subsequent `queue` call fails — on a 1352-snapshot install that's a thousand identical warnings drowning everything else out. Add a `warn_on_send_failure: AtomicBool` one-shot guard: first failure logs (with "further failures silenced" hint), subsequent failures go silent. The observable signal the user actually cares about — "this install's rows aren't in the index, the next install will re-download" — surfaces on the next run regardless. * `StoreIndexWriter`'s doc block linked to the private constant `MAX_BATCH_SIZE`. `RUSTDOCFLAGS=-D warnings` in CI rejects the private-intra-doc-link lint, so replace the link with a literal "256 entries" plus a `see MAX_BATCH_SIZE` pointer in prose. Addresses Copilot review on #265.
Port pnpm v11's
queueWrites/flush/setRawManypattern fromstore/index/src/index.ts. The write path used to doStoreIndex::open+ 7-PRAGMA + solo-INSERT per extracted tarball, each one under its owntokio::task::spawn_blocking— so a 1352-snapshot install opened SQLite 1352 times and ballooned tokio's blocking pool on every install. This PR collapses that to one open + one writer thread + N batched transactions.Design
StoreIndex::set_many— wraps a batch inBEGIN IMMEDIATE; INSERT OR REPLACE xN; COMMIT;. One WAL fsync amortizes across the batch, with a rollback on per-batch error so a partial apply never leaves the index in a half-written state. Encoding happens on the caller's thread before the transaction begins, so the writer only pays SQLite work.StoreIndexWriter—Arc-cloneable handle around a tokio unbounded channel. Producers callqueue(key, value)— no SQLite, nospawn_blocking, no per-row lock. A single blocking task drains the channel, collects each non-blocking burst into a batch (cap 256), and flushes viaset_many.CreateVirtualStoreandInstallWithoutLockfilespawn the writer at the top of the run, threadArc<StoreIndexWriter>throughInstallPackageBySnapshot/InstallPackageFromRegistry/DownloadTarballToStore, drop the orchestration's clone aftertry_join_all, thenawaitthe writer task so the final batch flushes before the install returns.DownloadTarballToStore::run_without_mem_cachedrops its per-tarballspawn_blockingfor the index write entirely.Batch errors,
JoinErrors, and writer-open failures are all downgraded towarn!and "skip the row", matching the read side's graceful-degradation stance from #261.Honest framing — opened as
refactor, notperfContext: this came out of investigating #263. On macOS I have not been able to demonstrate a wall-time win from this change — fix vs main cold-install runs are in the same 12–24 s variance band with identical sys time. The measurable bottleneck in pacquet's cold-install sys time sits elsewhere, most likely in tarball extraction /
write_cas_fileand sha512-per-file, see the comment thread on #263 for the investigation notes and sample profiles.The change still stands as structural cleanup:
index.db.StoreIndex::open(+ 7 PRAGMAs) churn.spawn_blockingfor index writes from the hot path.If reviewers would rather hold this until the actual hot path is fixed so the two can land together (the batching framework makes things like #260's per-file stat-storm fix cleaner to wire in), happy to park the branch.
Depends on #264 for anyone wanting to bench this locally on macOS — without that PR the bench silently measures the wrong path.
Test plan
cargo check --workspace --all-targets.cargo test -p pacquet-tarball -p pacquet-store-dir -p pacquet-package-managergreen (against the preparedpacquet-registry-mockinstance).Someand writer-taskJoinErrordowngrade) without panicking.