Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

refactor(store-dir): batch store-index writes via a single writer task#265

Merged
zkochan merged 3 commits into
mainfrom
refactor/batched-store-index-writer
Apr 24, 2026
Merged

refactor(store-dir): batch store-index writes via a single writer task#265
zkochan merged 3 commits into
mainfrom
refactor/batched-store-index-writer

Conversation

@zkochan

@zkochan zkochan commented Apr 24, 2026

Copy link
Copy Markdown
Member

Port pnpm v11's queueWrites / flush / setRawMany pattern from store/index/src/index.ts. The write path used to do StoreIndex::open + 7-PRAGMA + solo-INSERT per extracted tarball, each one under its own tokio::task::spawn_blocking — so a 1352-snapshot install opened SQLite 1352 times and ballooned tokio's blocking pool on every install. This PR collapses that to one open + one writer thread + N batched transactions.

Design

  • StoreIndex::set_many — wraps a batch in BEGIN IMMEDIATE; INSERT OR REPLACE xN; COMMIT;. One WAL fsync amortizes across the batch, with a rollback on per-batch error so a partial apply never leaves the index in a half-written state. Encoding happens on the caller's thread before the transaction begins, so the writer only pays SQLite work.
  • StoreIndexWriterArc-cloneable handle around a tokio unbounded channel. Producers call queue(key, value) — no SQLite, no spawn_blocking, no per-row lock. A single blocking task drains the channel, collects each non-blocking burst into a batch (cap 256), and flushes via set_many.
  • OrchestrationCreateVirtualStore and InstallWithoutLockfile spawn the writer at the top of the run, thread Arc<StoreIndexWriter> through InstallPackageBySnapshot / InstallPackageFromRegistry / DownloadTarballToStore, drop the orchestration's clone after try_join_all, then await the writer task so the final batch flushes before the install returns.
  • Tarball pathDownloadTarballToStore::run_without_mem_cache drops its per-tarball spawn_blocking for the index write entirely.

Batch errors, JoinErrors, and writer-open failures are all downgraded to warn! and "skip the row", matching the read side's graceful-degradation stance from #261.

Honest framing — opened as refactor, not perf

Context: this came out of investigating #263. On macOS I have not been able to demonstrate a wall-time win from this change — fix vs main cold-install runs are in the same 12–24 s variance band with identical sys time. The measurable bottleneck in pacquet's cold-install sys time sits elsewhere, most likely in tarball extraction / write_cas_file and sha512-per-file, see the comment thread on #263 for the investigation notes and sample profiles.

The change still stands as structural cleanup:

  • Matches pnpm v11's model one-for-one, so pacquet and pnpm use the same write strategy against a shared index.db.
  • Collapses the per-tarball StoreIndex::open (+ 7 PRAGMAs) churn.
  • Removes the per-tarball spawn_blocking for index writes from the hot path.
  • Compounds with whatever fix the real bottleneck eventually gets.

If reviewers would rather hold this until the actual hot path is fixed so the two can land together (the batching framework makes things like #260's per-file stat-storm fix cleaner to wire in), happy to park the branch.

Depends on #264 for anyone wanting to bench this locally on macOS — without that PR the bench silently measures the wrong path.

Test plan

  • cargo check --workspace --all-targets.
  • cargo test -p pacquet-tarball -p pacquet-store-dir -p pacquet-package-manager green (against the prepared pacquet-registry-mock instance).
  • Manual cold install against a real verdaccio exercises both branches (writer Some and writer-task JoinError downgrade) without panicking.
  • CI Integrated-Benchmark — expected to be a wash on Linux (the open cost is small on ext4 and the 1352 threads were cheap there), primary goal is correctness, not wall-time.

Port pnpm v11's `queueWrites` / `flush` / `setRawMany` pattern from
`store/index/src/index.ts` so the per-tarball store-index INSERT
stops opening a fresh SQLite connection and pulling its own
blocking-pool thread.

* Add `StoreIndex::set_many`, which wraps a batch in
  `BEGIN IMMEDIATE; INSERT OR REPLACE x N; COMMIT;`. One WAL fsync
  amortizes across the batch, with a rollback on per-batch error so
  a partial apply never leaves the index in a half-written state.
  Encoding happens on the caller's thread before the transaction
  begins, so the writer only pays SQLite work.
* Introduce `StoreIndexWriter`, an `Arc`-cloneable handle around a
  tokio unbounded channel. Producers call `queue(key, value)` — no
  SQLite, no `spawn_blocking`, no per-row lock. A single blocking
  task drains the channel, collects each non-blocking burst into a
  batch (cap 256), and flushes via `set_many`.
* `CreateVirtualStore` and `InstallWithoutLockfile` spawn the writer
  at the top of the run, thread the `Arc<StoreIndexWriter>` through
  `InstallPackageBySnapshot` / `InstallPackageFromRegistry` /
  `DownloadTarballToStore`, drop the orchestration's clone after
  `try_join_all`, then `await` the writer task so the final batch
  flushes before the install returns.
* `DownloadTarballToStore::run_without_mem_cache` drops its
  per-tarball `spawn_blocking` for the index write entirely.

Batch errors, `JoinError`s, and writer-open failures are all
downgraded to `warn!` and "skip the row", matching the read side's
graceful-degradation stance from #261. Tests pass against the
existing `pacquet-registry-mock` harness.

Context: opened while investigating #263. On macOS I haven't been
able to demonstrate a wall-time win from this change yet — the
measurable bottleneck in pacquet's cold-install sys time sits
elsewhere (most likely tarball extraction / `write_cas_file` and
sha512 per file), see the comment thread on #263 for the
investigation notes and sample profile. The change still stands
as structural cleanup and the pnpm-model alignment, and it will
compound with whatever fix the real bottleneck gets.
@codecov

codecov Bot commented Apr 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.69388% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.50%. Comparing base (edd07c3) to head (1bc861c).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
crates/store-dir/src/store_index.rs 84.90% 8 Missing ⚠️
crates/package-manager/src/create_virtual_store.rs 62.50% 3 Missing ⚠️
...package-manager/src/install_package_by_snapshot.rs 0.00% 2 Missing ⚠️
...es/package-manager/src/install_without_lockfile.rs 90.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #265      +/-   ##
==========================================
- Coverage   89.75%   89.50%   -0.25%     
==========================================
  Files          58       58              
  Lines        4527     4613      +86     
==========================================
+ Hits         4063     4129      +66     
- Misses        464      484      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Apr 24, 2026

Copy link
Copy Markdown

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.23     19.6±0.80ms   221.4 KB/sec    1.00     16.0±0.33ms   271.4 KB/sec

@github-actions

github-actions Bot commented Apr 24, 2026

Copy link
Copy Markdown

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command Mean [s] Min [s] Max [s] Relative
pacquet@HEAD 1.124 ± 0.010 1.108 1.139 1.00 ± 0.02
pacquet@main 1.120 ± 0.015 1.106 1.161 1.00
pnpm 2.565 ± 0.051 2.502 2.660 2.29 ± 0.06
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 1.1243672597800003,
      "stddev": 0.010178035218097145,
      "median": 1.12393394188,
      "user": 0.41155511999999994,
      "system": 2.67024266,
      "min": 1.1078520893800001,
      "max": 1.1393580863800001,
      "times": [
        1.1393580863800001,
        1.11056244938,
        1.12388425938,
        1.13867615238,
        1.12387326738,
        1.12125082138,
        1.1244615233800002,
        1.1078520893800001,
        1.1297703243800001,
        1.12398362438
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 1.12036072658,
      "stddev": 0.015341339975821107,
      "median": 1.11680940788,
      "user": 0.41388971999999996,
      "system": 2.6484198599999997,
      "min": 1.1057865553800001,
      "max": 1.16124444138,
      "times": [
        1.12438208138,
        1.11957954738,
        1.11254007138,
        1.1175660543800001,
        1.11045507338,
        1.1212212133800001,
        1.16124444138,
        1.1057865553800001,
        1.1147794663800001,
        1.11605276138
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 2.56456967258,
      "stddev": 0.05098121894869429,
      "median": 2.5550952903799997,
      "user": 3.1824605199999993,
      "system": 2.2935985600000004,
      "min": 2.5016076903799997,
      "max": 2.6600094723799996,
      "times": [
        2.60640866838,
        2.5505876553799998,
        2.55056304338,
        2.6600094723799996,
        2.5261525993799996,
        2.55960292538,
        2.5029310243799996,
        2.5016076903799997,
        2.5698935223799997,
        2.61794012438
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors pacquet’s store-index write path to avoid per-tarball SQLite opens and per-tarball spawn_blocking, by introducing a single writer task that batches index rows into transactional writes.

Changes:

  • Add StoreIndexWriter (single writer task + batching) and StoreIndex::set_many (transactional batch insert/replace).
  • Thread an optional store_index_writer handle through tarball download/extract and package install flows, and await the writer task at the end of installs.
  • Update benchmarks/tests callsites and add required dependencies (tokio, tracing) to pacquet-store-dir.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tasks/micro-benchmark/src/main.rs Passes new store_index_writer: None field through benchmark setup.
crates/tarball/src/lib.rs Queues index rows to the shared writer instead of per-tarball SQLite writes; wires new field through tests.
crates/store-dir/src/store_index.rs Introduces StoreIndexWriter and StoreIndex::set_many for batched transactional writes.
crates/store-dir/Cargo.toml Adds tokio (sync) and tracing dependencies to support the writer task + logging.
crates/package-manager/src/install_without_lockfile.rs Spawns writer once per run, threads handle through recursion, awaits writer task at end.
crates/package-manager/src/install_package_from_registry.rs Propagates writer handle into tarball download.
crates/package-manager/src/install_package_by_snapshot.rs Propagates writer handle into tarball download for lockfile snapshot installs.
crates/package-manager/src/create_virtual_store.rs Spawns writer once per run, threads handle to snapshot installs, awaits writer task at end.
Cargo.lock Locks new transitive deps from adding tokio + tracing to pacquet-store-dir.
Comments suppressed due to low confidence (1)

crates/tarball/src/lib.rs:454

  • tokio::task::JoinHandle::await can return a JoinError (panic or cancellation), but this path uses expect("no join error"), which will panic the whole install. Since TarballError already has a TaskJoin(JoinError) variant, consider mapping the join error into that variant (or degrading to a cache miss) instead of panicking.
        })
        .await
        .expect("no join error")
        .map_err(|error| match error {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +94 to +107
if let Err(error) = index.set_many(batch.drain(..)) {
// Drop the batch and keep going. One failed flush
// (e.g. a disk-full hiccup) shouldn't silently drop
// the rest of the install's entries; the next install
// will cache-miss those rows and re-populate them,
// matching the "best-effort index" stance the read
// path already takes.
tracing::warn!(
target: "pacquet::store_index",
?error,
"batched store-index write failed; dropping this batch and continuing",
);
batch.clear();
}

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On set_many failure, the writer drops the entire batch (up to MAX_BATCH_SIZE entries). The PR description mentions "skip the row", but this behavior can lose many otherwise-valid rows due to a single transient SQLite error. Consider retrying with smaller batches / per-row fallback (or binary-splitting the batch to isolate a bad row) so failures are localized while still keeping batching for the common case.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look and leaving this as-is for now. The "drop the whole batch" behaviour is the same graceful-degradation stance the read path already takes — a failed flush means those rows won't be in index.db, so the next install sees them as cache-miss, re-downloads, and re-writes on a fresh batch. User-visible consequence: "the store gets re-hydrated on the next install", which is also what happens after any interrupted install today.

Binary-splitting or per-row fallback would isolate a genuinely bad row at the cost of O(log N) extra transactions on every failure, plus book-keeping about which rows were already applied. Happy to add that if/when we see real errors in the wild that aren't "disk full" or "DB corrupted" (both of which won't be helped by smaller batches). For now the simpler fallback seems like the right tradeoff.

Comment thread crates/store-dir/src/store_index.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment on lines +71 to +73
let v11_dir = store_dir.v11();
let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<(String, PackageFilesIndex)>();
let handle = tokio::task::spawn_blocking(move || -> Result<(), StoreIndexError> {

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StoreIndexWriter uses an unbounded mpsc channel, so if producers enqueue faster than SQLite can commit (e.g. slow/durable disk, contention), the queue can grow without bound and cause high memory usage. Consider using a bounded channel (with backpressure or try_send + drop policy) or adding an explicit queue-size cap/metrics to prevent unbounded buffering.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern at scale — keeping it unbounded for now and noting the tradeoff in the design block. Rough numbers for today's workload: PackageFilesIndex is a small struct (a HashMap of CafsFileInfo per file, most entries a few hundred bytes), so worst case per entry is maybe 10 KB and realistic is closer to 1 KB. A 1352-snapshot install with zero drain capacity would sit at ~5 MB peak. The writer drains in bursts of 256 and SQLite transactions commit fast enough in practice that the buffer stays in the low hundreds even on slow disks.

If we ever see memory pressure in a large-monorepo install we can switch to mpsc::channel::<_>(N) and add try_send handling at the producer. That keeps the fast-path identical but gains backpressure on the bad-path.

Comment thread crates/store-dir/src/store_index.rs
zkochan added 2 commits April 24, 2026 13:32
`collect::<Result<_, _>>()?` turned a single `encode_package_files_index`
failure into "drop the whole batch" — up to 256 otherwise-valid rows
lost because of one malformed `PackageFilesIndex`. That's stricter
than the "best-effort index" contract the writer task and the
read-side already follow.

Encode in a loop, log the failing key at `warn!`, and keep the rest
of the batch alive for commit. SQLite errors inside the transaction
still roll back the whole batch (that's how atomicity is supposed
to work); only the per-row encoding step is skipped.

Addresses Copilot review on #265.
…irst failure; fix private-doc-link

Three follow-up nits from the Copilot review on #265:

* The comment in `tarball/src/lib.rs` described #263 as "fixed" — it
  isn't, this PR is one step toward fixing it. Reworded to "the
  per-write blocking thread churn described in #263" so the
  citation survives the landing of both.

* `StoreIndexWriter::queue` logged a `warn!` on every `send`
  failure. When the writer task exits early (open failure, panic)
  every subsequent `queue` call fails — on a 1352-snapshot install
  that's a thousand identical warnings drowning everything else
  out. Add a `warn_on_send_failure: AtomicBool` one-shot guard:
  first failure logs (with "further failures silenced" hint),
  subsequent failures go silent. The observable signal the user
  actually cares about — "this install's rows aren't in the index,
  the next install will re-download" — surfaces on the next run
  regardless.

* `StoreIndexWriter`'s doc block linked to the private constant
  `MAX_BATCH_SIZE`. `RUSTDOCFLAGS=-D warnings` in CI rejects the
  private-intra-doc-link lint, so replace the link with a literal
  "256 entries" plus a `see MAX_BATCH_SIZE` pointer in prose.

Addresses Copilot review on #265.
@zkochan zkochan merged commit 5d06cd5 into main Apr 24, 2026
12 of 14 checks passed
@zkochan zkochan deleted the refactor/batched-store-index-writer branch April 24, 2026 13:06
zkochan added a commit that referenced this pull request Apr 24, 2026
# Conflicts:
#	crates/tarball/src/lib.rs
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants