fix(pacquet/store-dir): close writer×verifier race in verify_file#11825
Conversation
…write_lock The verifier was racing in-flight writers: while `ensure_file` held `cas_write_lock(path)` and was partway through `write_all`, a sibling snapshot's `check_pkg_files_integrity` would stat the same CAS path (no lock), see a partial size, and call `remove_stale_cafs_entry(path)`. The writer's `write_all` then continued against an orphan inode, the writer's `cas_paths` was populated with the now-deleted path, and `link_file` later hit ENOENT — the CI failure shape on #11816 / 0.2.2-7. Fix (Option C): keep `verify_file`'s lock-free fast path (the common case: file unchanged since prior install, `is_modified` false), but acquire `cas_write_lock(path)` before any branch that could call `remove_stale_cafs_entry`. Re-`check_file` under the lock so a writer's full `write_all` lands before we evaluate. Performance: the fast path adds zero overhead. The slow path — files whose mtime is > 100 ms past the recorded `checked_at` — takes one uncontended Mutex acquire per file, sub-microsecond on uncontended locks. Contention only fires when a writer + a verifier hit the same blob simultaneously; the wait is bounded by one `write_all` and trades a millisecond-scale wait for avoiding a network re-fetch. The new integration test in `pacquet-store-dir/tests/` is a deterministic reproducer: it acquires `cas_write_lock` from the test thread (standing in for an in-flight writer), pre-seeds a partial CAS file at the matching path, and runs the verifier in the background. Pre-fix, the verifier unlinks the file while the "writer" is still simulated as in-progress; post-fix, the verifier blocks on the lock until released. To make `cas_write_lock` reachable from the store-dir crate the function was promoted from `fn` to `pub fn` in pacquet-fs.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
📝 WalkthroughWalkthroughThis PR fixes a writer-verifier race condition in CAS file verification by exporting a per-path write lock and updating the verification logic to acquire it before re-statting and potentially deleting modified files. Adds comprehensive concurrency tests validating the locking behavior. ChangesCAS Write Lock and Verification Concurrency
Sequence DiagramsequenceDiagram
participant Verifier
participant MetadataCheck
participant WriteLock
participant FileSystem
Verifier->>MetadataCheck: is_modified check (fast path)
alt File not modified
MetadataCheck-->>Verifier: return early
else File appears modified
Verifier->>WriteLock: acquire cas_write_lock
WriteLock->>MetadataCheck: re-check under lock
MetadataCheck->>FileSystem: stat file
alt File still modified
MetadataCheck-->>Verifier: file changed, proceed with validation
else File no longer modified or missing
FileSystem-->>Verifier: file gone or unchanged
Verifier-->>Verifier: trigger re-fetch
end
Verifier->>WriteLock: release lock
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Micro-Benchmark ResultsLinux |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pacquet/crates/store-dir/tests/verify_writer_race.rs (1)
134-147: ⚡ Quick winAssert the verifier result, not just file survival.
Line 147 throws away
result.passed, so this stays green even if a future regression waits on the lock but still returnspassed = false. For this race fix, the stronger contract is that verification succeeds after the writer commits the final blob.Proposed tightening
- let result_slot: Arc<Mutex<Option<bool>>> = Arc::new(Mutex::new(None)); - let verify_store = StoreDir::new(tmp.path().to_path_buf()); let verify_content = expected_content.clone(); - let result_slot_writer = Arc::clone(&result_slot); let target_for_verifier = target.clone(); let verifier = thread::spawn(move || { verifier_started_tx.send(()).expect("send start"); let pkg_index = make_index("LICENSE", &verify_content); let cache = VerifiedFilesCache::new(); let result = check_pkg_files_integrity(&verify_store, pkg_index, &cache); - // Record whether the file survived the verifier's run. - *result_slot_writer.lock().expect("result mutex") = - Some((target_for_verifier.exists(), result.passed).0); + (target_for_verifier.exists(), result.passed) }); @@ - verifier.join().expect("verifier thread should not panic"); + let (_file_survived_verifier, verification_passed) = + verifier.join().expect("verifier thread should not panic"); let file_exists_after_verify = target.exists(); @@ assert!( file_exists_after_verify, "After the simulated writer released its lock, the verifier should observe \ the final (correct) content and not delete it", ); + assert!( + verification_passed, + "After the writer commits the full blob and releases the lock, verification should pass", + );Also applies to: 173-199
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pacquet/crates/store-dir/tests/verify_writer_race.rs` around lines 134 - 147, The verifier thread currently stores only file existence into result_slot (via target_for_verifier.exists()) and drops result.passed; change the assignment in the verifier closure (where result = check_pkg_files_integrity(...) and you set *result_slot_writer.lock() = Some(...)) to record the verification outcome instead of (or in addition to) existence — e.g., store result.passed (or store a conjunction of exists() && result.passed) so the test asserts the verifier returned passed; apply the same change in the other verifier block mentioned (lines 173–199) to record check_pkg_files_integrity(...)'s result.passed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pacquet/crates/store-dir/tests/verify_writer_race.rs`:
- Around line 134-147: The verifier thread currently stores only file existence
into result_slot (via target_for_verifier.exists()) and drops result.passed;
change the assignment in the verifier closure (where result =
check_pkg_files_integrity(...) and you set *result_slot_writer.lock() =
Some(...)) to record the verification outcome instead of (or in addition to)
existence — e.g., store result.passed (or store a conjunction of exists() &&
result.passed) so the test asserts the verifier returned passed; apply the same
change in the other verifier block mentioned (lines 173–199) to record
check_pkg_files_integrity(...)'s result.passed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: a5d93efa-93f6-412b-9e63-a34fe8d8e744
📒 Files selected for processing (4)
pacquet/crates/fs/src/ensure_file.rspacquet/crates/store-dir/Cargo.tomlpacquet/crates/store-dir/src/check_pkg_files_integrity.rspacquet/crates/store-dir/tests/verify_writer_race.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Run benchmark on ubuntu-latest
- GitHub Check: Lint and Test (macos-latest)
- GitHub Check: Run benchmark on ubuntu-latest
- GitHub Check: Code Coverage
- GitHub Check: Lint and Test (windows-latest)
- GitHub Check: Lint and Test (ubuntu-latest)
- GitHub Check: Compile & Lint
🧰 Additional context used
📓 Path-based instructions (2)
pacquet/**/*.rs
📄 CodeRabbit inference engine (pacquet/AGENTS.md)
pacquet/**/*.rs: When porting a function that firespnpm:<channel>events throughglobalLogger,logger.debug(), orstreamParser.write(), mirror the call site, payload, and ordering so the reporter parses pacquet's NDJSON the same way it parses pnpm's.
Declare a newtype wrapper for branded string types. Do not collapse the brand into a plainStringor&str.
If upstream always validates before construction, validate in pacquet's wrapper too. The wrapper must construct only viaTryFrom<String>and/orFromStr. Do not provide an infallible public constructor.
If upstream never validates, just brand for type-safety. Expose an infallibleFrom<String>(andFrom<&str>when convenient).
If upstream occasionally constructs without validation, exposefrom_str_uncheckedas an escape hatch alongside the validating constructor.
Match upstream serde behavior for branded types that cross JSON, YAML, or INI boundaries. Use#[serde(try_from = "String")]for deserialization and#[serde(into = "String")]for serialization.
Use#[derive(derive_more::From)]and#[derive(derive_more::Into)]for mechanical conversion impls. Fall back to manualimplonly when conversion needs custom logic.
String-literal unions should becomeenums, not newtype wrappers. Model closed sets of valid string values as enums.
Template literal types should be treated as branded strings with validation discipline from rules 2-5.
Choose owned vs. borrowed parameters to minimize copies. Widen to the most encompassing type (&Pathover&PathBuf,&strover&String) when it doesn't force extra copies.
PreferArc::clone(&x)/Rc::clone(&x)overx.clone()for reference-counted types, so the cost is visible at the call site.
Follow Rust API Guidelines for naming conventions.
Do not use star imports inside module bodies. Writeuse super::{Foo, bar}instead ofuse super::*;. Two forms stay allowed: external-crate preludes likeuse rayon::prelude::*;and root-of-module re-...
Files:
pacquet/crates/fs/src/ensure_file.rspacquet/crates/store-dir/src/check_pkg_files_integrity.rspacquet/crates/store-dir/tests/verify_writer_race.rs
pacquet/**/{tests,test}/**/*.rs
📄 CodeRabbit inference engine (pacquet/AGENTS.md)
pacquet/**/{tests,test}/**/*.rs: Prefer real fixtures; reach for the dependency-injection seam only when they can't cover the branch. Most happy paths and error paths should be tested withtempfile::TempDir, the mocked registry, or an integration test that spawns the actual binary.
Use the DI seam — a capability trait on theHostprovider, threaded asSys: <Bounds>— only for branches a real fixture can't reach portably: filesystem error kinds, deterministic time, shared process-global state, or external-service happy paths.
Log before non-assert_eq!assertions,dbg!complex structures, skip logging for simple scalarassert_eq!. Follow the test-logging guidance in CODE_STYLE_GUIDE.md.
Tests must not be tolerant of missing build/runtime environment by silently returning early when a tool isn't found. If the test needs a tool, let it panic when the tool is absent.
Prefer#[cfg_attr(target_os = "windows", ignore = "...")](or#[cfg(unix)]) over a runtime probe-and-skip helper for platform-locked tools.
Port relevant tests from pnpm when porting behavior. Matching test coverage is the easiest way to prove behavioral parity.
When porting behavior from pnpm, consult plans/TEST_PORTING.md before adding ported tests. Useallow_known_failure!at the not-yet-implemented boundary and update checkboxes as items land.
Files:
pacquet/crates/store-dir/tests/verify_writer_race.rs
🧠 Learnings (2)
📚 Learning: 2026-05-20T19:40:55.051Z
Learnt from: zkochan
Repo: pnpm/pnpm PR: 11774
File: pacquet/crates/resolving-deps-resolver/src/resolve_peers.rs:0-0
Timestamp: 2026-05-20T19:40:55.051Z
Learning: In the pacquet Rust code, ensure the semver implementation uses the `node-semver` crate (not `nodejs-semver`). `node-semver`’s public API does not include a `satisfies_with_prerelease`-style method; prerelease-tolerant matching should be implemented inline by first calling `Range::satisfies`, and when it rejects a prerelease version, retry matching against a stripped `MAJOR.MINOR.PATCH` base of the prerelease version.
Applied to files:
pacquet/crates/fs/src/ensure_file.rspacquet/crates/store-dir/src/check_pkg_files_integrity.rspacquet/crates/store-dir/tests/verify_writer_race.rs
📚 Learning: 2026-05-20T23:07:58.444Z
Learnt from: zkochan
Repo: pnpm/pnpm PR: 11784
File: pacquet/crates/resolving-deps-resolver/src/hoist_peers.rs:120-133
Timestamp: 2026-05-20T23:07:58.444Z
Learning: When reviewing code in this pacquet Rust port, follow the upstream pnpm compatibility rule: only match pnpm’s behavior exactly. Do not propose review changes that intentionally deviate from pnpm’s documented/observed behavior, even if pnpm appears buggy. If you identify a real bug in pnpm behavior, the review should prioritize fixing it upstream in pnpm first, and avoid implementing a pnpm-behavior workaround here unless the same fix has already landed upstream.
Applied to files:
pacquet/crates/fs/src/ensure_file.rspacquet/crates/store-dir/src/check_pkg_files_integrity.rspacquet/crates/store-dir/tests/verify_writer_race.rs
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #11825 +/- ##
==========================================
- Coverage 87.58% 87.56% -0.02%
==========================================
Files 204 204
Lines 24126 24134 +8
==========================================
+ Hits 21130 21133 +3
- Misses 2996 3001 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Integrated-Benchmark Report (Linux)Scenario: Frozen Lockfile
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 2.5108529235399994,
"stddev": 0.07278386610851326,
"median": 2.50602192784,
"user": 2.8045107199999997,
"system": 3.8364530399999994,
"min": 2.41995201084,
"max": 2.68037487984,
"times": [
2.52225799784,
2.68037487984,
2.46667402484,
2.55806566384,
2.43895093484,
2.48748832384,
2.41995201084,
2.52272154384,
2.52122908984,
2.4908147658399997
]
},
{
"command": "pacquet@main",
"mean": 2.50246213284,
"stddev": 0.07316369430843171,
"median": 2.47922121734,
"user": 2.8134892199999992,
"system": 3.7427576399999998,
"min": 2.4146320438399997,
"max": 2.65537001484,
"times": [
2.47959889484,
2.65537001484,
2.45072237784,
2.4788435398399997,
2.4146320438399997,
2.5824194448399997,
2.52982296884,
2.52945403984,
2.46484369084,
2.43891431284
]
},
{
"command": "pnpm",
"mean": 4.90615295124,
"stddev": 0.08332186036095118,
"median": 4.899686620340001,
"user": 8.207016020000001,
"system": 4.29235154,
"min": 4.80724024984,
"max": 5.051169515840001,
"times": [
4.96068857784,
5.051169515840001,
4.8224825108400005,
5.01598645384,
4.91961155484,
4.9217723728400005,
4.849334504840001,
4.80724024984,
4.87976168584,
4.83348208584
]
}
]
}Scenario: Frozen Lockfile (Hot Cache)
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 0.7070786985200002,
"stddev": 0.02547465509121021,
"median": 0.70198123722,
"user": 0.41339625999999996,
"system": 1.5909168999999999,
"min": 0.6837631562200001,
"max": 0.7755508942200001,
"times": [
0.7755508942200001,
0.7089839372200001,
0.70286868822,
0.70945463622,
0.6933932552200001,
0.70565551122,
0.7007211802200001,
0.6837631562200001,
0.7010937862200001,
0.6893019402200001
]
},
{
"command": "pacquet@main",
"mean": 0.7534291311200002,
"stddev": 0.07507416882872918,
"median": 0.7373454877200001,
"user": 0.39031276,
"system": 1.6015404,
"min": 0.6835312532200001,
"max": 0.9002825352200001,
"times": [
0.75647686422,
0.7440627362200001,
0.6906408472200001,
0.6835312532200001,
0.6989095072200001,
0.9002825352200001,
0.7652079112200001,
0.6958497422200001,
0.7306282392200001,
0.8687016752200001
]
},
{
"command": "pnpm",
"mean": 2.6407794508199998,
"stddev": 0.1334952063207141,
"median": 2.59132448622,
"user": 3.30970296,
"system": 2.291588,
"min": 2.54296654422,
"max": 2.99601454722,
"times": [
2.69412061722,
2.58087054922,
2.99601454722,
2.54296654422,
2.58362910222,
2.55575121122,
2.59880970722,
2.59343879622,
2.58921017622,
2.67298325722
]
}
]
} |
Summary
Closes the writer×verifier race that surfaces in CI as
failed to import "<store>/v11/files/…": No such file or directory (os error 2). Whileensure_filewas holding the per-pathcas_write_lockand partway throughwrite_all, a sibling snapshot'scheck_pkg_files_integritywould stat the same CAS path (no lock), see a partial size, andremove_stale_cafs_entryit. The writer'swrite_allthen carried on against an orphan inode; once the fd closed, the inode was freed and the install'scas_pathswas pointing at a now-missing dirent.The fix
Acquire
cas_write_lock(path)inverify_filebefore any branch that could callremove_stale_cafs_entry. The lock is the same per-path mutexensure_filealready holds, so a concurrent writer'swrite_allalways lands before the verifier evaluates.Implementation:
verify_file's fast path stays lock-free. When the file's mtime is within the 100 mschecked_atslack (the common case — file untouched since a prior install), no lock is acquired and no fs operations beyond the original stat.remove_stale_cafs_entry) acquires the lock, re-stats under the lock, and only then decides. A file that vanished between the fast-path stat and lock acquisition surfaces as a graceful "cache miss / re-fetch" instead of a panic.pacquet_fs::cas_write_lockwas promoted fromfntopub fnso the store-dir crate can reach it.Performance
write_all(milliseconds), trading a millisecond-scale wait for avoiding the network re-fetch the install would otherwise have to do.Reproducer test
pacquet/crates/store-dir/tests/verify_writer_race.rs— deterministic, fails pre-fix, passes post-fix.The test holds
cas_write_lockfrom the test thread (standing in for an in-flight writer) and pre-seeds a partial CAS file at the matching path. A background thread runscheck_pkg_files_integrityfor a row that references the same path. The assertion checks whether the file still exists while the "writer" holds the lock:I verified the test flips green only because of the fix:
Test plan
cargo nextest run -p pacquet-store-dir— 105 tests pass (incl. 2 new)cargo nextest run -p pacquet-fs— 28 tests passcargo fmt --all -- --check,just lint,just dylint— cleanWritten by an agent (Claude Code, claude-opus-4-7).
Summary by CodeRabbit