Process-performance improvements #14991
Conversation
CC @tomberek.
Thanks! A few comments:

Does this actually address safety issues? Optimistically proceeding with idempotent operations should generally be safe (e.g. downloading/substituting twice is wasteful but not a correctness problem). On the other hand, adding locking adds a lot of complexity, because now you need to worry about hanging processes, lock cleanup, etc.

What sort of database corruption did you see with the eval cache? The intended behavior is to give up on updating the cache if we get a "busy" error (see 349d2c5).

Substitution used to be locked, but we lost that in aa3bc3d. Before then it was done like this (Lines 3238 to 3243 in 21e9d18). If we do reintroduce substitution locking, it should use the same lock file as used by
That's a good distinction to make; I think I've conflated the two. Is there any way that correctness here would depend on assumptions about filesystem behavior (e.g., fsync semantics)?
That is true, but (and I am biased here) I think that complexity is well worth it for the common case of running multiple Nix processes in parallel (like with
In hindsight, this could have been any number of things unrelated to multiple processes trying to interact with the database; I've been hacking on several ways of moving the functionality
Thank you for the reference! I'll take a look when I get a chance. Do you have any historical context handy about why this functionality was removed?
The thing I'm most interested in is deduplication of work across processes (so I guess "process-safety" is really a misnomer). I thought file-based locking would be a good way to do that; the only other alternative I could think of was a shared-memory data structure (e.g., via Boost). As you pointed out, introducing locking means introducing concerns about handling and cleaning up locks. Is there a simple way to do this that I'm missing?

Since there are really three different concerns here (allowing multiple Nix processes to interact with the evaluation cache, fetcher deduplication, and substitution deduplication) and the name of this PR is wrong, I'm leaning towards closing this PR and opening separate ones to address each concern individually. Thoughts?
Problem: When multiple Nix processes access the same flake's eval cache simultaneously (e.g., via nix-eval-jobs), SQLITE_BUSY errors and database corruption could occur.

Solution: Refactor AttrDb to use per-operation IMMEDIATE transactions with retry logic instead of long-lived transactions. This enables safe concurrent access from multiple processes.

Key changes:
- Add SQLiteTxnMode enum (Deferred, Immediate)
- Add retrySQLite<T>() helper with exponential backoff; unlimited retries by default, bounded for the eval cache
- Use std::expected<T, CacheError> for explicit error handling
- Graceful degradation when database errors occur

Adapted from aa5378b for upstream API changes.

Tests: unit tests plus concurrent stress tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
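The retry-with-backoff idea described above can be sketched roughly as follows. This is an illustrative stand-in, not the branch's actual code: the real helper would inspect sqlite3 error codes rather than catch a custom exception type, and `SQLiteBusy` here is a hypothetical placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for an SQLITE_BUSY result; real code would
// inspect the sqlite3 return code instead of throwing.
struct SQLiteBusy : std::runtime_error {
    SQLiteBusy() : std::runtime_error("database is busy") {}
};

// Retry `op` with exponential backoff on "busy" errors.
// maxAttempts == 0 means retry indefinitely (per the commit message:
// unlimited by default, bounded for the eval cache).
template<typename T>
T retrySQLite(std::function<T()> op, unsigned maxAttempts = 0)
{
    using namespace std::chrono;
    auto delay = milliseconds(10);
    for (unsigned attempt = 1; ; ++attempt) {
        try {
            return op();
        } catch (SQLiteBusy &) {
            if (maxAttempts != 0 && attempt >= maxAttempts)
                throw; // give up so the caller can degrade gracefully
            std::this_thread::sleep_for(delay);
            delay = std::min(delay * 2, milliseconds(1000)); // cap the backoff
        }
    }
}
```

The bounded mode matches the eval-cache behavior of eventually giving up on a busy database instead of blocking forever.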
- Add variadic template `setAttr` helper to reduce boilerplate in the 7 simple setter functions (setBool, setInt, setListOfStrings, setPlaceholder, setMissing, setMisc, setFailed)
- Simplify `getKey()` by combining identical error-handling branches
- Convert an if-else chain to a switch statement in `forceValue()` for clearer enum dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem: When multiple Nix processes fetch the same resource simultaneously (common with nix-eval-jobs), they would all download it independently, wasting bandwidth and time.

Solution: Add file-based locking with a double-checked locking pattern using flock(). The first process to acquire the lock performs the download; the others wait and reuse the cached result.

Key features:
- Lock files in ~/.cache/nix/fetch-locks/ using SHA256 hashes
- New fetch-lock-timeout setting (0 = wait indefinitely)
- Locked lookups respect the tarball TTL to avoid stale hits
- Automatic retry with exponential backoff
- Stale lock cleanup (24h threshold)

Fetchers updated: tarball, github, gitlab, sourcehut, git

Warning: flock() does not work reliably on NFS. See release notes.

Tests: unit tests plus timeout tests.

Conflict resolutions vs master:
- git.cc: master refactored isCacheFileWithinTtl() to take a Settings& parameter and switched from raw stat() to maybeStat(). Updated the recheckDoFetch lambda to use the new signatures.
- fetch-settings.hh: master moved tarballTtl from globals.hh into fetch-settings.hh. Kept tarballTtl (master) alongside the new fetchLockTimeout setting.
- tarball.cc: master kept inline download logic in downloadFile() and downloadTarball_(). Replaced it with cache->lookupOrFetch() / withFetchLock() wrappers to coordinate concurrent processes. The locked callbacks contain the same download logic, just wrapped.
- unix/pathlocks.cc: master added tryUnlink() and nix::fstat() helpers. In deleteLockFile, kept the branch's conditional-write-after-unlink safety (prevents livelock from poisoned stale markers), using raw unlink() since tryUnlink() is void and we need the return value. For the staleness check, used master's nix::fstat() helper and kept the branch's extended comments explaining why PathLocks uses simpler staleness detection than the fetch/substitution locks.
- windows/pathlocks.cc: kept the branch's conditional DeleteFileW + stale-marker pattern and the correct CloseHandle == 0 check (master had == -1), with master's PathFmt wrapper.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
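The double-checked locking pattern around a fetch can be sketched like this. A minimal POSIX-only illustration: `lookupOrFetchLocked`, `lookupCache`, and `doFetch` are hypothetical names standing in for the branch's real cache lookup and download logic, and error handling is reduced to a fallback.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cassert>
#include <optional>
#include <string>

// Double-checked locking around an expensive fetch, keyed by a lock
// file. `lookupCache` and `doFetch` are stand-ins for the real cache
// lookup and download logic; the return type is whatever the lookup
// returns (e.g. std::optional<Result>).
template<typename Lookup, typename Fetch>
auto lookupOrFetchLocked(const std::string & lockPath,
                         Lookup lookupCache, Fetch doFetch)
    -> decltype(lookupCache())
{
    // First check: fast path, no lock needed if already cached.
    if (auto cached = lookupCache()) return cached;

    int fd = open(lockPath.c_str(), O_CREAT | O_RDWR | O_CLOEXEC, 0600);
    if (fd == -1) return doFetch(); // can't lock: fall back to fetching

    flock(fd, LOCK_EX); // blocks while another process downloads

    // Second check: another process may have fetched while we waited.
    auto result = lookupCache();
    if (!result) result = doFetch();

    flock(fd, LOCK_UN);
    close(fd);
    return result;
}
```

The second lookup after acquiring the lock is what makes only one process do the work: every waiter re-checks the cache before fetching.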
Problem: When multiple Nix processes substitute the same store path from a binary cache simultaneously, they would all download the NAR independently.

Solution: Add file-based locking using flock() to coordinate substitutions. The first process to acquire the lock downloads; the others wait and reuse the result.

Key features:
- Lock files in ~/.cache/nix/substitution-locks/
- New substitution-lock-timeout setting (0 = wait indefinitely)
- Windows support via lockFileWithTimeout()
- Stale lock cleanup (24h threshold)

Complements the fetcher lock mechanism for complete process coordination.

Warning: flock() does not work reliably on NFS. See release notes.

Tests: 11 tests covering thread/process contention scenarios (deterministic sync for the process cases).

Conflict resolutions vs master:
- globals.hh: master moved tarballTtl from globals.hh to fetch-settings.hh. Dropped the duplicate tarballTtl that the original branch carried; kept only the new substitutionLockTimeout setting.
- store-api.cc: master added includes for environment-variables.hh, file-system.hh, and store-config-private.hh. Kept all of master's includes alongside the new substitution-lock-impl.hh include.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
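A timed lock acquisition of the kind described (0 = wait indefinitely) could look roughly like the sketch below. This is an assumption-laden illustration using non-blocking flock() polling; the branch's actual lockFileWithTimeout() (and its Windows counterpart) may be implemented differently.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <thread>

// Try to take an exclusive flock() on `fd`, polling with LOCK_NB until
// `timeoutSec` seconds have passed. timeoutSec == 0 means wait
// indefinitely, matching the substitution-lock-timeout semantics
// described above. Illustrative sketch only.
bool lockFileWithTimeout(int fd, unsigned timeoutSec)
{
    using clock = std::chrono::steady_clock;
    auto deadline = clock::now() + std::chrono::seconds(timeoutSec);
    for (;;) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) return true;
        if (timeoutSec != 0 && clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}
```

Because flock() locks belong to the open file description, two separate open() calls on the same lock file conflict even within one process, which makes the behavior easy to test.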
Important
I largely vibe-coded this PR. I care far more about solving the problems this PR targets than I do about this PR being merged.
I've split this PR into three (currently) independent branches (based on master, merged in the following order):
I'd like a better abstraction for locking so fetcher and substitution locking don't require their own implementations.
I also still want context for why substitution locking was removed.
Motivation
When Nix is run under multiple processes (like with nix-eval-jobs), certain operations don't employ locking and can be duplicated across processes. With nix-eval-jobs, I most commonly observe the evaluation cache being locked and duplicate fetches and substitutions. This PR sets out to fix these issues.

Context
This PR targets three components.
A gist with some benchmark results (from an earlier version of this work) can be found here: https://gist.github.com/ConnorBaker/9e31d3b08ff6d4ac841928412131fe15.
eval-cache: make SQLite operations process-safe
Refactors eval-cache DB access to use per-operation transactions with retry/backoff and explicit error handling, so concurrent Nix processes can safely share the same eval cache without SQLITE_BUSY errors or corruption (as seen with nix-eval-jobs).
fetchers: add process-safe cache to prevent duplicate downloads
Adds inter‑process file locks and double‑checked lookups to the fetcher cache (tarball, git, GitHub/GitLab/Sourcehut archives, etc.), so only one process downloads a given resource while others wait and reuse the cached result. The locked path now respects TTL to avoid stale hits.
store: add process-safe locking to prevent duplicate substitutions
Introduces substitution locks around binary cache downloads, ensuring only one process performs a substitution for a given store path while others wait and reuse the result. Includes timeout controls and cleanup logic for lock files.