Skip to content

Process-performance improvements#14991

Draft
ConnorBaker wants to merge 6 commits intoNixOS:masterfrom
ConnorBaker:vibe-coding/improvements-for-nix-eval-jobs-upstream
Draft

Process-performance improvements#14991
ConnorBaker wants to merge 6 commits intoNixOS:masterfrom
ConnorBaker:vibe-coding/improvements-for-nix-eval-jobs-upstream

Conversation

@ConnorBaker
Copy link
Copy Markdown
Contributor

@ConnorBaker ConnorBaker commented Jan 13, 2026

Important

I largely vibe-coded this PR. I care far more about solving the problems this PR targets than I do about this PR being merged.

I've split this PR into three (currently) independent branches (based on master, merged in the following order):

I'd like a better abstraction for locking so fetcher and substitution locking don't require their own implementations.

I also still want context for why substitution locking was removed.

Motivation

When Nix is run under multiple processes (like with nix-eval-jobs), certain operations don't employ locking and can be duplicated across processes. With nix-eval-jobs, I most commonly observe the evaluation cache being locked and duplicate fetches and substitutions. This PR sets out to fix these issues.

Context

This PR targets three components.

A gist with some benchmark results (from an earlier version of this work) can be found here: https://gist.github.com/ConnorBaker/9e31d3b08ff6d4ac841928412131fe15.

eval-cache: make SQLite operations process-safe

Refactors eval-cache DB access to use per‑operation transactions with retry/backoff and explicit error handling, so concurrent Nix processes can safely share the same eval cache without SQLITE_BUSY errors or corruption (like is seen with nix-eval-jobs).

fetchers: add process-safe cache to prevent duplicate downloads

Adds inter‑process file locks and double‑checked lookups to the fetcher cache (tarball, git, GitHub/GitLab/Sourcehut archives, etc.), so only one process downloads a given resource while others wait and reuse the cached result. The locked path now respects TTL to avoid stale hits.

store: add process-safe locking to prevent duplicate substitutions

Introduces substitution locks around binary cache downloads, ensuring only one process performs a substitution for a given store path while others wait and reuse the result. Includes timeout controls and cleanup logic for lock files.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@ConnorBaker ConnorBaker self-assigned this Jan 13, 2026
@github-actions github-actions bot added documentation with-tests Issues related to testing. PRs with tests have some priority store Issues and pull requests concerning the Nix store fetching Networking with the outside (non-Nix) world, input locking labels Jan 13, 2026
@ConnorBaker
Copy link
Copy Markdown
Contributor Author

CC @tomberek.

@ConnorBaker ConnorBaker force-pushed the vibe-coding/improvements-for-nix-eval-jobs-upstream branch from 6ce97d5 to 51addfa Compare January 13, 2026 22:56
@ConnorBaker ConnorBaker changed the title Vibe coding/improvements for nix eval jobs upstream Process-safety improvements Jan 13, 2026
@edolstra
Copy link
Copy Markdown
Member

Thanks! A few comments:

Does this actually address safety issues? Optimistically proceeding with idempotent operations should generally be safe (e.g. downloading/substituting twice is wasteful but not a correctness problem). OTOH adding locking adds a lot of complexity, because now you need to worry about processes that are hanging, lock cleanup etc.

What sort of database corruption did you see with the eval cache? The intended behavior is to give up on updating the cache if we get a "busy" error (see 349d2c5).

Substitution used to be locked, but we lost that in aa3bc3d. Before then it was done like this: See here:

nix/src/libstore/build.cc

Lines 3238 to 3243 in 21e9d18

/* Acquire a lock on the output path. */
outputLock = std::make_shared<PathLocks>();
if (!outputLock->lockPaths(singleton<PathSet>(storePath), "", false)) {
worker.waitForAWhile(shared_from_this());
return;
}

If we do reintroduce substitution locking, it should use the same lock file as used by LocalStore::addToStore(), since we don't want two different kinds of store path locks. However architecturally this is tricky because copyStorePath() doesn't know anything about LocalStore internals, so it can't lock paths. Maybe the Store interface needs to expose a method to lock a path.

@ConnorBaker
Copy link
Copy Markdown
Contributor Author

Does this actually address safety issues? Optimistically proceeding with idempotent operations should generally be safe (e.g. downloading/substituting twice is wasteful but not a correctness problem).

That's a good distinction to make, I think I've conflated the two. Is there any way that correctness here would make assumptions about the behavior of the filesystem (e.g., fsync behavior)?

OTOH adding locking adds a lot of complexity, because now you need to worry about processes that are hanging, lock cleanup etc.

That is true, but (and I am biased here) I think that complexity is well worth it for the common case of running multiple Nix processes in parallel (like with nix-eval-jobs). The alternative (as far as I'm aware) is to leave it up to implementors building on top of Nix to somehow shim in and maintain such logic themselves.

What sort of database corruption did you see with the eval cache? The intended behavior is to give up on updating the cache if we get a "busy" error (see 349d2c5).

In hindsight, this could have been any number of things unrelated to multiple processes trying to interact with the database; I've been hacking on several ways of moving the functionality nix-eval-jobs provides back into the Nix CLI and some of them certainly did unsafe things with respect to the evaluation cache. My ZFS setup is also... nonstandard.

Substitution used to be locked, but we lost that in aa3bc3d. Before then it was done like this: See here:

nix/src/libstore/build.cc

Lines 3238 to 3243 in 21e9d18

/* Acquire a lock on the output path. */
outputLock = std::make_shared<PathLocks>();
if (!outputLock->lockPaths(singleton<PathSet>(storePath), "", false)) {
worker.waitForAWhile(shared_from_this());
return;
}

Thank you for the reference! I'll take a look when I get a chance. Do you have any historical context handy about why this functionality was removed?

If we do reintroduce substitution locking, it should use the same lock file as used by LocalStore::addToStore(), since we don't want two different kinds of store path locks. However architecturally this is tricky because copyStorePath() doesn't know anything about LocalStore internals, so it can't lock paths. Maybe the Store interface needs to expose a method to lock a path.

The thing I'm most interested in is deduplication of work across processes (so I guess process-safety is really a misnomer). I thought file-based locking would be a good way to do that; the only other alternative I could think of would be a shared-memory data structure (like through Boost). As you pointed out, introducing locking means introducing concerns about handling and cleaning up locking. Is there a simple way to do this that I'm missing?

Since there are really three different concerns here (allowing multiple Nix processes to interact with the evaluation cache, fetcher deduplication, and substitution deduplication) and the name of this PR is wrong, I'm leaning towards closing this PR and opening separate ones to address each concern individually. Thoughts?

@ConnorBaker ConnorBaker changed the title Process-safety improvements Process-~~safety~~performance improvements Jan 20, 2026
@ConnorBaker ConnorBaker changed the title Process-~~safety~~performance improvements Process-performance improvements Jan 20, 2026
ConnorBaker and others added 6 commits March 2, 2026 00:31
Problem:
When multiple Nix processes access the same flake's eval cache
simultaneously (e.g., via nix-eval-jobs), SQLITE_BUSY errors and
database corruption could occur.

Solution:
Refactor AttrDb to use per-operation IMMEDIATE transactions with retry
logic instead of long-lived transactions. This enables safe concurrent
access from multiple processes.

Key changes:
- Add SQLiteTxnMode enum (Deferred, Immediate)
- Add retrySQLite<T>() helper with exponential backoff; default unlimited and bounded for eval-cache
- Use std::expected<T, CacheError> for explicit error handling
- Graceful degradation when database errors occur

Adapted from aa5378b for upstream API changes.

Tests: Unit tests + concurrent stress tests added

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add variadic template `setAttr` helper to reduce boilerplate in the
  7 simple setter functions (setBool, setInt, setListOfStrings,
  setPlaceholder, setMissing, setMisc, setFailed)
- Simplify `getKey()` by combining identical error-handling branches
- Convert if-else chain to switch statement in `forceValue()` for
  clearer enum dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem:
When multiple Nix processes fetch the same resource simultaneously
(common with nix-eval-jobs), they would all download it independently,
wasting bandwidth and time.

Solution:
Add file-based locking with double-checked locking pattern using flock().
First process to acquire the lock performs the download; others wait and
reuse the cached result.

Key features:
- Lock files in ~/.cache/nix/fetch-locks/ using SHA256 hashes
- New fetch-lock-timeout setting (0 = wait indefinitely)
- Locked lookups respect tarball TTL to avoid stale hits
- Automatic retry with exponential backoff
- Stale lock cleanup (24h threshold)

Fetchers updated: tarball, github, gitlab, sourcehut, git

Warning: flock() does not work reliably on NFS. See release notes.

Tests: Unit tests + timeout tests added

Conflict resolutions vs master:
- git.cc: master refactored isCacheFileWithinTtl() to take a Settings&
  parameter and switched from raw stat() to maybeStat(). Updated
  recheckDoFetch lambda to use the new signatures.
- fetch-settings.hh: master moved tarballTtl from globals.hh into
  fetch-settings.hh. Kept tarballTtl (master) alongside new
  fetchLockTimeout setting.
- tarball.cc: master kept inline download logic in downloadFile() and
  downloadTarball_(). Replaced with cache->lookupOrFetch() / withFetchLock()
  wrappers to coordinate concurrent processes. The locked callbacks
  contain the same download logic, just wrapped.
- unix/pathlocks.cc: master added tryUnlink() and nix::fstat() helpers.
  deleteLockFile: kept branch's conditional-write-after-unlink safety
  (prevents livelock from poisoned stale markers), using raw unlink()
  since tryUnlink() is void and we need the return value. Staleness
  check: used master's nix::fstat() helper, kept branch's extended
  comments explaining why PathLocks uses simpler staleness detection
  than fetch/substitution locks.
- windows/pathlocks.cc: kept branch's conditional DeleteFileW + stale
  marker pattern and correct CloseHandle==0 check (master had ==−1),
  with master's PathFmt wrapper.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem:
When multiple Nix processes substitute the same store path from a binary
cache simultaneously, they would all download the NAR independently.

Solution:
Add file-based locking using flock() to coordinate substitutions. First
process to acquire the lock downloads; others wait and reuse the result.

Key features:
- Lock files in ~/.cache/nix/substitution-locks/
- New substitution-lock-timeout setting (0 = wait indefinitely)
- Windows support via lockFileWithTimeout()
- Stale lock cleanup (24h threshold)

Complements the fetcher lock mechanism for complete process coordination.

Warning: flock() does not work reliably on NFS. See release notes.

Tests: 11 tests covering thread/process contention scenarios
(deterministic sync for process cases)

Conflict resolutions vs master:
- globals.hh: master moved tarballTtl from globals.hh to
  fetch-settings.hh. Dropped the duplicate tarballTtl that the original
  branch carried; kept only the new substitutionLockTimeout setting.
- store-api.cc: master added includes for environment-variables.hh,
  file-system.hh, and store-config-private.hh. Kept all of master's
  includes alongside the new substitution-lock-impl.hh include.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…coding/improvements-for-nix-eval-jobs-upstream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation fetching Networking with the outside (non-Nix) world, input locking store Issues and pull requests concerning the Nix store with-tests Issues related to testing. PRs with tests have some priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants