libstore: Bit-reproducibly fix darwin Mach-O page hashes after rewriting #15638

Open
ak2k wants to merge 1 commit into NixOS:master from ak2k:darwin-mach-o-page-hash-fixup

@ak2k ak2k commented Apr 8, 2026

Based on the work of @andrewgazelka in #14999. Happy to close this and fold into that branch instead if @andrewgazelka prefers; see the Context section for the technical differences.

Motivation

On darwin, DerivationBuilderImpl::registerOutputs calls RewritingSink to substitute scratch-path bytes in build outputs. When a sibling output is already in the store at build start, its scratch path is a makeFallbackPath-synthesised stand-in, and RewritingSink rewrites those scratch-path bytes to the sibling's final path after the builder exits. The substitution is byte-level and has no knowledge of Mach-O code signatures, but Apple's ld ad-hoc-signs every binary at link time with the linker-signed flag set in LC_CODE_SIGNATURE. The signature covers the very bytes that were just rewritten, so one or more SHA-256 page hashes in the CodeDirectory are stale after the rewrite. At first page-in, the macOS kernel SIGKILLs the process with cs_invalid_page.
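
To make the failure mode concrete, here is a minimal toy model; it is not Nix or kernel code. A buffer is split into 4 KiB pages, each page gets a SHA-256 hash standing in for a CodeDirectory slot, and a same-length substitution is applied afterwards. The store paths and the helper name `page_hashes` are illustrative assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # page size reported for the linker-signed binaries in this PR


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """SHA-256 of each page, standing in for the CodeDirectory code slots."""
    return [hashlib.sha256(data[i:i + page_size]).digest()
            for i in range(0, len(data), page_size)]


# Hypothetical scratch and final paths of equal length (illustration only).
scratch = b"/nix/store/" + b"s" * 32 + b"-demo-doc"
final = b"/nix/store/" + b"f" * 32 + b"-demo-doc"

# A four-page "binary" with the scratch path embedded in page 2.
binary = bytearray(4 * PAGE_SIZE)
start = 2 * PAGE_SIZE + 100
binary[start:start + len(scratch)] = scratch

signed = page_hashes(bytes(binary))                # hashes taken at "link time"
rewritten = bytes(binary).replace(scratch, final)  # byte-level rewrite afterwards

stale = [i for i, (a, b) in enumerate(zip(signed, page_hashes(rewritten))) if a != b]
print(stale)  # [2]
```

Only the page that contains the substituted bytes ends up with a stale hash, which mirrors the single-slot mismatches reported in the reproductions.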

This is the root cause of NixOS/nixpkgs#507531 (fish on nixpkgs-darwin fails to start) and NixOS/nix#6065 (open since 2022 against CA derivations; same mechanism, wider surface).

Context

Add a darwin-only helper, fixupMachoPageHashes, called from the rewriteOutput lambda after movePath and before canonicalisePathMetaData in derivation-builder.cc. The helper recomputes only the affected page-hash slots in place, leaving every other byte of the Mach-O unchanged, including the linker-signed flag, the original 4 KiB page size, the identifier, and every special slot. The fix-up is length-preserving, so the result is bit-identical to a clean build of the same input.
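
The slot-only fix-up can be sketched in Python as follows. This is not the C++ helper added by this PR; it is a hedged illustration that reuses the field offsets from check.py and assumes a thin 64-bit Mach-O with a SHA-256 CodeDirectory. The function name `fixup_page_hashes` is borrowed from the text, and its Python signature is an assumption.

```python
import hashlib
import struct

MH_MAGIC_64 = 0xFEEDFACF
LC_CODE_SIGNATURE = 0x1D
CSSLOT_CODEDIRECTORY = 0
CS_HASHTYPE_SHA256 = 2


def fixup_page_hashes(data: bytearray) -> list[int]:
    """Rewrite stale SHA-256 code slots in place; return the slot indices fixed."""
    if struct.unpack_from("<I", data, 0)[0] != MH_MAGIC_64:
        raise ValueError("sketch handles thin 64-bit Mach-O only")
    ncmds = struct.unpack_from("<I", data, 16)[0]
    off, sig_off = 32, None
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<II", data, off)
        if cmd == LC_CODE_SIGNATURE:
            sig_off = struct.unpack_from("<I", data, off + 8)[0]
            break
        off += cmdsize
    if sig_off is None:
        return []  # unsigned binary: nothing to fix

    # SuperBlob: magic, length, count, then (type, offset) index entries.
    count = struct.unpack_from(">I", data, sig_off + 8)[0]
    cd_off = None
    for i in range(count):
        typ, rel = struct.unpack_from(">II", data, sig_off + 12 + 8 * i)
        if typ == CSSLOT_CODEDIRECTORY:
            cd_off = sig_off + rel
            break
    if cd_off is None:
        return []

    hash_off = struct.unpack_from(">I", data, cd_off + 16)[0]
    n_slots, code_limit = struct.unpack_from(">II", data, cd_off + 28)
    hash_size, hash_type, _platform, page_shift = struct.unpack_from("BBBB", data, cd_off + 36)
    if hash_type != CS_HASHTYPE_SHA256 or hash_size != 32:
        raise ValueError("sketch handles SHA-256 CodeDirectories only")

    page = 1 << page_shift
    fixed = []
    for i in range(n_slots):
        digest = hashlib.sha256(data[i * page:min((i + 1) * page, code_limit)]).digest()
        slot = cd_off + hash_off + 32 * i
        if data[slot:slot + 32] != digest:
            data[slot:slot + 32] = digest  # same length, so no other byte moves
            fixed.append(i)
    return fixed
```

Because each slot is overwritten with a digest of the same size, the buffer length and every other field (flags, page size, identifier, special slots) stay untouched, which is the property the description relies on.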

The single call site covers both InputAddressed and CAFloating/CAFixed/Impure visitors, since they all go through the shared rewriteOutput lambda.

Differences from #14999

| Aspect | #14999 (darwin-codesign.cc) | This PR |
| --- | --- | --- |
| Call-site coverage | CA visitor only | Shared rewriteOutput lambda → both IA and CA |
| Bit-reproducibility | No (three structural differences below) | Yes (slot-only rewrite preserves every structural field) |
| CodeDirectory flags | linker-signed cleared by codesign -s - | Preserved |
| Special-slot layout | codesign adds special slots (non-minimal CodeDirectory) | Preserved (zero special slots, matching linker-signed) |
| Default page size | 16 KiB on arm64 (fixable via codesign -P 4096) | Preserved (4 KiB) |
| External dependencies | Forks /usr/bin/codesign per Mach-O file | In-process, no subprocess |

Verification

Run against a patched daemon on aarch64-darwin macOS 26.2, using a standalone synthetic reproducer (a multi-output stdenv.mkDerivation whose bin/hello embeds ${builtins.placeholder "doc"}).

Verification script (check.py, Python stdlib only)

The reproductions below use a small Python script to report nCodeSlots, pageSize, codeLimit, and per-slot SHA-256 mismatches against LC_CODE_SIGNATURE.CodeDirectory. It is standalone Python 3 with no third-party dependencies, so it runs on any macOS with system Python. Save as check.py and invoke as python3 check.py <binary>.

#!/usr/bin/env python3
import hashlib, struct, sys
data = open(sys.argv[1], "rb").read()
ncmds = struct.unpack_from("<I", data, 16)[0]
lc_off = 32
sig_off = sig_size = None
for _ in range(ncmds):
    cmd, cmdsize = struct.unpack_from("<II", data, lc_off)
    if cmd == 0x1d:  # LC_CODE_SIGNATURE
        sig_off, sig_size = struct.unpack_from("<II", data, lc_off + 8)
        break
    lc_off += cmdsize
if sig_off is None:
    print("no LC_CODE_SIGNATURE"); sys.exit(2)
blob = data[sig_off:sig_off+sig_size]
def u32be(off): return struct.unpack_from(">I", blob, off)[0]
sb_count = u32be(8)
cd_rel = next(u32be(16+i*8) for i in range(sb_count) if u32be(12+i*8) == 0)
cd = blob[cd_rel:]
hashOffset = struct.unpack_from(">I", cd, 16)[0]
n = struct.unpack_from(">I", cd, 28)[0]
ps = 1 << cd[39]
limit = struct.unpack_from(">I", cd, 32)[0]
print(f"nCodeSlots={n} pageSize={ps} codeLimit=0x{limit:x}")
mismatches = [(i, i*ps) for i in range(n)
              if cd[hashOffset+i*32:hashOffset+(i+1)*32] !=
                 hashlib.sha256(data[i*ps:min((i+1)*ps,limit)]).digest()]
print(f"{len(mismatches)}/{n} mismatches")
for i, off in mismatches[:5]:
    print(f"  page {i} @ 0x{off:08x}")

Assumes a thin (single-arch) Mach-O; none of the reproducers below produce fat binaries on aarch64-darwin.
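
For completeness, a fat (universal) file could be split into its per-architecture slices before running the same checks. The sketch below is not part of check.py or of this PR; the function name `macho_slices` is an assumption, while the header layouts are those of `fat_header`, `fat_arch`, and `fat_arch_64` from `<mach-o/fat.h>`.

```python
import struct

FAT_MAGIC = 0xCAFEBABE     # followed by 32-bit fat_arch entries
FAT_MAGIC_64 = 0xCAFEBABF  # followed by 64-bit fat_arch_64 entries


def macho_slices(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, size) per architecture slice; a thin file is one slice."""
    magic, nfat = struct.unpack_from(">II", data, 0)  # fat headers are big-endian
    if magic == FAT_MAGIC:
        # fat_arch: cputype, cpusubtype, offset, size, align (five u32, 20 bytes)
        return [struct.unpack_from(">II", data, 8 + 20 * i + 8) for i in range(nfat)]
    if magic == FAT_MAGIC_64:
        # fat_arch_64: cputype, cpusubtype (u32), offset, size (u64),
        # align, reserved (u32), 32 bytes in total
        return [struct.unpack_from(">QQ", data, 8 + 32 * i + 8) for i in range(nfat)]
    return [(0, len(data))]
```

Offsets inside a slice are relative to the start of that slice, so each `data[offset:offset + size]` could be parsed exactly like a thin file.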

IA cold build (patched daemon)
$ nix --extra-experimental-features "nix-command flakes" \
      build --no-link --print-out-paths .#hello-multi
this derivation will be built:
  /nix/store/n7p541aylf48lacna2il5vmk80fk9vry-hello-multi-1.0.drv
these 2 paths will be fetched (16.5 KiB download, 1.5 MiB unpacked):
  /nix/store/bvrbfzyimpjxwn679252bhbbccnb43nr-gawk-5.3.2
  /nix/store/4c9ajc7qmxd8kaanj1c9v0fbi60bn805-stdenv-darwin
copying path '/nix/store/bvrbfzyimpjxwn679252bhbbccnb43nr-gawk-5.3.2' from 'https://cache.numtide.com'...
copying path '/nix/store/4c9ajc7qmxd8kaanj1c9v0fbi60bn805-stdenv-darwin' from 'https://cache.nixos.org'...
building '/nix/store/n7p541aylf48lacna2il5vmk80fk9vry-hello-multi-1.0.drv'...
/nix/store/ncyd8r6v7d245vijzr259h9mf4f7mvrs-hello-multi-1.0

$ codesign --verify /nix/store/ncyd8r6v7d245vijzr259h9mf4f7mvrs-hello-multi-1.0/bin/hello
# rc=0, valid

$ /nix/store/ncyd8r6v7d245vijzr259h9mf4f7mvrs-hello-multi-1.0/bin/hello
hello, docs at /nix/store/crpy7nzwzlwv7l112w9w2pk2v4wbpa8w-hello-multi-1.0-doc/share/doc/hello, self at /nix/store/ncyd8r6v7d245vijzr259h9mf4f7mvrs-hello-multi-1.0/bin/hello
# rc=0

$ python3 check.py .../bin/hello
nCodeSlots=13 pageSize=4096 codeLimit=0xc160
0/13 mismatches

$ codesign -dvvv .../bin/hello
Format=Mach-O thin (arm64)
CodeDirectory v=20400 size=520 flags=0x20002(adhoc,linker-signed) hashes=13+0 location=embedded
Signature=adhoc
IA --rebuild (the bug trigger; patched daemon — full bit-reproducibility achieved)
$ nix --extra-experimental-features "nix-command flakes" \
      build --rebuild --keep-failed --no-link .#hello-multi
checking outputs of '/nix/store/n7p541aylf48lacna2il5vmk80fk9vry-hello-multi-1.0.drv'...
# rc=0  ← no determinism error, no .check produced

$ ls /nix/store/ncyd8r6v7d245vijzr259h9mf4f7mvrs-hello-multi-1.0.check
ls: cannot access '...check': No such file or directory

The .check directory does not exist. The rebuilt NAR is byte-identical to the cold build, so Nix's determinism check passes silently and --rebuild returns 0 without producing a .check divergence directory. The rewrite ran (outputRewrites was populated because doc was already in the store from the cold build), the helper recomputed the affected page hashes in place, and the result is byte-identical because the helper preserves every other byte of the Mach-O including the LC_UUID payload.

CA cold build (patched daemon — closes #6065)

The CA path triggers the bug on the cold build itself because CA derivations always go through the rewrite path (the scratch hash and the final content-addressed hash always differ). No --rebuild needed.

$ nix --extra-experimental-features "nix-command flakes ca-derivations" \
      build --no-link --print-out-paths --file ./ca.nix
this derivation will be built:
  /nix/store/wh94xbqrn6p17302kqccvmqbp61gxsld-hello-multi-ca-1.0.drv
building '/nix/store/5v3mxyj4gd7kb4m8g853kki7vl26194x-hello-multi-ca-1.0.drv'...
/nix/store/pfsyfh6a64krkc0h7v24gcv52xhzhmjq-hello-multi-ca-1.0

$ codesign --verify /nix/store/pfsyfh6a64krkc0h7v24gcv52xhzhmjq-hello-multi-ca-1.0/bin/hello
# rc=0, valid

$ /nix/store/pfsyfh6a64krkc0h7v24gcv52xhzhmjq-hello-multi-ca-1.0/bin/hello
hello, docs at /nix/store/hvjwn0vfad9acwapbw9pqzfcdvza8ysd-hello-multi-ca-1.0-doc/share/doc/hello, self at /nix/store/pfsyfh6a64krkc0h7v24gcv52xhzhmjq-hello-multi-ca-1.0/bin/hello
# rc=0

$ python3 check.py .../bin/hello
nCodeSlots=13 pageSize=4096 codeLimit=0xc160
0/13 mismatches

$ codesign -dvvv .../bin/hello
Format=Mach-O thin (arm64)
CodeDirectory v=20400 size=520 flags=0x20002(adhoc,linker-signed) hashes=13+0 location=embedded
Signature=adhoc

Same pageSize=4096, same nCodeSlots=13, same linker-signed flag, same adhoc signature format as the IA case. The "single call site covers both IA and CA" claim is empirically tested, not a code-reading argument. This is the variant that directly exercises #6065, which was filed against CA derivations specifically.

Negative control: unpatched daemon (Nix 2.24.10 — has the bug)

To prove the fix is real, the same reproducer was run via an unpatched nix-daemon (Nix 2.24.10). The cold build succeeds (cold builds never trigger the bug), but --rebuild materialises the corrupted binary in a .check directory:

$ nix --extra-experimental-features "nix-command flakes" \
      build --no-link --print-out-paths .#hello-multi
/nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0
# Cold build: codesign rc=0, run rc=0  ← cold build never triggers the bug

$ nix --extra-experimental-features "nix-command flakes" \
      build --rebuild --keep-failed --no-link .#hello-multi
checking outputs of '/nix/store/b2lmrkivjmk6ypzm9h4qp2ji1xaw0y79-hello-multi-1.0.drv'...
note: keeping build directory '/private/tmp/nix-build-hello-multi-1.0.drv-1'
error: derivation '/nix/store/b2lmrkivjmk6ypzm9h4qp2ji1xaw0y79-hello-multi-1.0.drv' may not be deterministic: output '/nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0' differs from '/nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0.check'

$ codesign --verify /nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0.check/bin/hello
/nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0.check/bin/hello: invalid signature (code or signature have been modified)
In architecture: arm64
# rc=1

$ /nix/store/aa4wzgjqd9gbzjq10z29acvy5w4vmlhi-hello-multi-1.0.check/bin/hello
# rc=137  ← SIGKILL by macOS kernel (cs_invalid_page)

$ python3 check.py .../check/bin/hello
nCodeSlots=13 pageSize=4096 codeLimit=0xc160
1/13 mismatches
  page 3 @ 0x00003000

Same nCodeSlots=13 and same affected page (slot 3, file offset 0x3000) as the patched daemon's cold build. This confirms three things in one negative control: (1) the bug fires reliably on Nix 2.24.10; (2) the rewrite affects exactly one page, exactly the page our analysis predicted; (3) the patched daemon's cold-build slot layout is bit-identical to the unpatched daemon's — the helper preserves the original CodeDirectory structure entirely.

Reproduction without --check on production fish (no determinism check involved)

This demonstrates that the mechanism fires on an ordinary nix build whenever a sibling output is present in the store at build start. No --check, no --rebuild, no special flags — nix-store --delete <fish-out> removes the binary while leaving fish-doc in the store, then an ordinary nix build reproduces the corruption.

$ nix-store --delete /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1

$ nix build --no-link --print-out-paths --option substitute false \
      'github:nixos/nixpkgs/d96b37bbeb9840f1c0ebfe90585ef5067b69bbb3#fish'

$ codesign --verify /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish
/nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish: invalid signature (code or signature have been modified)
In architecture: arm64

$ /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish --version
zsh: killed   /nix/store/gngn7y9m.../bin/fish --version
# rc=137 — SIGKILL by macOS kernel (cs_invalid_page in system log)

$ python3 check.py /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish
nCodeSlots=2526 pageSize=4096 codeLimit=0x9dd790
1/2526 mismatches
  page 1872 @ 0x00750000

The trigger is fish-doc being in the store at build start: outputRewrites is populated with fish-doc's scratch→final hash mapping, RewritingSink substitutes those bytes inside bin/fish's __TEXT,__cstring page at file offset 0x750000, and the corresponding page hash slot in LC_CODE_SIGNATURE.CodeDirectory (slot 1872, the only one of 2526 that mismatches) becomes stale. The kernel SIGKILLs at first page-in.
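
The slot numbers quoted here follow directly from the page size. As a small worked check (the helper names are illustrative, not from Nix or from check.py), the slot index is the file offset divided by the page size, and the slot count is the code limit divided by the page size, rounded up:

```python
PAGE_SIZE = 4096


def slot_for_offset(file_offset: int, page_size: int = PAGE_SIZE) -> int:
    """Index of the code slot whose page contains file_offset."""
    return file_offset // page_size


def slot_count(code_limit: int, page_size: int = PAGE_SIZE) -> int:
    """Number of code slots needed to cover code_limit bytes (ceiling division)."""
    return -(-code_limit // page_size)


# fish: mismatch reported at 0x00750000, codeLimit=0x9dd790
print(slot_for_offset(0x750000))  # 1872
print(slot_count(0x9DD790))       # 2526

# hello-multi: mismatch reported at 0x00003000, codeLimit=0xc160
print(slot_for_offset(0x3000))    # 3
print(slot_count(0xC160))         # 13
```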

Reproduction without --check on production fish, with the patched daemon (positive result; closes nixpkgs#507531 on the real package)

Same sequence as the previous block, same nixpkgs revision, same trigger condition (fish-doc is in the store at build start; the bin output is deleted). The only variable is the daemon, which carries this PR's commit 883e43319.

$ nix-store --delete /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1

$ nix build --no-link --print-out-paths --option substitute false \
      'github:nixos/nixpkgs/d96b37bbeb9840f1c0ebfe90585ef5067b69bbb3#fish'
/nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1

$ /usr/bin/codesign --verify /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish
# rc=0, valid

$ /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish --version
fish, version 4.2.1
# rc=0

$ python3 check.py /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish
nCodeSlots=2526 pageSize=4096 codeLimit=0x9dd790
0/2526 mismatches

$ /usr/bin/codesign -dvvv /nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1/bin/fish
Format=Mach-O thin (arm64)
CodeDirectory v=20400 size=80936 flags=0x20002(adhoc,linker-signed) hashes=2526+0 location=embedded
Signature=adhoc

Where the previous block produced 1/2526 mismatches at page 1872 and a SIGKILL at page-in, this block produces 0/2526 and a runnable binary. The CodeDirectory has flags=0x20002 (adhoc,linker-signed) (only ld can set the linker-signed bit; codesign -f -s - clears it), hashes=2526+0 (minimal special-slot layout matching ld's original, not the non-minimal layout that codesign -f -s - produces), and pageSize=4096 (matching ld -adhoc_codesign's default). Same store path, same nixpkgs revision, same trigger condition — the only variable is the daemon.
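
The flags value can be decoded by hand. The two constants below are the `CS_ADHOC` and `CS_LINKER_SIGNED` bits from XNU's code-signing headers; the helper `decode_cs_flags` is an illustrative assumption and only knows about these two bits.

```python
CS_ADHOC = 0x00000002
CS_LINKER_SIGNED = 0x00020000


def decode_cs_flags(flags: int) -> list[str]:
    """Name the two bits discussed here; report any other bits as hex."""
    names = []
    if flags & CS_ADHOC:
        names.append("adhoc")
    if flags & CS_LINKER_SIGNED:
        names.append("linker-signed")
    rest = flags & ~(CS_ADHOC | CS_LINKER_SIGNED)
    if rest:
        names.append(hex(rest))
    return names


print(decode_cs_flags(0x20002))  # ['adhoc', 'linker-signed']
print(decode_cs_flags(0x2))      # ['adhoc']
```

The second call shows how the same value would read with only the linker-signed bit cleared, which is the difference the text attributes to a codesign re-sign.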

A note on byte-level non-determinism across rebuilds. Independent patched-daemon rebuilds of fish 4.2.1 produce distinct NAR hashes, because fish's own build is not bit-reproducible across runs independent of this fix. On the host where this block was produced, two cold builds with no sibling in the store — builds where the post-rewrite helper never runs — differ from each other by hundreds of thousands of bytes, almost entirely in Rust-generated codegen sections and Apple ld's per-invocation LC_UUID payload. The helper is a surgical in-place update to the CodeDirectory's page hash slots and cannot introduce variability to code pages, DWARF sections, or LC_UUID. The bit-reproducibility claim of this PR is demonstrated against the reproducible synthetic hello-multi reproducer in the blocks above (where no Rust/sphinx variability applies); fish is the runtime-correctness validation against the actual package from the original reports.

Source-level trace (Nix master a37db9d24, the base commit of this PR)
  1. The rewriteOutput lambda, derivation-builder.cc:1634-1662. Wraps RewritingSink around dumpPath(actualPath), streams through restorePath into <path>.tmp, then deletePath, movePath, and canonicalisePathMetaData. Nothing in this lambda has any awareness of Mach-O or code signatures; the substitution is byte-level and format-agnostic.

  2. The RewritingSink::operator() implementation, references.cc:72-89. Pure rewriteStrings(s, rewrites) with an equal-length assertion at construction; no file-format awareness, no signing knowledge.

  3. The InputAddressed visitor, derivation-builder.cc:1784-1788:

    rewriteOutput(outputRewrites);                       // byte substitution runs
    HashResult narHashAndSize = hashPath(                // hashes corrupt bytes
        {getFSSourceAccessor(), CanonPath(actualPath.native())},
        FileSerialisationMethod::NixArchive,
        HashAlgorithm::SHA256);

    hashPath is the very next call after rewriteOutput, with nothing between them that could re-sign the file. Any corruption introduced by the substitution is locked into the NAR hash on the next line.

  4. The CA visitor path, newInfoFromCA at derivation-builder.cc:1689, with the first rewriteOutput call at line 1705. Same lambda, same byte substitution, reached via the CAFloating / CAFixed / Impure visitors.

Because (1) is the shared lambda called from both (3) and (4), the helper this PR adds is inserted once inside (1) and covers both IA and CA visitor paths in a single call site.
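
The equal-length property from step 2 is what makes a slot-only repair possible. The sketch below is a simplified, non-streaming Python stand-in for the substitution, not the C++ in references.cc; the hash values are hypothetical, and the length check runs per call rather than at construction.

```python
def rewrite_strings(data: bytes, rewrites: dict[bytes, bytes]) -> bytes:
    """Apply every old -> new substitution; refuse any that would change length."""
    for old, new in rewrites.items():
        if len(old) != len(new):
            raise ValueError("rewrites must be length-preserving")
    for old, new in rewrites.items():
        data = data.replace(old, new)
    return data


# Hypothetical 32-character hash parts (illustration only).
scratch_hash = b"s" * 32
final_hash = b"f" * 32

before = b"HEADER" + b"/nix/store/" + scratch_hash + b"-demo-doc\0" + b"TRAILER"
after = rewrite_strings(before, {scratch_hash: final_hash})

print(len(after) == len(before))                            # True
print(after.index(b"TRAILER") == before.index(b"TRAILER"))  # True
print(after == before)                                      # False
```

Since the file size and every offset are unchanged, the load commands and the location of LC_CODE_SIGNATURE remain valid after the rewrite; only the hashes of the touched pages are out of date.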

On why the mechanism fires on darwin and not Linux: needsHashRewrite() returns true unconditionally in the base class — on Linux with chroot it's overridden to return false and outputRewrites stays empty for already-known paths. On darwin, it returns true, so whenever a sibling output is already present in the store at build start, that output's scratch path becomes a makeFallbackPath-synthesised stand-in, populating outputRewrites with a scratchHash → finalHash entry. When the other output runs through the rewriteOutput lambda above, that entry is what gets substituted into its binary — and when that binary is a ld -adhoc_codesign-signed Mach-O, the substituted bytes fall inside a page already covered by a linker-signed SHA-256 hash in LC_CODE_SIGNATURE.
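
The trigger condition for input-addressed outputs can be summarised as a toy model. This is a deliberate simplification of the paragraph above, not the logic in derivation-builder.cc: `plan_output_rewrites` and `fallback_hash` are hypothetical names, and `fallback_hash` is only a stand-in for whatever makeFallbackPath computes. Content-addressed outputs, which always go through the rewrite, are not modelled.

```python
import hashlib


def fallback_hash(final_hash: str) -> str:
    """Hypothetical stand-in for makeFallbackPath: deterministic, same length."""
    digest = hashlib.sha256(("rewrite:" + final_hash).encode()).hexdigest()
    return digest[:len(final_hash)]


def plan_output_rewrites(outputs: dict[str, str],
                         valid_in_store: set[str],
                         needs_hash_rewrite: bool) -> dict[str, str]:
    """Map scratch hash -> final hash for outputs already valid at build start."""
    rewrites = {}
    for _name, final_hash in outputs.items():
        if needs_hash_rewrite and final_hash in valid_in_store:
            rewrites[fallback_hash(final_hash)] = final_hash
    return rewrites


outputs = {"out": "o" * 32, "doc": "d" * 32}  # illustrative hashes

cold = plan_output_rewrites(outputs, set(), needs_hash_rewrite=True)
sibling = plan_output_rewrites(outputs, {"d" * 32}, needs_hash_rewrite=True)
no_rewrite = plan_output_rewrites(outputs, {"d" * 32}, needs_hash_rewrite=False)

print(len(cold), len(sibling), len(no_rewrite))  # 0 1 0
```

In this model a cold build and a builder that does not need hash rewriting both produce an empty rewrite map; only the combination of hash rewriting and a sibling already in the store yields an entry.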

Out of scope (follow-ups)

  • Lift the #ifdef __APPLE__ out of the shared rewriteOutput lambda via a virtual hook on DarwinDerivationBuilder.
  • Unit tests for the parser's defensive branches; the functional test already exercises the runtime path that catches stale page hashes.
  • 64-bit fat binary support (currently throws Error; modern macOS ships 32-bit fat).
  • Dual SHA-1 + SHA-256 CodeDirectory test fixture.

@github-actions bot added the documentation and with-tests labels Apr 8, 2026
@ak2k ak2k force-pushed the darwin-mach-o-page-hash-fixup branch from ad67af6 to 12dde40 April 8, 2026 00:38
@ak2k ak2k force-pushed the darwin-mach-o-page-hash-fixup branch from 12dde40 to d5db6f1 April 8, 2026 01:01
@Ericson2314
Member

To be clear, this is not about placeholders, but scratch paths, right?

@ak2k ak2k force-pushed the darwin-mach-o-page-hash-fixup branch from d5db6f1 to 789046f April 8, 2026 03:03
@ak2k
Author

ak2k commented Apr 8, 2026

Yes, thank you for the catch. RewritingSink substitutes scratch-path bytes (from makeFallbackPath-synthesised paths populated in outputRewrites when a sibling output is already in the store at build start), not builtins.placeholder bytes. The placeholder is the user-facing hook that causes a scratch path to end up embedded in the binary in the first place, but the bug is specifically in the scratch-path → final-path byte substitution after the builder exits.

Amending the PR description, commit message, and rl-next entry to use the precise terminology. The mechanism and the fix are unchanged — only the wording was wrong.

This is based on the work of @andrewgazelka in NixOS#14999.

On darwin, `DerivationBuilderImpl::registerOutputs` calls `RewritingSink`
to substitute scratch-path bytes in build outputs. When one output's
binary embeds a sibling output's store path via `${builtins.placeholder
"doc"}` and the sibling is already in the store at build start, the
embedded bytes are the sibling's scratch path (from `makeFallbackPath`),
which `registerOutputs` rewrites to the final hash after the builder
exits. The substitution is byte-level and has no knowledge of Mach-O
code signatures, but Apple's `ld` ad-hoc-signs every binary at link
time with the `linker-signed` flag set in `LC_CODE_SIGNATURE`. The
signature covers the very bytes that were just rewritten, so one or
more SHA-256 page hashes in the `CodeDirectory` are stale after the
rewrite, and the macOS kernel SIGKILLs the binary at first page-in
with `cs_invalid_page`. This surfaces in nixpkgs as
NixOS/nixpkgs#507531 (fish on
`nixpkgs-darwin` fails to start).

Add a darwin-only helper, `fixupMachoPageHashes`, called from the
`rewriteOutput` lambda after `movePath` and before
`canonicalisePathMetaData`. The helper recomputes only the affected
page-hash slots in place, leaving every other byte of the Mach-O
unchanged, including the `linker-signed` flag, the original 4 KiB
page size, the identifier, and every special slot. The fix-up is
length-preserving, so the result is bit-identical to a clean build of
the same input.

The same call site covers both `InputAddressed` and `CAFloating`/
`CAFixed`/`Impure` visitors, since they all go through the shared
`rewriteOutput` lambda.

Differences from NixOS#14999: (a) cover the InputAddressed call site, which
is what bites fish on `nixpkgs-darwin`; (b) preserve bit-reproducibility
so `nix build --check` does not fire on the rewrite alone; (c) preserve
`linker-signed` and the 4 KiB page size, which `codesign(1)` would clear
and change, respectively.

Closes: NixOS#6065
@ak2k ak2k force-pushed the darwin-mach-o-page-hash-fixup branch from 789046f to 883e433 April 8, 2026 03:13
@Ericson2314
Member

So I'll admit my current plan is to... just not have self references in the binary. How important is it to have paths like these?

(For other-output references, the idea is that you imperatively registered outputs, so you could do this sort of thing manually with Nix helping. E.g. install lib output, get store path back, use it in bin output.)

@emilazy
Member

emilazy commented Apr 8, 2026

> This is the root cause of NixOS/nixpkgs#507531

I doubt it: I expect this is an instance of the long-standing NixOS/nixpkgs#208951 bug, which is cursed and nondeterministic (fresh rebuilds tend to fix it and @zhaofengli found that it even depends on whether a machine has built the derivation before at all).

Since --check/--rebuild have the separate issue of clobbering code signatures, this patch will appear to fix the root cause under an obvious testing methodology without actually doing so.

(And TBH encoding Mach-O knowledge deep in Nix guts is a pretty awful layering violation, there are avenues to fix the rewriting issue that don't require that.)

@ak2k
Author

ak2k commented Apr 8, 2026

> So I'll admit my current plan is to... just not have self references in the binary.

Makes sense, and seems like the vastly superior architectural move. I wonder if something like this PR offers a short-term bridge for the existing darwin breakage (nixpkgs#507531, nixpkgs#208951) in the interim?

@emilazy
Member

emilazy commented Apr 8, 2026

I don't see evidence that this PR fixes those issues.

@ak2k
Author

ak2k commented Apr 8, 2026

Thank you, @emilazy. I'll try to address those three:

On whether this PR addresses the root cause of NixOS/nixpkgs#507531 / NixOS/nixpkgs#208951. You observed in nix-darwin#693 comment 38: "The bin/git in the broken derivation is identical to the correct one except for 32 bytes in the code signature section (a hash, maybe?). The share/man/man1/git.1.gz files, when uncompressed, differ only in the derivation hashes of the git-2.41.0-doc paths they link to." 32 bytes is the size of a SHA-256 hash entry in a CodeDirectory (CS_HASHSIZE_SHA256 = 256 bits), and the inline verification script in the PR description shows a single code-slot mismatch in each reproduction. The differently-embedded -doc paths match what RewritingSink rewrites when outputRewrites gets populated by a sibling output's presence in the store. My read is that this is the mechanism you were looking at. @winterqt later made an adjacent observation in #208951 comment 23: "It overrides the store path in the binaries, which breaks the code signatures of (at least) libraries (as they have their path embedded within). […]" I've added a source-level trace to the PR description for anyone who wants to follow the code path.

What this PR addresses is that specific mechanism — scratch-path → final-path byte substitution inside Mach-O code pages already covered by linker-signed page hashes. I shouldn't claim it covers every report in #208951; @tomberek's recent observation about "corrupted signature has been seen on different parts of the closure" may be a different root cause, of course.

This also explains the apparent nondeterminism that's made the bug hard to pin down. The trigger is hidden state: whether a sibling output happened to be in the store when the build started. On a fresh machine building an IA multi-output derivation cold, both outputs come out of the same builder run, neither is in the store yet, outputRewrites is empty for the sibling, and the binary is clean. On a machine that already has the sibling — from a prior build of this derivation, or substitution of one output but not the other — outputRewrites gets populated and RewritingSink runs over the binary. As I read it, this predicts that a cold-store build will produce a clean binary (it's the absence of the sibling at build start that matters, not the rebuild itself), and that whether a particular machine has previously built the derivation is what determines whether the bug surfaces. This would also explain why Hydra's cached artifacts tend to be valid: Hydra typically schedules a derivation's outputs together from a state where neither is in the store yet, so the rewrite path isn't exercised.

On the test methodology. I've added a non---check reproduction to the PR description: nix-store --delete <fish-out> followed by an ordinary nix build, with fish-doc still in the store, reproducing the same single page-hash mismatch. A positive counterpart against the patched daemon is included alongside it, showing the same derivation rebuilt cleanly (0/2526 mismatches). As I read the trigger condition, it isn't --check — it's "a sibling output is present in the store when the build starts". --check happens to force that state in a fresh store, which is convenient for the functional test, but an ordinary upgrade, a nix-store --delete of one output, or a failed build that left half the outputs behind would reach the same code path.

On the layering violation. I agree — Mach-O parsing logic in libstore is structurally wrong, and this PR makes it a little more wrong by adding page-hash recomputation alongside the existing darwin-specific code. My read is that the deeper cause is length-preserving byte substitution on already-signed outputs: ld -adhoc_codesign signs bytes that the daemon then rewrites via RewritingSink, and the Mach-O fixup only exists to repair what that substitution broke. The two avenues I can see that would actually move Mach-O knowledge out are @Ericson2314's imperative-output rework above (which would make the whole substitution machinery unnecessary) and a nixpkgs-side post-build hook (I'm not sure if the existing post-build-hook mechanism runs early enough to intercept the rewrite).

Short of those, both #14999 and this PR fix the corruption from inside the daemon, via different mechanisms. #14999 (@andrewgazelka) shells out to codesign -f -s - to re-sign the binary. That produces a working signature but not a bit-identical one — codesign's re-signed CodeDirectory differs from the original linker-signed signature in flags (the linker-signed bit is cleared), special-slot layout, and default page size. codesign -P 4096 would address the page size, but the other two remain, so rebuilds still fail --check. This PR recomputes the affected page hashes in place, which preserves bit-reproducibility.

@ak2k ak2k marked this pull request as ready for review April 8, 2026 21:21
@emilazy
Member

emilazy commented Apr 8, 2026

> a nix-store --delete of one output, or a failed build that left half the outputs behind would reach the same code path.

I don’t think these correspond to the circumstances in which we see Hydra produce these broken outputs.

It’s expected that local rebuilds of the derivations won’t have any issues, since it’s apparently nondeterministic and seemingly partially dependent on persistent system state of some kind. So building a broken‐in‐the‐cache derivation locally and seeing that it doesn’t exhibit the issue does not demonstrate that this PR fixes the issue; that’s already what we observe with no change.

I know @zhaofengli had a somewhat reproducible test setup, but it was difficult to arrange.

@emilazy
Member

emilazy commented Apr 9, 2026

(For clarity: the sibling outputs thing is an interesting observation that I can imagine might have something to do with what we’re seeing here, but given the amount of times we see random stuff in staging-next crashing at startup from this bug, I’d be pretty surprised if they happened to all be builds that had failed before but still produced some outputs, and then got built again on the same machine with the output not having been cleaned up, or similar.)

ak2k added a commit to ak2k/nix-507531-repro that referenced this pull request Apr 9, 2026
Three darwin-only flake apps targeting aarch64-darwin:

- ab-test (default): runs both halves of the A/B in one command. Three
  unpatched iterations to demonstrate the bug fires deterministically
  (bit-identical NAR hashes), then one patched iteration to demonstrate
  the fix. Prints a side-by-side comparison table and a final PASS/FAIL.

- unpatched-test: just the bug. Sets up the trigger state, rebuilds via
  the system nix-daemon, asserts 1/2526 mismatch + codesign FAIL + SIGKILL.

- patched-test: just the fix. Same trigger, rebuilds via a private daemon
  built from NixOS/nix#15638. Asserts 0/2526 mismatches + codesign PASS +
  fish runs.

All three target the exact same store path
(/nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1) and the exact
same nixpkgs revision (d96b37b). The only variable is the daemon.

A recorded transcript of a passing ab-test run is in
examples/ab-test-output.txt.
@ak2k
Author

ak2k commented Apr 9, 2026

Thank you, @emilazy. One correction first, then to your three points.

Correction: My statement that "a failed build that left half the outputs behind would reach the same code path" was wrong. Nix's registerOutputs is atomic per-derivation; a failed build doesn't leave one of a multi-output drv's outputs valid in the store. The other trigger scenarios are correct as written: nix-store --delete of one output, or substitution of one output but not the other.

> It's expected that local rebuilds of the derivations won't have any issues

Local rebuilds with the specific trigger setup (sibling output already in the store, target output absent, substitution disabled) do reliably exhibit the issue on my machine. I just ran three consecutive iterations on aarch64-darwin (macOS 26.2, unpatched system daemon Nix 2.24.10), same nix-store --delete, same --option substitute false, same nixpkgs d96b37b, same store state (fish-doc present from a prior build):

iteration 1 NAR hash: sha256:1qplch87dy4242vxwi3s5h62m6gnywn0f8z9wf659vkrh6hm4a0g
iteration 2 NAR hash: sha256:1qplch87dy4242vxwi3s5h62m6gnywn0f8z9wf659vkrh6hm4a0g
iteration 3 NAR hash: sha256:1qplch87dy4242vxwi3s5h62m6gnywn0f8z9wf659vkrh6hm4a0g

Bit-identical across all three runs. Each produced codesign: invalid signature, exit-137 SIGKILL, and 1/2526 mismatches at page 1872 @ 0x00750000, the same single-slot mismatch as the cache.nixos.org artifact at the same store path.

> since it's apparently nondeterministic

Seemingly nondeterministic in production, yes — but the trigger setup above is at least one state configuration under which it reproduces deterministically. As I read the bug I've verified, it's state-dependent rather than nondeterministic: it requires the sibling output to be in the store at the moment the build starts. Within that state, on this machine, it fires bit-identically. Outside that state (truly cold store, no sibling present), it doesn't trigger at all. Whether @zhaofengli's reproducible setup converges with this one or describes a distinct mechanism, I haven't traced yet; if there's a pointer to where it lives, I'd be glad to look and compare.

So building a broken‐in‐the‐cache derivation locally and seeing that it doesn't exhibit the issue does not demonstrate that this PR fixes the issue; that's already what we observe with no change.

This isn't the shape of the PR body's A/B. The unpatched local rebuild does exhibit the issue under the trigger setup — the same 1/2526 mismatch and SIGKILL shown above is what the unpatched block in the PR description shows. The patched-daemon rebuild from the same starting state and the same command sequence produces 0/2526 and runs. Same store path, same nixpkgs revision, same trigger condition; the only variable I changed is the daemon. So the comparison isn't "broken cache vs incidentally-clean local rebuild" — it's "broken cache, broken unpatched local rebuild under the trigger, clean patched local rebuild under the same trigger."

If you'd like to test on your machine, I've packaged the described trigger setup as a flake at https://github.com/ak2k/nix-507531-repro:

nix run github:ak2k/nix-507531-repro

On aarch64-darwin it runs the unpatched rebuild three times (to show the bug fires deterministically across iterations), runs the patched rebuild once via a private daemon built from this PR's commit, then prints a side-by-side comparison table and a final PASS/FAIL line. ~5–7 min wall time, one sudo prompt for the patched daemon spawn. A recorded transcript of a passing run is at examples/ab-test-output.txt if you'd rather read than run.

If on your darwin machine the corruption reproduces on the unpatched daemon under this trigger, the A/B in the PR body holds and the patched daemon's fix applies under the same trigger. If it doesn't reproduce, that would point to a second state factor I haven't isolated here, distinct from the trigger above. Either way, the bug is independently visible: cache.nixos.org is serving a fish that macOS refuses to execute, the unpatched local rebuilds above reproduce the same single-slot mismatch deterministically, and #507531 / #208951 seem to collect related reports from darwin users.

One epistemic caveat I should make: the observable here — a single SHA-256 page hash slot mismatch in a linker-signed CodeDirectory — is narrow enough that two distinct mechanisms could in principle produce indistinguishable outputs, and codesign --verify plus the kernel's page-in check can't tell them apart. The claim I can make is that this PR fixes the mechanism I've reproduced; the claim I can't make is that it fixes every report regardless of mechanism, since some of those could in principle have a distinct cause that converges on the same symptom. If there's something concrete pointing at a separate mechanism in some of those reports, I'd be glad to dig in.

@ak2k
Author

ak2k commented Apr 9, 2026

Thanks @emilazy — your parenthetical caught a real narrowness in how I'd been framing this. Two refinements after looking more carefully at the mechanism and at Hydra's source.

1. The trigger isn't specifically "sibling present"; it's "any output being built has scratchPath != finalPath", which happens whenever that output was in the store at build start. In src/libstore/unix/build/derivation-builder.cc at this PR's base a37db9d24, scratch-path selection routes any already-present output through makeFallbackPath(status.known->path) at L798; the finish lambda at L1614 populates outputRewrites with a scratchHash → finalHash entry for every such output; and the rewriteOutput lambda at L1634 applies those rewrites via RewritingSink. Two variants:

  • Sibling-reference: X is being built, sibling Y is present, X embeds Y's path. fish hits this: fish-4.2.1/bin/fish embeds its fish-4.2.1-doc sibling's share/doc/fish path as a literal string in __TEXT,__cstring.
  • Self-reference: X is being built, X itself is already present at build start (forcing a fallback scratch path for X), X embeds $out-derived self-references. This fires under --rebuild / --repair-path specifically — a wanted=false already-valid output otherwise skips rewriteOutput via the AlreadyRegistered path at L1484, so the self-ref rewrite on X's own new bytes only runs when the rebuild explicitly re-enters rewriteOutput. zsh hits this under the PR's --rebuild functional test: zsh-5.9/bin/zsh has no sibling runtime references at all, but three self-references to /nix/store/s07v...zsh-5.9/{share/zsh/5.9/functions,lib/zsh/5.9,etc/zshenv}, all inside the flags=0x20002(adhoc,linker-signed) CodeDirectory.

The sibling/self split is the same distinction @Ericson2314 drew from the architectural direction above ("just not have self references in the binary"), reached here from the source side.
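The mechanism in both variants can be sketched in a few lines. This is a toy model, not the PR's fixupMachoPageHashes and not RewritingSink: the "binary" is a zero-filled buffer, the page-hash slots are a Python list rather than a CodeDirectory located through LC_CODE_SIGNATURE, and the two 32-character hashes and the demo path are made up for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # 4 KiB, per the PR description


def hash_page(data: bytearray, page: int) -> bytes:
    start = page * PAGE_SIZE
    return hashlib.sha256(bytes(data[start:start + PAGE_SIZE])).digest()


def rewrite_hash(data: bytearray, old: bytes, new: bytes) -> set[int]:
    """Length-preserving byte substitution; returns the pages it touched."""
    if len(old) != len(new):
        raise ValueError("rewrite must be length-preserving")
    touched: set[int] = set()
    pos = data.find(old)
    while pos != -1:
        data[pos:pos + len(new)] = new
        first, last = pos // PAGE_SIZE, (pos + len(new) - 1) // PAGE_SIZE
        touched.update(range(first, last + 1))
        pos = data.find(old, pos + len(new))
    return touched


def fix_slots(data: bytearray, slots: list[bytes], touched: set[int]) -> None:
    """Recompute only the slots for touched pages; leave the rest alone."""
    for page in touched:
        slots[page] = hash_page(data, page)


SCRATCH_HASH = b"s" * 32  # hypothetical scratch-path hash
FINAL_HASH = b"f" * 32    # hypothetical final-path hash
image = bytearray(4 * PAGE_SIZE)
path = b"/nix/store/" + SCRATCH_HASH + b"-demo-1.0-doc/share/doc"
start = 2 * PAGE_SIZE - 27  # the hash digits straddle pages 1 and 2
image[start:start + len(path)] = path

slots = [hash_page(image, p) for p in range(4)]  # "link-time" signature
signed = list(slots)

touched = rewrite_hash(image, SCRATCH_HASH, FINAL_HASH)
stale = [p for p in range(4) if hash_page(image, p) != slots[p]]
fix_slots(image, slots, touched)
changed = [p for p in range(4) if slots[p] != signed[p]]
print("stale before fix-up:", stale, "slots rewritten:", changed)
```

Because the substitution is length-preserving, the buffer size and every untouched slot stay byte-identical; only the slots covering rewritten bytes differ from the link-time signature.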

2. The trigger state is not rare on Hydra; it's a consequence of per-output substitution. My earlier examples (nix-store --delete of one output, substitution of one output but not the other) sounded narrow because I was thinking about end-user actions. Looking at Hydra's queue-runner source, the routes are routine:

  • In subprojects/hydra-queue-runner/src/state/mod.rs at cd235f7, the master computes the set of missing outputs for a drv (L1788–L1815), then fans them out per output via substitute_output in a buffer_unordered(10) stream. A single failed substitute (transient network, partial cache state, S3 inconsistency) leaves the drv with some outputs substituted and others missing. The drv is then scheduled for a local build, with the partial state in place.
  • The builder side does the same on its own store: substitute_paths loops ensure_path per path with the same partial-failure mode.
  • Long-running workers: a worker that previously built package Y (where Y runtime-depends on D.out) ends up with D.out in its store. When Hydra later asks that worker to build D itself, D.out is present → fallback scratch path → rewrite fires in the newly-built bytes.
  • GC asymmetry: workers GC on runtime-root reachability. For multi-output drvs where out is rooted but siblings aren't, the siblings get collected while out stays (or vice-versa for the rarer case).

None of these require "failed build that left some outputs"; they're routine consequences of Hydra's per-output substitution design plus long-running worker store state. For a given staging-next rebuild, the probability that a multi-output drv has at least one output present when its sibling needs rebuilding seems high enough that I don't think it would need to be a rare coincidence to explain the observed rate.
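As a rough model of how partial store state arms the trigger, the sketch below substitutes each output independently and then lists which outputs would get a fallback scratch path. It is not Hydra's or Nix's code: it runs sequentially rather than through a buffer_unordered(10) stream, and `substitute_each`, `fallback_outputs`, and `flaky_fetch` are hypothetical names for illustration.

```python
def substitute_each(outputs: list[str], fetch) -> set[str]:
    """Fetch every output on its own; a failure leaves only that one missing."""
    present: set[str] = set()
    for name in outputs:
        try:
            fetch(name)
        except OSError:
            continue  # one failed fetch does not undo the others
        present.add(name)
    return present


def fallback_outputs(outputs: list[str], present: set[str]) -> list[str]:
    """Outputs already in the store at build start get a fallback scratch
    path, so references to them are rewritten after the builder exits."""
    return [name for name in outputs if name in present]


def flaky_fetch(name: str) -> None:
    # Simulated transient failure for one output only.
    if name == "out":
        raise OSError("simulated transient network failure")


outputs = ["out", "doc"]
present = substitute_each(outputs, flaky_fetch)
missing = [name for name in outputs if name not in present]
print("missing, so a local build is scheduled:", missing)
print("present, so built with a fallback scratch path:",
      fallback_outputs(outputs, present))
```

In this toy run `doc` is present and `out` is missing, which is the fish-style sibling-reference state described above.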

Quick pattern-check against the heavily-mentioned packages in the threads, verified in-situ on this machine via strings and otool -L. Sibling-reference: fish (bin/fish embeds its fish-4.2.1-doc sibling path as a cstring literal in __TEXT,__cstring), git (bin/git embeds its git-2.51.2-doc sibling path as a cstring literal), curl (-bin's bin/curl references its out sibling's libcurl.4.dylib via an LC_LOAD_DYLIB load command — same rewrite mechanism, different Mach-O section; the trigger isn't specific to cstring literals, it fires on any byte covered by a page hash). Self-reference: zsh (bin/zsh has three $out-derived self-references, no sibling refs), bash (bin/bash has one $out-derived self-reference, no sibling refs). gitFull isn't built locally so I haven't traced it, but it's the same derivation family as git.
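The pattern-check can be approximated without strings or otool by scanning raw bytes for store paths and sorting them into self, sibling, and other references. This is a sketch under assumptions, not the procedure used above: the regular expression is a loose approximation of a store path (32 lowercase alphanumerics, a dash, then a name), and the paths in the demo are made up.

```python
import re

# Loose approximation of a store path: hash part, dash, name up to the next "/".
STORE_REF = re.compile(rb"/nix/store/[0-9a-z]{32}-[0-9A-Za-z+._?=-]+")


def classify_refs(data: bytes, self_path: str,
                  siblings: set[str]) -> dict[str, set[str]]:
    """Group every store path found in data by its relation to this output."""
    found: dict[str, set[str]] = {"self": set(), "sibling": set(), "other": set()}
    for match in STORE_REF.finditer(data):
        path = match.group(0).decode("ascii")
        if path == self_path:
            kind = "self"
        elif path in siblings:
            kind = "sibling"
        else:
            kind = "other"
        found[kind].add(path)
    return found


# Hypothetical store paths for illustration only.
self_path = "/nix/store/" + "a" * 32 + "-demo-1.0"
doc_path = "/nix/store/" + "b" * 32 + "-demo-1.0-doc"
dep_path = "/nix/store/" + "c" * 32 + "-libdep-2.3"
blob = (b"\x00" + self_path.encode() + b"/lib/demo\x00"
        + doc_path.encode() + b"/share/doc/demo\x00"
        + dep_path.encode() + b"/lib/libdep.dylib\x00")

refs = classify_refs(blob, self_path, {doc_path})
for kind, paths in refs.items():
    print(kind, sorted(paths))
```

Scanning the whole file rather than one section matches the point above: a reference in a cstring literal and one in a load command are both just bytes covered by a page hash.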

If you or @zhaofengli have a reproduction that doesn't fit this trigger, I'd want to look at it.

JacobPEvans added a commit to JacobPEvans/nix-darwin that referenced this pull request Apr 10, 2026
* feat: tiered GitHub token context switching

Add gh-restricted, gh-private, gh-admin functions that switch
GITHUB_TOKEN by reading tiered PATs from macOS Keychain. Defaults
to restricted on shell startup; escalation gated by keychain password.

Restricted uses automation.keychain-db (AI accessible).
Private and admin use elevate-access.keychain-db (user unlock required).

Centralizes token configuration in lib/user-config.nix under
github.tokens with per-tier service + keychain attributes for DRY.

Includes temporary direnv darwin overlay tracking NixOS/nix#6065:
Mach-O signature corruption causes fish test SIGKILL. Remove when
NixOS/nix#15638 lands.

* refactor: improve gh-token-switching error handling and cleanup

- gh-token-switching.zsh: call security directly to distinguish missing
  entries from locked/access-denied/empty failures; route errors to stderr;
  add REQUIRES contract comment listing expected env vars
- home.nix: unset _get_keychain_secret and _KC_AI_DB after init since the
  switching functions no longer need them at runtime
- direnv-darwin-fix.nix: use lib.optionalAttrs instead of if/then/else
Labels

documentation with-tests Issues related to testing. PRs with tests have some priority

Development

Successfully merging this pull request may close these issues.

Content-addressed derivation fails to build on aarch64-darwin

3 participants