libstore: Bit-reproducibly fix darwin Mach-O page hashes after rewriting#15638
libstore: Bit-reproducibly fix darwin Mach-O page hashes after rewriting#15638ak2k wants to merge 1 commit intoNixOS:masterfrom
Conversation
ad67af6 to
12dde40
Compare
12dde40 to
d5db6f1
Compare
|
To be clear, this not about placeholders, but scratch paths, right? |
d5db6f1 to
789046f
Compare
|
Yes, thank you for the catch. Amending the PR description, commit message, and |
This is based on the work of @andrewgazelka in NixOS#14999. On darwin, `DerivationBuilderImpl::registerOutputs` calls `RewritingSink` to substitute scratch-path bytes in build outputs. When one output's binary embeds a sibling output's store path via `${builtins.placeholder "doc"}` and the sibling is already in the store at build start, the embedded bytes are the sibling's scratch path (from `makeFallbackPath`), which `registerOutputs` rewrites to the final hash after the builder exits. The substitution is byte-level and has no knowledge of Mach-O code signatures, but Apple's `ld` ad-hoc-signs every binary at link time with the `linker-signed` flag set in `LC_CODE_SIGNATURE`. The signature covers the very bytes that were just rewritten, so one or more SHA-256 page hashes in the `CodeDirectory` are stale after the rewrite, and the macOS kernel SIGKILLs the binary at first page-in with `cs_invalid_page`. This surfaces in nixpkgs as NixOS/nixpkgs#507531 (fish on `nixpkgs-darwin` fails to start). Add a darwin-only helper, `fixupMachoPageHashes`, called from the `rewriteOutput` lambda after `movePath` and before `canonicalisePathMetaData`. The helper recomputes only the affected page-hash slots in place, leaving every other byte of the Mach-O unchanged, including the `linker-signed` flag, the original 4 KiB page size, the identifier, and every special slot. The fix-up is length-preserving, so the result is bit-identical to a clean build of the same input. The same call site covers both `InputAddressed` and `CAFloating`/ `CAFixed`/`Impure` visitors, since they all go through the shared `rewriteOutput` lambda. Differences from NixOS#14999: (a) cover the InputAddressed call site, which is what bites fish on `nixpkgs-darwin`; (b) preserve bit-reproducibility so `nix build --check` does not fire on the rewrite alone; (c) preserve `linker-signed` and the 4 KiB page size, both of which `codesign(1)` would clear. Closes: NixOS#6065
789046f to
883e433
Compare
|
So I'll admit my current plan is to... just not have self references in the binary. How important is it to have paths like these? (For other-output references, the idea is that you imperatively registered outputs, so you could do this sort of thing manually with Nix helping. E.g. install lib output, get store path back, use it in bin output.) |
I doubt it: I expect this is an instance of the long-standing NixOS/nixpkgs#208951 bug, which is cursed and nondeterministic (fresh rebuilds tend to fix it and @zhaofengli found that it even depends on whether a machine has built the derivation before at all). Since (And TBH encoding Mach-O knowledge deep in Nix guts is a pretty awful layering violation, there are avenues to fix the rewriting issue that don't require that.) |
Makes sense, and seems like the vastly superior architecture move. I wonder if something like this PR offers a short-term bridge for the existing darwin breakage (nixpkgs#507531, nixpkgs#208951) in the interim? |
|
I don't see evidence that this PR fixes those issues. |
|
Thank you, @emilazy. I'll try to address those three: On whether this PR addresses the root cause of NixOS/nixpkgs#507531 / NixOS/nixpkgs#208951. You observed in nix-darwin#693 comment 38: "The What this PR addresses is that specific mechanism — scratch-path → final-path byte substitution inside Mach-O code pages already covered by This also explains the apparent nondeterminism that's made the bug hard to pin down. The trigger is hidden state: whether a sibling output happened to be in the store when the build started. On a fresh machine building an IA multi-output derivation cold, both outputs come out of the same builder run, neither is in the store yet, On the test methodology. I've added a non- On the layering violation. I agree — Mach-O parsing logic in Short of those, both #14999 and this PR fix the corruption from inside the daemon, via different mechanisms. #14999 (@andrewgazelka) shells out to |
I don’t think these correspond to the circumstances in which we see Hydra produce these broken outputs. It’s expected that local rebuilds of the derivations won’t have any issues, since it’s apparently nondeterministic and seemingly partially dependent on persistent system state of some kind. So building a broken‐in‐the‐cache derivation locally and seeing that it doesn’t exhibit the issue does not demonstrate that this PR fixes the issue; that’s already what we observe with no change. I know @zhaofengli had a somewhat reproducible test setup, but it was difficult to arrange. |
|
(For clarity: the sibling outputs thing is an interesting observation that I can imagine might have something to do with what we’re seeing here, but given the amount of times we see random stuff in |
Three darwin-only flake apps targeting aarch64-darwin: - ab-test (default): runs both halves of the A/B in one command. Three unpatched iterations to demonstrate the bug fires deterministically (bit-identical NAR hashes), then one patched iteration to demonstrate the fix. Prints a side-by-side comparison table and a final PASS/FAIL. - unpatched-test: just the bug. Sets up the trigger state, rebuilds via the system nix-daemon, asserts 1/2526 mismatch + codesign FAIL + SIGKILL. - patched-test: just the fix. Same trigger, rebuilds via a private daemon built from NixOS/nix#15638. Asserts 0/2526 mismatches + codesign PASS + fish runs. All three target the exact same store path (/nix/store/gngn7y9mn510mf1hkmr0l69qbpvxfbfh-fish-4.2.1) and the exact same nixpkgs revision (d96b37b). The only variable is the daemon. A recorded transcript of a passing ab-test run is in examples/ab-test-output.txt.
|
Thank you, @emilazy. One correction first, then to your three points. Correction: My statement that "a failed build that left half the outputs behind would reach the same code path" was wrong. Nix's
Local rebuilds with the specific trigger setup (sibling output already in the store, target output absent, substitution disabled) do reliably exhibit the issue on my machine. I just ran three consecutive iterations on Bit-identical between runs. Both produced
Seemingly nondeterministic in production, yes — but the trigger setup above is at least one state configuration under which it reproduces deterministically. As I read the bug I've verified, it's state-dependent rather than nondeterministic: it requires the sibling output to be in the store at the moment the build starts. Within that state, on this machine, it fires bit-identically. Outside that state (truly cold store, no sibling present), it doesn't trigger at all. Whether @zhaofengli's reproducible setup converges with this one or describes a distinct mechanism, I haven't traced yet; if there's a pointer to where it lives, I'd be glad to look and compare.
This isn't the shape of the PR body's A/B. The unpatched local rebuild does exhibit the issue under the trigger setup — the same If you'd like to test on your machine, I've packaged the described trigger setup as a flake at https://github.com/ak2k/nix-507531-repro: On If on your darwin machine the corruption reproduces on the unpatched daemon under this trigger, the A/B in the PR body holds and the patched daemon's fix applies under the same trigger. If it doesn't reproduce, that would point to a second state factor I haven't isolated here, distinct from the trigger above. Either way, the bug is independently visible: One epistemic caveat I should make: the observable here — a single SHA-256 page hash slot mismatch in a |
|
Thanks @emilazy — your parenthetical caught a real narrowness in how I'd been framing this. Two refinements after looking more carefully at the mechanism and at Hydra's source. 1. The trigger isn't specifically "sibling present"; it's "any output being built has
The sibling/self split is the same distinction @Ericson2314 drew from the architectural direction above — "just not have self references in the binary" — reached here from the source side. 2. The trigger state is not rare on Hydra; it's a consequence of per-output substitution. My earlier examples (
None of these require "failed build that left some outputs"; they're routine consequences of Hydra's per-output substitution design plus long-running worker store state. For a given staging-next rebuild, the probability that a multi-output drv has at least one output present when its sibling needs rebuilding seems high enough that I don't think it would need to be a rare coincidence to explain the observed rate. Quick pattern-check against the heavily-mentioned packages in the threads, verified in-situ on this machine via If you or @zhaofengli have a reproduction that doesn't fit this trigger, I'd want to look at it. |
Add gh-restricted, gh-private, gh-admin functions that switch GITHUB_TOKEN by reading tiered PATs from macOS Keychain. Defaults to restricted on shell startup; escalation gated by keychain password. Restricted uses automation.keychain-db (AI accessible). Private and admin use elevate-access.keychain-db (user unlock required). Centralizes token configuration in lib/user-config.nix under github.tokens with per-tier service + keychain attributes for DRY. Includes temporary direnv darwin overlay tracking NixOS/nix#6065: Mach-O signature corruption causes fish test SIGKILL. Remove when NixOS/nix#15638 lands.
* feat: tiered GitHub token context switching Add gh-restricted, gh-private, gh-admin functions that switch GITHUB_TOKEN by reading tiered PATs from macOS Keychain. Defaults to restricted on shell startup; escalation gated by keychain password. Restricted uses automation.keychain-db (AI accessible). Private and admin use elevate-access.keychain-db (user unlock required). Centralizes token configuration in lib/user-config.nix under github.tokens with per-tier service + keychain attributes for DRY. Includes temporary direnv darwin overlay tracking NixOS/nix#6065: Mach-O signature corruption causes fish test SIGKILL. Remove when NixOS/nix#15638 lands. * refactor: improve gh-token-switching error handling and cleanup - gh-token-switching.zsh: call security directly to distinguish missing entries from locked/access-denied/empty failures; route errors to stderr; add REQUIRES contract comment listing expected env vars - home.nix: unset _get_keychain_secret and _KC_AI_DB after init since the switching functions no longer need them at runtime - direnv-darwin-fix.nix: use lib.optionalAttrs instead of if/then/else
Based on the work of @andrewgazelka in #14999. Happy to close this and fold into that branch instead if @andrewgazelka prefers; see the Context section for the technical differences.
Motivation
On darwin,
DerivationBuilderImpl::registerOutputscallsRewritingSinkto substitute scratch-path bytes in build outputs. When a sibling output is already in the store at build start, its scratch path is amakeFallbackPath-synthesised stand-in, andRewritingSinkrewrites those scratch-path bytes to the sibling's final path after the builder exits. The substitution is byte-level and has no knowledge of Mach-O code signatures, but Apple'sldad-hoc-signs every binary at link time with thelinker-signedflag set inLC_CODE_SIGNATURE. The signature covers the very bytes that were just rewritten, so one or more SHA-256 page hashes in theCodeDirectoryare stale after the rewrite. At first page-in, the macOS kernel SIGKILLs the process withcs_invalid_page.This is the root cause of NixOS/nixpkgs#507531 (fish on
nixpkgs-darwinfails to start) and NixOS/nix#6065 (open since 2022 against CA derivations; same mechanism, wider surface).Context
Add a darwin-only helper,
fixupMachoPageHashes, called from therewriteOutputlambda aftermovePathand beforecanonicalisePathMetaDatainderivation-builder.cc. The helper recomputes only the affected page-hash slots in place, leaving every other byte of the Mach-O unchanged, including thelinker-signedflag, the original 4 KiB page size, the identifier, and every special slot. The fix-up is length-preserving, so the result is bit-identical to a clean build of the same input.The single call site covers both
InputAddressedandCAFloating/CAFixed/Impurevisitors, since they all go through the sharedrewriteOutputlambda.Differences from #14999
darwin-codesign.cc)rewriteOutputlambda → both IA and CACodeDirectoryflagslinker-signedcleared bycodesign -s -codesignadds special slots (non-minimalCodeDirectory)linker-signed)codesign -P 4096)/usr/bin/codesignper Mach-O fileVerification
Run against a patched daemon on
aarch64-darwinmacOS 26.2, using a standalone synthetic reproducer (a multi-outputstdenv.mkDerivationwhosebin/helloembeds${builtins.placeholder "doc"}).Verification script (
check.py, Python stdlib only)The reproductions below use a small Python script to report
nCodeSlots,pageSize,codeLimit, and per-slot SHA-256 mismatches againstLC_CODE_SIGNATURE.CodeDirectory. It is standalone Python 3 with no third-party dependencies, so it runs on any macOS with system Python. Save ascheck.pyand invoke aspython3 check.py <binary>.Assumes a thin (single-arch) Mach-O; none of the reproducers below produce fat binaries on
aarch64-darwin.IA cold build (patched daemon)
IA
--rebuild(the bug trigger; patched daemon — full bit-reproducibility achieved)The
.checkdirectory does not exist. The rebuilt NAR is byte-identical to the cold build, so Nix's determinism check passes silently and--rebuildreturns 0 without producing a.checkdivergence directory. The rewrite ran (outputRewriteswas populated becausedocwas already in the store from the cold build), the helper recomputed the affected page hashes in place, and the result is byte-identical because the helper preserves every other byte of the Mach-O including theLC_UUIDpayload.CA cold build (patched daemon — closes #6065)
The CA path triggers the bug on the cold build itself because CA derivations always go through the rewrite path (the scratch hash and the final content-addressed hash always differ). No
--rebuildneeded.Same
pageSize=4096, samenCodeSlots=13, samelinker-signedflag, sameadhocsignature format as the IA case. The "single call site covers both IA and CA" claim is empirically tested, not a code-reading argument. This is the variant that directly exercises #6065, which was filed against CA derivations specifically.Negative control: unpatched daemon (Nix 2.24.10 — has the bug)
To prove the fix is real, the same reproducer was run via an unpatched
nix-daemon(Nix 2.24.10). The cold build succeeds (cold builds never trigger the bug), but--rebuildmaterialises the corrupted binary in a.checkdirectory:Same
nCodeSlots=13and same affected page (slot 3, file offset 0x3000) as the patched daemon's cold build. This confirms three things in one negative control: (1) the bug fires reliably on Nix 2.24.10; (2) the rewrite affects exactly one page, exactly the page our analysis predicted; (3) the patched daemon's cold-build slot layout is bit-identical to the unpatched daemon's — the helper preserves the originalCodeDirectorystructure entirely.Non-
--checkreproduction on production fish (no determinism check involved)This demonstrates that the mechanism fires on an ordinary
nix buildwhenever a sibling output is present in the store at build start. No--check, no--rebuild, no special flags —nix-store --delete <fish-out>removes the binary while leavingfish-docin the store, then an ordinarynix buildreproduces the corruption.The trigger is
fish-docbeing in the store at build start:outputRewritesis populated withfish-doc's scratch→final hash mapping,RewritingSinksubstitutes those bytes insidebin/fish's__TEXT,__cstringpage at file offset 0x750000, and the corresponding page hash slot inLC_CODE_SIGNATURE.CodeDirectory(slot 1872, the only one of 2526 that mismatches) becomes stale. The kernel SIGKILLs at first page-in.Non-
--checkreproduction on production fish, with the patched daemon (positive result — closes nixpkgs#507531 on the real package)Same sequence as the previous block, same nixpkgs revision, same trigger condition (
fish-docis in the store at build start; the bin output is deleted). The only variable is the daemon, which carries this PR's commit883e43319.Where the previous block produced
1/2526 mismatches at page 1872and a SIGKILL at page-in, this block produces0/2526and a runnable binary. TheCodeDirectoryhasflags=0x20002 (adhoc,linker-signed)(onlyldcan set thelinker-signedbit;codesign -f -s -clears it),hashes=2526+0(minimal special-slot layout matchingld's original, not the non-minimal layout thatcodesign -f -s -produces), andpageSize=4096(matchingld -adhoc_codesign's default). Same store path, same nixpkgs revision, same trigger condition — the only variable is the daemon.A note on byte-level non-determinism across rebuilds. Independent patched-daemon rebuilds of fish 4.2.1 produce distinct NAR hashes, because fish's own build is not bit-reproducible across runs independent of this fix. On the host where this block was produced, two cold builds with no sibling in the store — builds where the post-rewrite helper never runs — differ from each other by hundreds of thousands of bytes, almost entirely in Rust-generated codegen sections and Apple
ld's per-invocationLC_UUIDpayload. The helper is a surgical in-place update to theCodeDirectory's page hash slots and cannot introduce variability to code pages, DWARF sections, orLC_UUID. The bit-reproducibility claim of this PR is demonstrated against the reproducible synthetichello-multireproducer in the blocks above (where no Rust/sphinx variability applies); fish is the runtime-correctness validation against the actual package from the original reports.Source-level trace (Nix master
a37db9d24, the base commit of this PR)The
rewriteOutputlambda —derivation-builder.cc:1634-1662. WrapsRewritingSinkarounddumpPath(actualPath), streams throughrestorePathinto<path>.tmp, thendeletePath,movePath, andcanonicalisePathMetaData. Nothing in this lambda has any awareness of Mach-O or code signatures; the substitution is byte-level and format-agnostic.The
RewritingSink::operator()implementation —references.cc:72-89. PurerewriteStrings(s, rewrites)with an equal-length assertion at construction — no file-format awareness, no signing knowledge.The InputAddressed visitor —
derivation-builder.cc:1784-1788:hashPathis the immediately-next call afterrewriteOutputwith nothing between them that could re-sign the file. Any corruption introduced by the substitution is locked into the NAR hash on the next line.The CA visitor path —
newInfoFromCAatderivation-builder.cc:1689, with the firstrewriteOutputcall at line 1705. Same lambda, same byte substitution, reached via theCAFloating/CAFixed/Impurevisitors.Because (1) is the shared lambda called from both (3) and (4), the helper this PR adds is inserted once inside (1) and covers both IA and CA visitor paths in a single call site.
On why the mechanism fires on darwin and not Linux:
needsHashRewrite()returnstrueunconditionally in the base class — on Linux with chroot it's overridden to returnfalseandoutputRewritesstays empty for already-known paths. On darwin, it returnstrue, so whenever a sibling output is already present in the store at build start, that output's scratch path becomes amakeFallbackPath-synthesised stand-in, populatingoutputRewriteswith ascratchHash → finalHashentry. When the other output runs through therewriteOutputlambda above, that entry is what gets substituted into its binary — and when that binary is ald -adhoc_codesign-signed Mach-O, the substituted bytes fall inside a page already covered by alinker-signedSHA-256 hash inLC_CODE_SIGNATURE.Out of scope (follow-ups)
#ifdef __APPLE__out of the sharedrewriteOutputlambda via a virtual hook onDarwinDerivationBuilder.Error; modern macOS ships 32-bit fat).CodeDirectorytest fixture.