ci(docker): use registry-backed build cache for arm64 #37129
Merged
The arm64 PR build ran fully uncached because the previous gha cache backend's short-lived Azure SAS token expired mid-build on slow cold-cache arm64 runs and crashed before the smoke test. Uncached arm64 PR builds were ~45% slower than amd64 (median 553s vs 382s), making the arm64 job the one most often cancelled on supersede, which surfaces as a red X in PR checks and reads as 'the arm64 build keeps failing'.

Switch arm64 to a registry-backed cache on ghcr.io (type=registry, ref ghcr.io/nousresearch/hermes-agent:buildcache-arm64). Its credential is the job-lifetime GITHUB_TOKEN, not a time-boxed SAS token, so the cold-build-outlives-token failure mode cannot recur.

- PR builds: cache-from only (read-only), giving warm layers, no write races, and no cache-ref pollution from rapid PR pushes.
- main/release builds: cache-from + cache-to (mode=max) to populate the cache for subsequent PR/main builds and let the digest push reuse the smoke-test build's layers.
- Add packages: write permission and a ghcr.io login for the cache.

amd64 keeps its gha cache: it builds fast enough to stay inside the SAS token's lifetime, so it never hit this failure mode.
changman pushed a commit to changman/hermes-agent that referenced this pull request on Jun 10, 2026
Problem
A report of the arm64 Docker build "regularly failing" turned out to be cancellation noise, not real failures. Across the last 110 completed `docker-publish` runs:
The single `failure` was a PR-code esbuild error that failed identically on both arches, so it was not arm64-specific. The real signal is that arm64 is cancelled 2.6× more often than amd64. In 8 recent runs, amd64 succeeded while arm64 was cancelled in the same superseded run. A cancelled job renders as a red ✗ in PR checks, which reads as "the arm64 build keeps failing."
Why arm64 is the cancellation casualty
arm64 PR builds run fully uncached. The previous `type=gha` cache was removed from arm64 PRs because cold-cache arm64 builds outlived GitHub's short-lived Azure cache SAS token and crashed on a cache blob operation before the smoke test. Without a cache, arm64 PR builds are ~45% slower than amd64 (median 553s vs 382s, max 819s), so on fast-iterated branches `cancel-in-progress` kills the still-running arm64 job after amd64 has already finished.
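For readers unfamiliar with the supersede mechanism, the fragment below is a minimal sketch of a workflow-level `concurrency` block that would produce this behavior. The group key is an assumption for illustration; the actual key used by `docker-publish` is not shown in this PR description.

```yaml
# Sketch only: the group name is hypothetical, not copied from the workflow.
concurrency:
  # Runs that share this key supersede each other (here: one group per ref).
  group: docker-publish-${{ github.ref }}
  # A new push cancels the in-progress run for the same group. Jobs that
  # already finished keep their result; jobs still running are cancelled.
  cancel-in-progress: true
```

With a setup like this, the slowest job in a run is the one most exposed to cancellation, which matches the arm64 pattern described above.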
Fix
Switch arm64 to a registry-backed cache on ghcr.io (`type=registry`, ref `ghcr.io/nousresearch/hermes-agent:buildcache-arm64`).
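As a rough sketch of what this could look like in the workflow, following the commit message's split between read-only PR builds and read-write main/release builds: the step name, `context`, `platforms` value, and the `github.event_name` condition are assumptions for illustration, not the merged diff.

```yaml
# Sketch only: step name, context, and the event condition are hypothetical.
- name: Build arm64 image
  uses: docker/build-push-action@v6
  with:
    context: .
    platforms: linux/arm64
    # Every build reads warm layers from the shared registry cache ref.
    cache-from: type=registry,ref=ghcr.io/nousresearch/hermes-agent:buildcache-arm64
    # Only non-PR builds export the cache; an empty value disables cache export.
    cache-to: ${{ github.event_name != 'pull_request' && 'type=registry,ref=ghcr.io/nousresearch/hermes-agent:buildcache-arm64,mode=max' || '' }}
```

Keeping PR builds read-only means rapid pushes to a PR branch never race to overwrite the shared cache ref.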
Why this won't repeat the gha failure: the registry cache authenticates with the job-lifetime `GITHUB_TOKEN`, not a time-boxed SAS token minted at job start that expires mid-build. The exact cold-build-outlives-token failure mode that killed `type=gha` on slow arm64 builds cannot recur.
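The commit message also mentions adding a `packages: write` permission and a ghcr.io login. A minimal sketch of those two pieces at job level might look like the following; the `contents: read` entry, the step name, and the use of `github.actor` as the username are assumptions.

```yaml
# Sketch only: job-level fragment; names and the contents permission are assumed.
permissions:
  contents: read
  packages: write   # allows GITHUB_TOKEN to push the cache ref to ghcr.io

steps:
  - name: Log in to ghcr.io
    uses: docker/login-action@v3
    with:
      registry: ghcr.io
      username: ${{ github.actor }}
      # Job-scoped token, so it stays valid for as long as the job runs.
      password: ${{ secrets.GITHUB_TOKEN }}
```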
amd64 keeps its gha cache unchanged — it builds fast enough to stay inside the SAS token's lifetime and never hit this failure mode.
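For contrast, an amd64 step that stays on the GitHub Actions cache backend could be sketched as below. The step name and `mode=max` are assumptions; the PR description only states that amd64 keeps `type=gha`.

```yaml
# Sketch only: step name and cache mode are hypothetical.
- name: Build amd64 image
  uses: docker/build-push-action@v6
  with:
    context: .
    platforms: linux/amd64
    # GitHub Actions cache backend, which relies on the short-lived cache token.
    cache-from: type=gha
    cache-to: type=gha,mode=max
```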
Impact case
This is the default path for every arm64 PR build, not a narrow edge case: every contributor iterating on a Docker-touching PR currently sees arm64 finish last and get cancelled on their next push. The fix restores warm-cache speed to that path, which should bring arm64 build time toward amd64's and largely end the spurious red ✗.
Bootstrap note
`buildcache-arm64` doesn't exist until the first main/release build runs `cache-to`. Until then, PR builds hit a missing ref → buildx treats it as a cache miss (warning, not error) and builds cold — i.e. identical to today, no regression before the first main build populates it.
Verification