ci(docker): use registry-backed build cache for arm64 #37129

Merged
benbarclay merged 1 commit into main from fix/docker-arm64-registry-cache on Jun 2, 2026
@benbarclay

Collaborator

Problem

A report of the arm64 Docker build "regularly failing" turned out to be cancellation noise, not real failures. Across the last 110 completed `docker-publish` runs:

| outcome   | amd64 | arm64 |
|-----------|-------|-------|
| success   | 58    | 50    |
| cancelled | 5     | 13    |
| failure   | 1     | 1     |

The single `failure` was a PR-code esbuild error that failed identically on both arches, so it was not arm64-specific. The real signal is that arm64 is cancelled 2.6× as often as amd64 (13 vs 5). In 8 recent runs, amd64 succeeded while arm64 was cancelled in the same superseded run. A cancelled job renders as a red ✗ in PR checks, which reads as "the arm64 build keeps failing."
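For anyone re-checking the numbers, the cancellation ratio can be recomputed from the outcome counts in the table above. This is only a quick arithmetic sketch; the dictionary layout and the function names are illustrative and do not come from the workflow or any tooling in the repository.

```python
# Outcome counts copied from the table in the PR description.
outcomes = {
    "amd64": {"success": 58, "cancelled": 5, "failure": 1},
    "arm64": {"success": 50, "cancelled": 13, "failure": 1},
}


def cancel_ratio(counts: dict) -> float:
    """arm64 cancellations divided by amd64 cancellations."""
    return counts["arm64"]["cancelled"] / counts["amd64"]["cancelled"]


def cancel_share(counts: dict, arch: str) -> float:
    """Fraction of one arch's listed jobs that ended as cancelled."""
    per_arch = counts[arch]
    return per_arch["cancelled"] / sum(per_arch.values())


ratio = cancel_ratio(outcomes)
print(f"arm64/amd64 cancellation ratio: {ratio:.1f}x")
print(f"amd64 cancelled share: {cancel_share(outcomes, 'amd64'):.1%}")
print(f"arm64 cancelled share: {cancel_share(outcomes, 'arm64'):.1%}")
```

The failure counts are identical on both arches, so the difference between the two columns is entirely in how jobs split between success and cancellation.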

Why arm64 is the cancellation casualty

arm64 PR builds run fully uncached. The previous `type=gha` cache was removed from arm64 PRs because cold-cache arm64 builds outlived GitHub's short-lived Azure cache SAS token and crashed on a cache blob op before the smoke test. Uncached → arm64 PR builds are ~45% slower than amd64 (median 553s vs 382s, max 819s), so on fast-iterated branches `cancel-in-progress` kills the still-running arm64 job while amd64 has already finished.
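A small sketch of why the slower job is the one that gets cancelled, using the median durations quoted above. The cancellation model is a deliberate simplification and an assumption of this example: it treats a job as cancelled whenever the next push lands before the job finishes, and ignores queueing and runner start-up time. The function name and the 450s push interval are hypothetical.

```python
# Median PR build durations quoted in the description, in seconds.
MEDIAN_ARM64_S = 553
MEDIAN_AMD64_S = 382

# Relative slowdown of arm64 versus amd64.
slowdown = MEDIAN_ARM64_S / MEDIAN_AMD64_S - 1
print(f"arm64 is about {slowdown:.0%} slower than amd64")


def cancelled_by_next_push(duration_s: float, next_push_after_s: float) -> bool:
    """Simplified cancel-in-progress model (assumption): a job is cancelled
    if a superseding push arrives while it is still running."""
    return next_push_after_s < duration_s


# Hypothetical push interval that falls between the two medians.
next_push_after_s = 450
print("amd64 cancelled:", cancelled_by_next_push(MEDIAN_AMD64_S, next_push_after_s))
print("arm64 cancelled:", cancelled_by_next_push(MEDIAN_ARM64_S, next_push_after_s))
```

Under this model, any push interval between the two medians lets amd64 finish while arm64 is cancelled, which matches the pattern described for superseded runs.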

Fix

Switch arm64 to a registry-backed cache on ghcr.io (`type=registry`, ref `ghcr.io/nousresearch/hermes-agent:buildcache-arm64`).

Why this won't repeat the gha failure: the registry cache authenticates with the job-lifetime `GITHUB_TOKEN`, not a time-boxed SAS token minted at job start that expires mid-build. The exact cold-build-outlives-token failure mode that killed `type=gha` on slow arm64 builds cannot recur.

  • PR builds: `cache-from` only (read-only) — pull warm layers from the last main build, never write, so rapid PR pushes don't race on cache writes or pollute the cache ref.
  • main/release builds: `cache-from` + `cache-to` (`mode=max`) to populate the cache and let the digest push reuse the smoke-test build's layers.
  • Adds `packages: write` permission + a ghcr.io login for the cache.
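The read-only versus read-write split in the bullets above can be sketched as a function that picks cache flags per event type. The workflow itself is not reproduced here; this example expresses the same policy as `docker buildx build` command-line flags, and the function name and `is_pull_request` parameter are illustrative assumptions rather than anything taken from the PR.

```python
# Cache ref as given in the PR description.
CACHE_REF = "ghcr.io/nousresearch/hermes-agent:buildcache-arm64"


def arm64_cache_flags(is_pull_request: bool, ref: str = CACHE_REF) -> list[str]:
    """Return buildx cache flags for an arm64 build (hypothetical helper).

    Every build reads from the registry cache. Only non-PR builds
    (main/release) also write to it, with mode=max so that layers from
    intermediate stages are exported as well.
    """
    flags = ["--cache-from", f"type=registry,ref={ref}"]
    if not is_pull_request:
        flags += ["--cache-to", f"type=registry,ref={ref},mode=max"]
    return flags


print("PR build:  ", " ".join(arm64_cache_flags(is_pull_request=True)))
print("main build:", " ".join(arm64_cache_flags(is_pull_request=False)))
```

Keeping PR builds free of `--cache-to` is what prevents concurrent PR pushes from racing on writes to the shared cache ref.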

amd64 keeps its gha cache unchanged — it builds fast enough to stay inside the SAS token's lifetime and never hit this failure mode.

Impact case

This is the default path for every arm64 PR build, not a narrow edge case: every contributor iterating on a Docker-touching PR currently sees arm64 finish last and get cancelled on their next push. The fix restores warm-cache speed to that path, which should bring arm64 build time toward amd64's and largely end the spurious red ✗.

Bootstrap note

`buildcache-arm64` doesn't exist until the first main/release build runs `cache-to`. Until then, PR builds hit a missing ref → buildx treats it as a cache miss (warning, not error) and builds cold — i.e. identical to today, no regression before the first main build populates it.

Verification

  • `actionlint` clean on the workflow.
  • YAML parses.
  • Behavior verification (real cache hit rate + cancellation-rate drop) requires this to land on `main` so the first `cache-to` populates the ref — flagging that it's a post-merge observation, not something the PR's own CI can prove (PR builds are read-only and the ref is empty until main writes it).

The arm64 PR build ran fully uncached because the previous gha cache
backend's short-lived Azure SAS token expired mid-build on slow
cold-cache arm64 runs and crashed before the smoke test. Uncached arm64
PR builds were ~45% slower than amd64 (median 553s vs 382s), making the
arm64 job the one most often cancelled on supersede — surfacing as a red
X in PR checks and reading as 'the arm64 build keeps failing'.

Switch arm64 to a registry-backed cache on ghcr.io
(type=registry, ref ghcr.io/nousresearch/hermes-agent:buildcache-arm64).
Its credential is the job-lifetime GITHUB_TOKEN, not a time-boxed SAS
token, so the cold-build-outlives-token failure mode cannot recur.

- PR builds: cache-from only (read-only) — warm layers, no write races,
  no cache-ref pollution from rapid PR pushes.
- main/release builds: cache-from + cache-to (mode=max) to populate the
  cache for subsequent PR/main builds and let the digest push reuse the
  smoke-test build's layers.
- Add packages: write permission and a ghcr.io login for the cache.

amd64 keeps its gha cache: it builds fast enough to stay inside the SAS
token's lifetime, so it never hit this failure mode.
@github-actions

github-actions Bot commented Jun 2, 2026

Contributor

🔎 Lint report: fix/docker-arm64-registry-cache vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9602 on HEAD, 9602 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4976 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@benbarclay benbarclay merged commit 40ae170 into main Jun 2, 2026
22 of 23 checks passed
@benbarclay benbarclay deleted the fix/docker-arm64-registry-cache branch June 2, 2026 04:03
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…7129)
