ci: split docker-publish per-arch runners + cache-friendly dockerfile layers #22080

Merged: ethernet8023 merged 5 commits into main from fix/faster-docker on May 8, 2026

@ethernet8023

Collaborator

Cuts Docker Hub publish time from ~40 min to ~3 min on warm cache (and roughly 13-17 min on cold cache) by splitting the per-arch builds onto native runners and restructuring the Python dep install into a cache-friendly layer.

Before: one ubuntu-latest job built both arches via QEMU emulation. Every main push took 38-45 min, with arm64 eating ~80% of the wall clock because it ran under emulation and shared a gha cache scope with amd64, so the two arches clobbered each other's layer cache between runs.

After: two build jobs run in parallel, build-amd64 on ubuntu-latest and build-arm64 on ubuntu-24.04-arm (GitHub's free native arm64 runner, no QEMU), and a merge job then stitches the per-arch digests into a single multi-arch manifest using docker buildx imagetools create. Cache scopes are separated per-arch (scope=docker-amd64 / scope=docker-arm64), and the Dockerfile's Python dep install was hoisted above COPY . . so source-only commits skip the ~4-5 min dep resolve entirely.
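
The stitching step can be pictured as a small command builder. This is only a sketch under assumptions: the helper name `imagetools_create_argv`, the example tag, and the digest values are hypothetical, and the real workflow may assemble the command differently. The `docker buildx imagetools create -t <tag> <source>...` form itself is the one named above.

```python
from typing import Sequence


def imagetools_create_argv(repo: str, tag: str, digests: Sequence[str]) -> list[str]:
    """Build the argv that stitches per-arch digests into one tagged manifest."""
    if not digests:
        raise ValueError("need at least one per-arch digest")
    for digest in digests:
        if not digest.startswith("sha256:"):
            raise ValueError(f"not a sha256 digest: {digest!r}")
    # Sources are addressed by digest, so the tag is written only once,
    # after every per-arch image has already been pushed.
    sources = [f"{repo}@{digest}" for digest in digests]
    return ["docker", "buildx", "imagetools", "create", "-t", f"{repo}:{tag}", *sources]


# Hypothetical digests standing in for the build-amd64 / build-arm64 outputs.
argv = imagetools_create_argv(
    "nousresearch/hermes-agent",
    "sha-0123abc",
    ["sha256:" + "a" * 64, "sha256:" + "b" * 64],
)
print(" ".join(argv))
```

Because each build pushes by digest first, a failed arch leaves no half-updated tag behind; the tag only appears once the merge command runs.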

All existing safety behavior is preserved: per-commit sha-<sha> tags, the org.opencontainers.image.revision OCI label, the dashboard subcommand smoke test (#9153 regression guard), and the race-safe :latest advancement via the move-latest job.

Related Issue

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • .github/workflows/docker-publish.yml — replaced the single build-and-push job with four: build-amd64 (native, runs smoke tests + dashboard --help regression guard, pushes by digest), build-arm64 (native on ubuntu-24.04-arm, pushes by digest), merge (stitches digests into :sha-<sha> on main or :<release_tag> on release), and move-latest (unchanged ancestor-check logic, now gated on needs: merge). Cache scoped per-arch. Top-level cancel-in-progress: false preserved.
  • .github/workflows/docker-publish.yml — flipped move-latest's own concurrency to cancel-in-progress: false for defense-in-depth. The top-level concurrency group already serializes runs for the ref, so the old cancel=true on move-latest was dead code; if top-level is ever loosened, queued move-latests will now run serially in arrival order instead of cancelling each other. Updated the comment block to describe the real serialization source honestly.
  • Dockerfile — split the Python dep install into a cached layer above COPY . .. Before: uv pip install -e ".[all]" ran after COPY . ., so every .py change re-resolved ~258 packages. After: uv sync --frozen --no-install-project --extra all runs on just pyproject.toml + uv.lock, then uv pip install --no-cache-dir --no-deps -e "." creates the editable link in ~1s after the source copy. Uses --extra all (the composite extra intended for production) rather than --all-extras (would pull in [rl], [yc-bench], [termux-all] — git-cloned RL libs, benchmarks, Android redundancy that don't belong in the published image).
  • .github/workflows/uv-lockfile-check.yml — new blocking CI check that runs uv lock --check on PRs touching pyproject.toml / uv.lock. Since the Docker build now uses uv sync --frozen, a stale lockfile would fail the docker-publish workflow on main ~15 min into the build with no published image. This check catches that in ~10s at PR time, with a step summary telling the dev exactly which commands to run locally to fix it.
  • uv.lock — refreshed to match pyproject.toml (separate commit, pre-existing drift picked up by the new check).
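
To see why hoisting the dependency install above COPY . . matters, here is a toy model of layer caching. It is a deliberate simplification, not how BuildKit actually computes cache keys: each layer's key is a hash of the previous key, the instruction text, and the contents of the files that instruction reads. The instruction strings, file names, and file contents are illustrative assumptions.

```python
import hashlib


def layer_keys(layers, files):
    """Return one cache key per layer; each key chains in every earlier layer."""
    keys, parent = [], ""
    for instruction, inputs in layers:
        h = hashlib.sha256(parent.encode())
        h.update(instruction.encode())
        for name in sorted(inputs):
            h.update(name.encode() + b"\0" + files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reused(layers, before, after):
    """True for each layer whose key is unchanged between two source trees."""
    return [a == b for a, b in zip(layer_keys(layers, before), layer_keys(layers, after))]


# Simplified stand-ins for the old and new Dockerfile orderings.
OLD = [
    ("COPY . .", ["pyproject.toml", "uv.lock", "app.py"]),
    ('RUN uv pip install -e ".[all]"', []),
]
NEW = [
    ("COPY pyproject.toml uv.lock ./", ["pyproject.toml", "uv.lock"]),
    ("RUN uv sync --frozen --no-install-project --extra all", []),
    ("COPY . .", ["pyproject.toml", "uv.lock", "app.py"]),
    ('RUN uv pip install --no-cache-dir --no-deps -e "."', []),
]

before = {"pyproject.toml": "deps-v1", "uv.lock": "lock-v1", "app.py": "print(1)"}
after = {**before, "app.py": "print(2)"}  # a source-only commit

print("old:", reused(OLD, before, after))
print("new:", reused(NEW, before, after))
```

In this model a source-only change invalidates every layer of the old ordering, while the new ordering keeps the two dependency layers and redoes only the source copy and the editable install.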

How to Test

Verified via five manual workflow_dispatch runs on this branch (a temporary dispatch trigger + dryrun-<sha> tag scheme was used during development; both were dropped from the final history). All five runs succeeded end-to-end, produced a valid multi-arch manifest, and correctly skipped move-latest (workflow_dispatch can't touch :latest — triple-gated via event_name == 'push' + ref == 'refs/heads/main' + the pushed_sha_tag output which only gets set on push-to-main).
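
The triple gate can be modelled as a plain predicate. This is a sketch of the three conditions listed above, not the workflow's actual `if:` expression; the function name `may_move_latest` and the sample tag are hypothetical.

```python
def may_move_latest(event_name: str, ref: str, pushed_sha_tag: str) -> bool:
    """Model of the three conditions that must all hold before :latest moves."""
    return (
        event_name == "push"
        and ref == "refs/heads/main"
        and bool(pushed_sha_tag)  # only set on push-to-main
    )


# A manual dispatch never qualifies, even when run against main.
print(may_move_latest("workflow_dispatch", "refs/heads/main", ""))
# A push to main that produced a sha tag qualifies.
print(may_move_latest("push", "refs/heads/main", "sha-0123abc"))
```

Any single condition failing is enough to skip the job, which is why the dry runs could not touch :latest.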

| run | scenario | build-amd64 | build-arm64 | total wall |
|---|---|---|---|---|
| baseline (main, today) | single runner + QEMU | n/a | n/a | 38-45 min |
| 1 | per-arch split, cold cache | 12m 36s | 11m 18s | ~13 min |
| 2 | per-arch split, warm cache | 5m 30s | 7m 54s | ~8m 20s |
| 3 | + dockerfile layer, buggy --all-extras | 18m 21s | 13m 41s | ~19 min ❌ |
| 4 | + dockerfile layer, --extra all fix, cold | 7m 1s | 16m 9s | ~16m 30s |
| 5 | + dockerfile layer, warm cache | 2m 53s | 26s 🚀 | ~3m 17s |

Run 3 surfaced the --all-extras bloat bug — caught in dry-run before merge. Run 5 is the target steady state: on a source-only commit (no pyproject.toml change, cache populated), the whole pipeline finishes in ~3 minutes.
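
As a quick check on the headline numbers, the warm-cache speedup follows directly from the timings above (38-45 min baseline, ~3m 17s for run 5). The helper `parse_duration` is a hypothetical convenience written for this note, not part of the repository.

```python
import re


def parse_duration(text: str) -> int:
    """Convert strings like '3m 17s', '26s', or '38 min' into seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(min|m|s)", text):
        total += int(value) * (1 if unit == "s" else 60)
    if total == 0:
        raise ValueError(f"no duration found in {text!r}")
    return total


baseline = (parse_duration("38 min"), parse_duration("45 min"))
warm = parse_duration("3m 17s")
low, high = (round(b / warm, 1) for b in baseline)
print(f"warm-cache speedup: {low}x to {high}x")
```

With those inputs the ratio comes out at roughly 11.6x to 13.7x.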

Post-merge verification steps:

  1. Wait for the first real push to main that triggers this workflow. Confirm total wall clock is in the 12-18 min range on cold cache (new cache scopes will be empty at first).
  2. After that lands, the next push with source-only changes should complete in <5 min.
  3. Verify :latest points at the merge commit:
    docker buildx imagetools inspect nousresearch/hermes-agent:latest \
      --format '{{ json (index .Image "linux/amd64") }}' \
      | jq -r '.config.Labels."org.opencontainers.image.revision"'
  4. Pull and smoke-test both platforms on hosts that have them:
    docker pull --platform linux/amd64 nousresearch/hermes-agent:latest
    docker run --rm --platform linux/amd64 nousresearch/hermes-agent:latest hermes --help
    docker pull --platform linux/arm64 nousresearch/hermes-agent:latest
    docker run --rm --platform linux/arm64 nousresearch/hermes-agent:latest hermes --help
  5. Test the new uv-lockfile-check job by opening a throwaway PR that adds a dep to pyproject.toml without regenerating uv.lock. The check should fail with a clear step summary.
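
The revision check in step 3 could also be scripted. A minimal sketch, assuming the per-platform JSON has the `config.Labels` shape that the jq path in step 3 implies; `revision_matches` and the sample values are hypothetical.

```python
REVISION_LABEL = "org.opencontainers.image.revision"


def revision_matches(image_config: dict, expected_sha: str) -> bool:
    """Compare the image's revision label against the expected commit SHA."""
    labels = (image_config.get("config") or {}).get("Labels") or {}
    return labels.get(REVISION_LABEL) == expected_sha


# Hypothetical sample, shaped like the per-platform output from step 3.
sample = {"architecture": "amd64", "config": {"Labels": {REVISION_LABEL: "0123abc"}}}
print(revision_matches(sample, "0123abc"))
print(revision_matches({"config": {}}, "0123abc"))
```

A missing `config` or `Labels` key counts as a mismatch rather than raising, so an image built without the label fails the check cleanly.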

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass — N/A (CI-only change, no Python runtime code touched; the test suite doesn't exercise GitHub Actions workflows)
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features) — N/A, but verified via 5 live dry-run workflow executions (see timing table above)
  • I've tested on my platform: GitHub-hosted ubuntu-latest + ubuntu-24.04-arm runners (verified via workflow_dispatch on this branch)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A, workflow and Dockerfile comments are thorough
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A (no architectural change; CI workflow modification only)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — N/A (CI-only change, targets GitHub's Linux runners; the published image already supported amd64 + arm64)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

Dry-run workflow runs on this branch (workflow_dispatch trigger + dryrun-<sha> tag scheme dropped from final history):

| run | description | link |
|---|---|---|
| 1 | cold, per-arch split only | https://github.com/NousResearch/hermes-agent/actions/runs/25575794699 |
| 2 | warm, per-arch split only | https://github.com/NousResearch/hermes-agent/actions/runs/25576643168 |
| 3 | + dockerfile layer, buggy --all-extras (caught in dry-run) | https://github.com/NousResearch/hermes-agent/actions/runs/25577579491 |
| 4 | + dockerfile layer, --extra all fix, cold | https://github.com/NousResearch/hermes-agent/actions/runs/25578526011 |
| 5 | + dockerfile layer, --extra all fix, warm | https://github.com/NousResearch/hermes-agent/actions/runs/25579260593 |

Multi-arch manifest from run 4 (pre-squash dryrun tag, same schema production will produce):

$ skopeo inspect --raw docker://docker.io/nousresearch/hermes-agent:dryrun-1174fd4ff4a5bd022846f1c7ee6277221a7c2059
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    { "digest": "sha256:883a7ea8...", "platform": { "architecture": "arm64", "os": "linux" } },
    { "digest": "sha256:5dbc97bc...", "platform": { "architecture": "unknown", "os": "unknown" },
      "annotations": { "vnd.docker.reference.type": "attestation-manifest" } },
    { "digest": "sha256:a95c86b7...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:e80bfea6...", "platform": { "architecture": "unknown", "os": "unknown" },
      "annotations": { "vnd.docker.reference.type": "attestation-manifest" } }
  ]
}

Both linux/amd64 and linux/arm64 sub-manifests are present, plus SLSA build attestations for each.
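
The same reading of the index can be done programmatically. The sketch below reuses the manifest shown above (digests left truncated exactly as printed) and filters out the entries annotated as attestation manifests; the function name `runnable_platforms` is hypothetical.

```python
import json

INDEX = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {"digest": "sha256:883a7ea8...", "platform": {"architecture": "arm64", "os": "linux"}},
    {"digest": "sha256:5dbc97bc...", "platform": {"architecture": "unknown", "os": "unknown"},
     "annotations": {"vnd.docker.reference.type": "attestation-manifest"}},
    {"digest": "sha256:a95c86b7...", "platform": {"architecture": "amd64", "os": "linux"}},
    {"digest": "sha256:e80bfea6...", "platform": {"architecture": "unknown", "os": "unknown"},
     "annotations": {"vnd.docker.reference.type": "attestation-manifest"}}
  ]
}
""")


def runnable_platforms(index: dict) -> list[str]:
    """List os/arch pairs for image manifests, skipping attestation entries."""
    platforms = []
    for manifest in index.get("manifests", []):
        annotations = manifest.get("annotations", {})
        if annotations.get("vnd.docker.reference.type") == "attestation-manifest":
            continue
        platform = manifest["platform"]
        platforms.append(f'{platform["os"]}/{platform["architecture"]}')
    return sorted(platforms)


print(runnable_platforms(INDEX))
```

The attestation entries are why the index lists four manifests for a two-platform image.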

Note: a handful of dryrun-<sha> tags exist on Docker Hub from the dry runs. They're immutable digest-addressed images, harmless to leave but safe to delete after merge if desired.

@github-actions

github-actions Bot commented May 8, 2026

Contributor

🔎 Lint report: fix/faster-docker vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7822 on HEAD, 7822 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4121 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Build amd64 and arm64 natively on their own GitHub runners in
parallel, then stitch the per-arch digests into a tagged multi-arch
manifest.  Replaces the previous single-runner pattern which rebuilt
arm64 from scratch on every run because QEMU emulation + unscoped GHA
cache meant no layer reuse across invocations.

Jobs:
  build-amd64: ubuntu-latest, native, runs smoke tests, pushes by digest
  build-arm64: ubuntu-24.04-arm, native (no QEMU), pushes by digest
  merge:       stitches both digests into :sha-<sha> (main) or :<release>
  move-latest: unchanged ancestor-check logic, now needs: merge

Preserved:
  - per-commit sha-<sha> tags on main (immutable, race-free)
  - org.opencontainers.image.revision label on each per-arch image
  - dashboard subcommand smoke test (#9153 guard)
  - race-safe :latest advancement via move-latest
  - top-level cancel-in-progress: false

Changed behavior:
  - move-latest flipped to cancel-in-progress: false for defense-in-depth.
    Top-level concurrency already serializes runs for the ref, so the old
    cancel=true on move-latest was dead code.  Flipping to false prevents
    any starvation mode if top-level is ever loosened.

Cache scopes separated per-arch (scope=docker-amd64 / scope=docker-arm64)
so the two runners don't clobber each other in the gha cache backend.

Before this change, `uv pip install -e ".[all]"` ran AFTER `COPY . .`,
so every commit that changed any .py file busted the layer cache and
re-did the entire Python dep resolve + wheel download + native extension
compile (~4-5 min on cold Docker Hub cache).

Split it into two steps:

1. Before `COPY . .`: copy only pyproject.toml + uv.lock + README.md,
   then `uv sync --frozen --no-install-project --all-extras`.  This
   layer is cached unless any of those three files change, so .py-only
   commits skip the heavy work entirely.
2. After `COPY . .` (and its downstream chmod/chown step): run
   `uv pip install --no-cache-dir --no-deps -e .` to create the
   editable link.  With --no-deps this is a ~1s op — no resolution, no
   downloads, no compilation.

Combined with the per-arch runner split in the previous commit, this
should drop cache-hit build times to the sub-5-min range.

Runs `uv lock --check` on every PR and on push to main that touches
pyproject.toml, uv.lock, or this workflow itself.  Exits non-zero if
the lockfile is out of sync with pyproject.toml, blocking the PR
before it can break the Docker build on main.

Rationale: the new Dockerfile layout uses `uv sync --frozen --extra all`,
which rejects stale lockfiles.  Without this guard, a PR that changes
pyproject.toml dependencies but forgets to regenerate uv.lock would
merge fine and then break docker-publish on main (visible only after
~15 min of build time, producing no image).

On failure, the step adds a GitHub annotation and a workflow summary
block with the exact commands to run locally (`uv lock`,
`git add uv.lock`, `git commit`).

Verified locally that:
- Clean tree: `uv lock --check` succeeds (resolves in ~2ms, no work).
- Stale lockfile (added cowsay to pyproject.toml, not in lock): exits 1
  with message 'The lockfile at `uv.lock` needs to be updated'.

Adds `pull_request` trigger to docker-publish.yml so PRs that touch
Dockerfile / docker/ / pyproject.toml / uv.lock / the workflow itself
verify the image builds cleanly before merge.  Previously, Dockerfile
regressions (e.g. a stale uv.lock, a typo'd dep) would only surface
after merge when the docker-publish workflow ran on main.

Build-verify-only on PRs: the per-arch jobs run their `load: true`
build + smoke test, but the push-by-digest + artifact upload steps
remain gated on push-to-main or release.  The `merge` and
`move-latest` jobs stay excluded from PRs by their existing `if:`
gates, so :latest and SHA tags are never touched from PR runs.

Concurrency: PR runs use a PR-scoped group (`docker-<pr_number>`)
with `cancel-in-progress: true` so rapid pushes to the same PR
collapse to the latest commit.  Push/release runs keep
`cancel-in-progress: false` — every merge still gets its own
SHA-tagged image.

Also adds arm64 smoke tests (previously amd64-only): the image is
now built with `load: true` on arm64 too, then `docker run --help` +
`dashboard --help` smoke tests run identically on both arches.  Both
smoke test blocks were extracted into a new composite action at
`.github/actions/hermes-smoke-test` to keep the two jobs DRY.

New files:
  - .github/actions/hermes-smoke-test/action.yml

Modified:
  - .github/workflows/docker-publish.yml

@alt-glitch added the labels type/perf (Performance improvement or optimization), area/docker (Docker image, Compose, packaging), and P3 (Low: cosmetic, nice to have) on May 8, 2026
@ethernet8023 merged commit d10d19e into main on May 8, 2026
18 of 19 checks passed
@ethernet8023 deleted the fix/faster-docker branch on May 8, 2026 23:12
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request May 25, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026