[docker] Fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 #24234
…bins on aarch64

The framework_final RUN block that downloads sgl-kernel cubins ends with `... || true` (intended for the trailing find cleanup), but bash's left associativity makes that swallow the entire chain -- including the `[ "$success" = "1" ]` check from the 3-attempt retry loop. When `kernels download python` fails permanently (e.g. no aarch64 variant in kernels-community/sgl-flash-attn3), the RUN reports DONE 0 anyway and the subsequent `mkdir -p /root/.cache/sglang && mv python/kernels.lock` steps are skipped. The runtime stage's `COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang` then fails with a misleading "not found" two stages later.

This has been silently breaking aarch64 nightly builds since sgl-project#22160 (2026-04-09) appended the `find ... 2>/dev/null || true` cleanup to the same `&&` chain as the retry loop that sgl-project#22322 added the same day. kernels-community/sgl-flash-attn3 publishes only x86_64-linux variants for the pinned revision, so all 3 retries always fail on arm; the silent-failure bug then hides that inside a DONE 0 layer.

Fix:
- Scope the trailing `|| true` to only the cleanup `find` (subshell).
- Skip the cubin download on aarch64 with an explicit log message naming the upstream HF repo; kernels JIT-compile at runtime.
- Defensively `mkdir -p /root/.cache/huggingface /root/.cache/sglang`. `/root/.cache/huggingface` was previously created as a side effect of `kernels download python` (HF Hub cache). With that call skipped on aarch64, the subsequent `COPY --from=framework_final /root/.cache/huggingface ...` in the runtime stage would fail the same way `/root/.cache/sglang` did. Explicitly `mkdir -p` both paths to eliminate this class of bug entirely.
- Replace the kernels.lock guard with `if [ -f ... ]; then mv ...; fi`. This loud-fails on a real `mv` error (filesystem, permissions) rather than masking it with `|| true` -- consistent with the silent-failure lesson of this very fix.
Repro pre-fix: any aarch64 build of this Dockerfile fails with `COPY ... /root/.cache/sglang: not found`; the actual error ("Cannot find a build variant for this system ... only x86_64-linux available") is buried ~200 lines earlier in the build log.

Repro post-fix: aarch64 builds succeed; the build log clearly states the cubin download is being skipped and why.

Signed-off-by: misunp <misunp@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request updates the Dockerfile to support aarch64 architectures by skipping the sgl-kernel cubin download, which is currently unavailable for ARM, and allowing for JIT compilation at runtime. It also adds the creation of the /root/.cache/huggingface directory and includes a conditional check before moving the kernels.lock file to prevent potential errors. I have no feedback to provide.
E2E verified on aarch64 against this PR's HEAD (aa454fb): `docker build --target runtime` completes successfully end-to-end, and running `sglang.launch_server` in the resulting image serves the meta-llama/Llama-3.1-8B-Instruct benchmark (50 requests) without issues. Ready to merge. Thanks for the review!

@Fridge003 could you review this? Thanks!

LGTM
Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant) across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub: digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313

These were published by sgl-project/sglang's `Release Docker Runtime Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234 (workflow_dispatch from main with version=0.5.11). The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source; there is no need to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image bundles libjsoncpp inside the mooncake wheel itself (/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import TransferEngine` now succeeds without the system-level package. Verified via `docker run --rm --runtime=runc --entrypoint bash lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine import TransferEngine"'`.
Motivation
Fix two compounded issues in `framework_final` of `docker/Dockerfile` that have been silently breaking aarch64 nightly Docker builds since #22160 (2026-04-09):

- Silent-failure bug. The cubin-download retry block ends with `... || true` (intended only to guard the trailing `find` cleanup), but bash gives `&&` and `||` equal precedence and evaluates them left to right, so the trailing `|| true` swallows the entire chain, including the `[ "$success" = "1" ]` fail-fast check that #22322 ([Docker] Fix Trivy CVEs, cubin download 403s, and kernels command order) added the same day. When `kernels download python` fails permanently, the RUN reports `DONE 0`, the subsequent `mkdir -p /root/.cache/sglang && mv python/kernels.lock` steps are skipped, and the runtime stage's `COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang` then fails with a misleading `"not found"` two stages later.
- No aarch64 cubin variants. `kernels-community/sgl-flash-attn3` (pinned via `[tool.kernels.dependencies]` in `python/pyproject.toml`) only publishes `x86_64-linux` build variants. On aarch64 hosts (e.g., Grace Blackwell builds), all 3 retries always fail; the silent-failure bug above hides this inside a `DONE 0` layer.

This affects every aarch64 build of this Dockerfile and has been observed in NVIDIA-internal nightly pipelines for several weeks. x86_64 builds are unaffected (they find a matching prebuilt cubin variant).
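The silent-failure mechanism can be reproduced in isolation. The sketch below is a toy illustration, not the Dockerfile's actual RUN block: `false` stands in for the failed `[ "$success" = "1" ]` check, the inner `true` stands in for the cleanup `find`, and the function names `unscoped` and `scoped` are made up for this example.

```shell
#!/usr/bin/env bash
# Toy stand-ins (assumptions): `false` = failed fail-fast check,
# inner `true` = the cleanup find. Function names are hypothetical.

# Pre-fix shape: A && B && C || true parses as ((A && B && C) || true),
# so the failure of A is replaced by the success of `true`.
unscoped() { false && echo "post-install steps" && true || true; }

# Post-fix shape: the subshell confines `|| true` to the cleanup step,
# so the failure of A is the exit status of the whole chain.
scoped() { false && echo "post-install steps" && (true || true); }

unscoped && echo "unscoped chain: reported success"      # this line prints
scoped || echo "scoped chain: reported failure ($?)"     # prints "... failure (1)"
```

In both variants the "post-install steps" never run; only the scoped form lets the build see that.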
Modifications
In `docker/Dockerfile`, the `framework_final` editable-install RUN block:

- Scope `|| true` to the cleanup find only. Wrap the trailing `find ... -exec rm -rf {} +` in a subshell so the `|| true` no longer swallows earlier failures in the `&&` chain. This is the actual silent-failure fix; it makes future regressions in the retry/install chain fail loudly at the right step, not three stages later in `COPY`.
- Skip the cubin download on aarch64, with an explicit log message naming the upstream HF repo `kernels-community/sgl-flash-attn3`. JIT compilation handles the kernels at runtime (which the silent-failure path was implicitly already doing). A short reference comment marks the branch for removal once arm cubins are published upstream.
- Defensively `mkdir -p /root/.cache/huggingface /root/.cache/sglang`. `/root/.cache/huggingface` was previously created as a side effect of `kernels download python` (HF Hub cache). With that call skipped on aarch64, the adjacent `COPY --from=framework_final /root/.cache/huggingface ...` in the runtime stage would fail the same way `/root/.cache/sglang` did. Creating both directories explicitly eliminates this class of silent-skip-then-COPY-fails bug in one place.
- Replace `[ -f kernels.lock ] && mv ... || true` with `if [ -f ... ]; then mv ...; fi`. This loud-fails on real `mv` errors (filesystem, permissions) rather than masking them, consistent with the silent-failure lesson of this very fix.

`python/pyproject.toml` is intentionally untouched: removing `kernels-community/sgl-flash-attn3` from `[tool.kernels.dependencies]` would also break x86, which can use the prebuilt cubins. Selective skip in the Dockerfile is the right scope.

Diff size: `1 file changed, 18 insertions(+), 9 deletions(-)`.

Accuracy Tests
N/A — Dockerfile-only build fix. No model output, no kernel/forward changes.
Speed Tests and Profiling
N/A: Dockerfile-only build fix. The functional behavior on aarch64 is unchanged from the prior silent-failure path: cubins were not actually being installed (the retry was failing every time), so kernels were already being JIT-compiled at runtime. This PR just makes that explicit and removes the misleading `COPY: not found` failure mode.

Verification
Before (representative aarch64 nightly run, sglang `bcb34da9f`)

After (expected)
On aarch64:
On x86_64: behavior unchanged — retry loop runs as before, fail-fast on permanent download error preserved.
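The two branches above can be sketched as a small shell script. This is a simplified illustration under stated assumptions, not the text of the patched Dockerfile: `download_cubins` is a hypothetical stand-in for `kernels download python`, `CACHE_ROOT` replaces `/root/.cache` so the sketch can run anywhere, `LOCK_FILE` defaults to the `python/kernels.lock` path named in the PR, and the skip message is invented wording.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: download_cubins, fetch_or_skip, prepare_cache,
# CACHE_ROOT and LOCK_FILE are hypothetical names for this example.
CACHE_ROOT="$(mktemp -d)"
LOCK_FILE="${LOCK_FILE:-python/kernels.lock}"
ARCH="${ARCH:-$(uname -m)}"

# Pretend the upstream repo only publishes x86_64 variants.
download_cubins() { [ "$ARCH" = "x86_64" ]; }

fetch_or_skip() {
    if [ "$ARCH" = "aarch64" ]; then
        # Explicit, logged skip instead of three doomed retries.
        echo "aarch64: skipping prebuilt cubin download; kernels JIT-compile at runtime"
        return 0
    fi
    success=0
    for attempt in 1 2 3; do
        if download_cubins; then success=1; break; fi
        echo "cubin download attempt $attempt failed" >&2
    done
    [ "$success" = "1" ]   # fail fast when every attempt failed
}

prepare_cache() {
    fetch_or_skip || return 1
    # Create both COPY sources unconditionally so a skipped download
    # cannot leave a later COPY without its directory.
    mkdir -p "$CACHE_ROOT/huggingface" "$CACHE_ROOT/sglang" || return 1
    # Guarded move: a missing lock file is fine, a failing mv is not.
    if [ -f "$LOCK_FILE" ]; then
        mv "$LOCK_FILE" "$CACHE_ROOT/sglang/" || return 1
    fi
}
```

Calling `prepare_cache` with `ARCH=aarch64` logs the skip and still creates both cache directories; with an architecture that has no cubins and is not explicitly skipped, it returns non-zero at the download step rather than later.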
I will attach links to a green NVIDIA-internal aarch64 nightly Docker build run once the patched build cycle completes.
Checklist
`$(uname -m)` is used elsewhere at lines 146, 195, 201).
`docker/Dockerfile`.

Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`