[docker] Fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 by mispa-ms · Pull Request #24234 · sgl-project/sglang

mispa-ms · 2026-05-01T16:55:20Z

Motivation

Fix two compounded issues in framework_final of docker/Dockerfile that have been silently breaking aarch64 nightly Docker builds since #22160 (2026-04-09):

Silent-failure bug. The cubin-download retry block ends with ... || true (intended only to guard the trailing find cleanup). Because && and || have equal precedence in bash and associate left to right, that || true swallows the entire chain, including the [ "$success" = "1" ] fail-fast check that #22322 ([Docker] Fix Trivy CVEs, cubin download 403s, and kernels command order) added the same day. When kernels download python fails permanently, the RUN reports DONE 0, the subsequent mkdir -p /root/.cache/sglang && mv python/kernels.lock steps are skipped, and the runtime stage's COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang then fails with a misleading "not found" two stages later.
No aarch64 cubin variants. kernels-community/sgl-flash-attn3 (pinned via [tool.kernels.dependencies] in python/pyproject.toml) only publishes x86_64-linux build variants. On aarch64 hosts (e.g., Grace Blackwell builds), all 3 retries always fail; the silent-failure bug above hides this inside a DONE 0 layer.
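
The masking described in the first item can be reproduced outside Docker. The following is a minimal sketch, not the Dockerfile's actual RUN line: the function names and the harmless find /tmp call are stand-ins chosen for illustration, and success=0 plays the role of a download that failed on all retries.

```shell
#!/usr/bin/env bash
# Illustrative stand-ins only; none of these names come from docker/Dockerfile.
success=0   # pretend every download attempt failed

# Buggy shape: A && B && cleanup || true
# bash parses this as ((A && B) && cleanup) || true, so a failing A is masked.
masked_chain() {
    [ "$success" = "1" ] && echo "post-download steps" && find /tmp -maxdepth 0 >/dev/null || true
}

# Scoped shape: the || true lives inside a subshell and guards only the cleanup.
scoped_chain() {
    [ "$success" = "1" ] && echo "post-download steps" && ( find /tmp -maxdepth 0 >/dev/null || true )
}

masked_status=0
masked_chain || masked_status=$?

scoped_status=0
scoped_chain || scoped_status=$?

echo "masked=$masked_status scoped=$scoped_status"   # prints: masked=0 scoped=1
```

In both functions the "post-download steps" echo never runs, but only the scoped form reports the failure to its caller.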

This affects every aarch64 build of this Dockerfile — observed in NVIDIA-internal nightly pipelines for several weeks. x86_64 builds are unaffected (they find a matching prebuilt cubin variant).

Modifications

In docker/Dockerfile, the framework_final editable-install RUN block:

Scope || true to the cleanup find only. Wrap the trailing find ... -exec rm -rf {} + in a subshell so the || true no longer swallows earlier failures in the && chain. This is the actual silent-failure fix; it makes future regressions in the retry/install chain fail loudly at the right step rather than stages later at the runtime COPY.
Skip the cubin download on aarch64 with an explicit log message naming kernels-community/sgl-flash-attn3. JIT compilation handles the kernels at runtime (which the silent-failure path was implicitly already doing). A short reference comment marks the branch for removal once arm cubins are published upstream.
Defensive mkdir -p /root/.cache/huggingface /root/.cache/sglang. /root/.cache/huggingface was previously created as a side effect of kernels download python (HF Hub cache). With that call skipped on aarch64, the adjacent COPY --from=framework_final /root/.cache/huggingface ... in the runtime stage would fail the same way /root/.cache/sglang did. Creating both directories explicitly eliminates this class of silent-skip-then-COPY-fails bug in one place.
Replace [ -f kernels.lock ] && mv ... || true with if [ -f ... ]; then mv ...; fi. Loud-fails on real mv errors (filesystem, permissions) rather than masking them — consistent with the silent-failure lesson of this very fix.
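
Taken together, the four changes above could look roughly like the script below. This is a sketch under stated assumptions rather than the merged Dockerfile content: download_cubins is a hypothetical placeholder for kernels download python, the cache and work directories are temporary paths standing in for /root/.cache and the source tree, and the __pycache__ pattern is an assumed example because the real find expression is elided in the description.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-ins so the sketch runs anywhere without touching /root.
cache_root="$(mktemp -d)"   # plays the role of /root/.cache
work_dir="$(mktemp -d)"     # plays the role of the source checkout
arch="$(uname -m)"

# Placeholder for `kernels download python`; always succeeds in this sketch.
download_cubins() { return 0; }

cd "$work_dir"

if [ "$arch" = "aarch64" ]; then
    # Explicit, named skip instead of three doomed retries.
    echo "Skipping kernels-community/sgl-flash-attn3 cubin download on aarch64"
else
    success=0
    for attempt in 1 2 3; do
        if download_cubins; then
            success=1
            break
        fi
    done
    # Fail fast: with set -e a false test here aborts the script.
    [ "$success" = "1" ]
fi

# Create both cache directories regardless of whether the download ran.
mkdir -p "$cache_root/huggingface" "$cache_root/sglang"

# if/fi form: a missing lock file is fine, but a failing mv is not masked.
if [ -f kernels.lock ]; then
    mv kernels.lock "$cache_root/sglang/"
fi

# Cleanup is best-effort; the || true is scoped to the subshell only.
( find "$work_dir" -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null || true )

echo "cache directories ready under $cache_root"
```

Keeping the mkdir outside the architecture branch is the design point: later COPY steps then depend only on the directories existing, not on which branch ran.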

python/pyproject.toml is intentionally untouched: removing kernels-community/sgl-flash-attn3 from [tool.kernels.dependencies] would also break x86, which can use the prebuilt cubins. Selective skip in the Dockerfile is the right scope.

Diff size: 1 file changed, 18 insertions(+), 9 deletions(-).

Accuracy Tests

N/A — Dockerfile-only build fix. No model output, no kernel/forward changes.

Speed Tests and Profiling

N/A — Dockerfile-only build fix. The functional behavior on aarch64 is unchanged from the prior silent-failure path: cubins were not actually being installed (the retry was failing every time), so kernels were already being JIT-compiled at runtime. This PR just makes that explicit and removes the misleading COPY: not found failure mode.

Verification

Before (representative aarch64 nightly run, sglang `bcb34da9f`)

#71 [framework_final 4/8] RUN ... kernels download python && success=1 && break ...
#71 21.45 Cannot find a build variant for this system in
          kernels-community/sgl-flash-attn3 (revision: b73eb6a16dea3a785f3c491f4d174d339684c4a3).
          Available variants: torch211-cxx11-cu130-x86_64-linux,
                              torch210-cxx11-cu128-x86_64-linux,
                              torch29-cxx11-cu129-x86_64-linux,
                              torch29-cxx11-cu128-x86_64-linux,
                              torch211-cxx11-cu128-x86_64-linux,
                              torch210-cxx11-cu130-x86_64-linux,
                              torch29-cxx11-cu130-x86_64-linux
... (3 attempts, all fail with same message) ...
#71 DONE 115.4s         <-- RUN reports success despite all failures
...
#81 [runtime 10/16] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
#81 ERROR: failed to calculate checksum ... "/root/.cache/sglang": not found
ERROR: failed to build: ...

After (expected)

On aarch64:

#71 [framework_final 4/8] RUN ...
Skipping kernels-community/sgl-flash-attn3 cubin download on aarch64
(no variants published upstream); kernels will be JIT-compiled at runtime
#71 DONE ~5s
...
[runtime 10/16] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
... build completes

On x86_64: behavior unchanged — retry loop runs as before, fail-fast on permanent download error preserved.

I will attach links to a green NVIDIA-internal aarch64 nightly Docker build run once the patched build cycle completes.

Checklist

Code style matches existing patterns in this Dockerfile ($(uname -m) is used elsewhere at lines 146, 195, 201).
[N/A] Unit tests — Dockerfile build fix; covered by repo's existing CI build of docker/Dockerfile.
[N/A] Documentation — no user-facing API change.
[N/A] Accuracy/speed benchmarks (see above).
Pre-commit: change is whitespace-sensitive Dockerfile shell continuation; no formatter-relevant changes.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

…bins on aarch64

The framework_final RUN block that downloads sgl-kernel cubins ends with `... || true` (intended for the trailing find cleanup), but bash's left associativity makes that swallow the entire chain -- including the `[ "$success" = "1" ]` check from the 3-attempt retry loop. When `kernels download python` fails permanently (e.g. no aarch64 variant in kernels-community/sgl-flash-attn3), the RUN reports DONE 0 anyway and the subsequent `mkdir -p /root/.cache/sglang && mv python/kernels.lock` steps are skipped. The runtime stage's COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang then fails with a misleading "not found" two stages later.

This has been silently breaking aarch64 nightly builds since sgl-project#22160 (2026-04-09) appended the `find ... 2>/dev/null || true` cleanup to the same `&&` chain as the retry loop that sgl-project#22322 added the same day. kernels-community/sgl-flash-attn3 publishes only x86_64-linux variants for the pinned revision, so all 3 retries always fail on arm; the silent-failure bug then hides that inside a DONE 0 layer.

Fix:
- Scope the trailing `|| true` to only the cleanup `find` (subshell)
- Skip the cubin download on aarch64 with an explicit log message naming the upstream HF repo; kernels JIT-compile at runtime
- Defensively `mkdir -p /root/.cache/huggingface /root/.cache/sglang`. `/root/.cache/huggingface` was previously created as a side effect of `kernels download python` (HF Hub cache). With that call skipped on aarch64, the subsequent `COPY --from=framework_final /root/.cache/huggingface ...` in the runtime stage would fail the same way `/root/.cache/sglang` did. Explicitly `mkdir -p` both paths to eliminate this class of bug entirely.
- Move the kernels.lock guard with `if [ -f ... ]; then mv ...; fi`. This loud-fails on a real `mv` error (filesystem, permissions) rather than masking it with `|| true` -- consistent with the silent-failure lesson of this very fix.

Repro pre-fix: any aarch64 build of this Dockerfile fails with `COPY ... /root/.cache/sglang: not found`; the actual error ("Cannot find a build variant for this system ... only x86_64-linux available") is buried ~200 lines earlier in the build log.

Repro post-fix: aarch64 builds succeed; the build log clearly states the cubin download is being skipped and why.

Signed-off-by: misunp <misunp@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
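
Because the pre-fix root cause sits far above the COPY error, searching a saved build log for both messages is a quick way to connect them. The sketch below assumes the build output was captured to a file; here a temporary file is filled with a few lines abridged from the "Before" log in the PR description so the commands can be run as-is.

```shell
#!/usr/bin/env bash
# Sample log abridged from the PR's "Before" output; a real log would be the
# captured output of the docker build (the capture step is assumed, not shown).
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
#71 [framework_final 4/8] RUN ... kernels download python && success=1 && break ...
#71 21.45 Cannot find a build variant for this system in
#71 DONE 115.4s
#81 [runtime 10/16] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
#81 ERROR: failed to calculate checksum ... "/root/.cache/sglang": not found
EOF

# Line number of the first occurrence of the real error and of the symptom.
root_cause_line="$(grep -n -m1 'Cannot find a build variant' "$log_file" | cut -d: -f1)"
symptom_line="$(grep -n -m1 'failed to calculate checksum' "$log_file" | cut -d: -f1)"

echo "root cause at line $root_cause_line, symptom at line $symptom_line"
```

On a full nightly log the gap between the two line numbers shows how far back a reader has to scroll from the reported failure.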

gemini-code-assist

Code Review

This pull request updates the Dockerfile to support aarch64 architectures by skipping the sgl-kernel cubin download, which is currently unavailable for ARM, and allowing for JIT compilation at runtime. It also adds the creation of the /root/.cache/huggingface directory and includes a conditional check before moving the kernels.lock file to prevent potential errors. I have no feedback to provide.

mispa-ms · 2026-05-02T00:29:35Z

E2E verified on aarch64 against this PR’s HEAD (aa454fb): docker build --target runtime completes successfully end-to-end, and running sglang.launch_server in the resulting image serves the meta-llama/Llama-3.1-8B-Instruct benchmark (50 requests) without issues.

Ready to merge — thanks for the review!

nvpohanh · 2026-05-04T02:58:21Z

@Fridge003 could you review this? Thanks!

ishandhanani · 2026-05-04T16:07:24Z

LGTM

Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant) across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub: digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313

These were published by sgl-project/sglang's `Release Docker Runtime Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234 (workflow_dispatch from main with version=0.5.11). The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source; no need to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image bundles libjsoncpp inside the mooncake wheel itself (/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import TransferEngine` now succeeds without the system-level package. Verified via `docker run --rm --runtime=runc --entrypoint bash lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine import TransferEngine"'`.

mispa-ms requested review from Fridge003, HaiShaw, ishandhanani, ispobock and yctseng0211 as code owners May 1, 2026 16:55

gemini-code-assist Bot reviewed May 1, 2026

Merge branch 'main' into misunp/fix-docker-aarch-cubin-silent-skip

c3d85ab

ishandhanani approved these changes May 4, 2026

ishandhanani merged commit 62a4df0 into sgl-project:main May 4, 2026
41 checks passed

ishandhanani mentioned this pull request May 7, 2026

[v0.5.11] Cherry-pick #24234: fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 #24567

Closed

2 tasks

ishandhanani mentioned this pull request May 7, 2026

feat(sglang): bump to 0.5.11 and prune 0.5.9 fallbacks in _compat ai-dynamo/dynamo#9230

Open

9 tasks

ishandhanani merged 2 commits into sgl-project:main from mispa-ms:misunp/fix-docker-aarch-cubin-silent-skip
