[docker] Fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 #24234

Merged
ishandhanani merged 2 commits into sgl-project:main from mispa-ms:misunp/fix-docker-aarch-cubin-silent-skip
May 4, 2026

@mispa-ms
Contributor

@mispa-ms mispa-ms commented May 1, 2026

Motivation

Fix two compounded issues in framework_final of docker/Dockerfile that have been silently breaking aarch64 nightly Docker builds since #22160 (2026-04-09):

  1. Silent-failure bug. The cubin-download retry block ends with ... || true (intended only to guard the trailing find cleanup). Because && and || have equal precedence and associate left to right in bash, that || true swallows the entire chain, including the [ "$success" = "1" ] fail-fast check that #22322 ([Docker] Fix Trivy CVEs, cubin download 403s, and kernels command order) added the same day. When kernels download python fails permanently, the RUN reports DONE 0, the subsequent mkdir -p /root/.cache/sglang && mv python/kernels.lock steps are skipped, and the runtime stage's COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang then fails with a misleading "not found" two stages later.

  2. No aarch64 cubin variants. kernels-community/sgl-flash-attn3 (pinned via [tool.kernels.dependencies] in python/pyproject.toml) only publishes x86_64-linux build variants. On aarch64 hosts (e.g., Grace Blackwell builds), all 3 retries always fail; the silent-failure bug above hides this inside a DONE 0 layer.
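
The precedence problem in item 1 can be reproduced outside Docker. The following is a minimal sketch, not the Dockerfile's actual RUN body: `download` and `cleanup` are hypothetical stand-ins for `kernels download python` and the trailing `find` cleanup, and the chain is shortened to the parts that matter.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real steps live in docker/Dockerfile.
download() { return 1; }   # simulates a permanent download failure
cleanup()  { return 1; }   # simulates a cleanup that is allowed to fail

# Buggy shape: bash groups this as ((check && install) && cleanup) || true,
# so the trailing `|| true` turns any earlier failure into status 0.
buggy() {
    local success=0 attempt
    for attempt in 1 2 3; do
        download && success=1 && break
    done
    [ "$success" = "1" ] && echo "install steps" && cleanup || true
}

# Fixed shape: `|| true` is confined to the cleanup subshell, so a failed
# fail-fast check short-circuits the chain and its status propagates.
fixed() {
    local success=0 attempt
    for attempt in 1 2 3; do
        download && success=1 && break
    done
    [ "$success" = "1" ] && echo "install steps" && (cleanup || true)
}

if buggy; then buggy_status=0; else buggy_status=$?; fi
if fixed; then fixed_status=0; else fixed_status=$?; fi
echo "buggy=$buggy_status fixed=$fixed_status"   # prints "buggy=0 fixed=1"
```

In the buggy shape the failed check is indistinguishable from success, which is why the layer completes and the error only surfaces at the later COPY.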

This affects every aarch64 build of this Dockerfile — observed in NVIDIA-internal nightly pipelines for several weeks. x86_64 builds are unaffected (they find a matching prebuilt cubin variant).

Modifications

In docker/Dockerfile, the framework_final editable-install RUN block:

  • Scope || true to the cleanup find only. Wrap the trailing find ... -exec rm -rf {} + in a subshell so the || true no longer swallows earlier failures in the && chain. This is the actual silent-failure fix; it makes future regressions in the retry/install chain fail loudly at the right step, not two stages later in COPY.
  • Skip the cubin download on aarch64 with an explicit log message naming kernels-community/sgl-flash-attn3. JIT compilation handles the kernels at runtime (which the silent-failure path was implicitly already doing). A short reference comment marks the branch for removal once arm cubins are published upstream.
  • Defensive mkdir -p /root/.cache/huggingface /root/.cache/sglang. /root/.cache/huggingface was previously created as a side effect of kernels download python (HF Hub cache). With that call skipped on aarch64, the adjacent COPY --from=framework_final /root/.cache/huggingface ... in the runtime stage would fail the same way /root/.cache/sglang did. Creating both directories explicitly eliminates this class of silent-skip-then-COPY-fails bug in one place.
  • Replace [ -f kernels.lock ] && mv ... || true with if [ -f ... ]; then mv ...; fi. The new form fails loudly on real mv errors (filesystem, permissions) rather than masking them, consistent with the silent-failure lesson of this very fix.
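
The last three bullets can be sketched together as a standalone shell function. This is an illustration under stated assumptions, not the Dockerfile text: `prepare_kernel_cache` and its parameters are hypothetical, the cache root is a parameter so the sketch can run outside the image, the retry loop is replaced by a placeholder, and moving kernels.lock into the sglang cache directory is an assumed destination.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the real logic is inline in a Dockerfile RUN block.
# $1 stands in for $(uname -m), $2 for /root/.cache, $3 for the lock file path.
prepare_kernel_cache() {
    local arch="$1" cache_root="$2" lock_file="$3"

    if [ "$arch" = "aarch64" ]; then
        echo "Skipping kernels-community/sgl-flash-attn3 cubin download on aarch64" \
             "(no variants published upstream); kernels will be JIT-compiled at runtime"
    else
        # Placeholder for the retry loop around `kernels download python`.
        echo "would run the cubin download retry loop here"
    fi

    # Create both directories unconditionally so later COPY --from steps
    # find them whether or not the download ran.
    mkdir -p "$cache_root/huggingface" "$cache_root/sglang" || return 1

    # Plain `if` instead of `[ -f ... ] && mv ... || true`: a missing lock
    # file is fine, but a failing mv now yields a non-zero status.
    # The destination directory is an assumption made for this sketch.
    if [ -f "$lock_file" ]; then
        mv "$lock_file" "$cache_root/sglang/" || return 1
    fi
}

demo_root="$(mktemp -d)"
prepare_kernel_cache aarch64 "$demo_root/cache" "$demo_root/kernels.lock"
demo_status=$?
```

Calling it with `aarch64` and no lock file present still leaves both cache directories in place, which is the property the runtime stage's COPY steps depend on.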

python/pyproject.toml is intentionally untouched: removing kernels-community/sgl-flash-attn3 from [tool.kernels.dependencies] would also break x86, which can use the prebuilt cubins. Selective skip in the Dockerfile is the right scope.

Diff size: 1 file changed, 18 insertions(+), 9 deletions(-).

Accuracy Tests

N/A — Dockerfile-only build fix. No model output, no kernel/forward changes.

Speed Tests and Profiling

N/A — Dockerfile-only build fix. The functional behavior on aarch64 is unchanged from the prior silent-failure path: cubins were not actually being installed (the retry was failing every time), so kernels were already being JIT-compiled at runtime. This PR just makes that explicit and removes the misleading COPY: not found failure mode.

Verification

Before (representative aarch64 nightly run, sglang bcb34da9f)

#71 [framework_final 4/8] RUN ... kernels download python && success=1 && break ...
#71 21.45 Cannot find a build variant for this system in
          kernels-community/sgl-flash-attn3 (revision: b73eb6a16dea3a785f3c491f4d174d339684c4a3).
          Available variants: torch211-cxx11-cu130-x86_64-linux,
                              torch210-cxx11-cu128-x86_64-linux,
                              torch29-cxx11-cu129-x86_64-linux,
                              torch29-cxx11-cu128-x86_64-linux,
                              torch211-cxx11-cu128-x86_64-linux,
                              torch210-cxx11-cu130-x86_64-linux,
                              torch29-cxx11-cu130-x86_64-linux
... (3 attempts, all fail with same message) ...
#71 DONE 115.4s         <-- RUN reports success despite all failures
...
#81 [runtime 10/16] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
#81 ERROR: failed to calculate checksum ... "/root/.cache/sglang": not found
ERROR: failed to build: ...

After (expected)

On aarch64:

#71 [framework_final 4/8] RUN ...
Skipping kernels-community/sgl-flash-attn3 cubin download on aarch64
(no variants published upstream); kernels will be JIT-compiled at runtime
#71 DONE ~5s
...
[runtime 10/16] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
... build completes

On x86_64: behavior unchanged — retry loop runs as before, fail-fast on permanent download error preserved.

I will attach links to a green NVIDIA-internal aarch64 nightly Docker build run once the patched build cycle completes.

Checklist

  • Code style matches existing patterns in this Dockerfile ($(uname -m) is used elsewhere at lines 146, 195, 201).
  • [N/A] Unit tests — Dockerfile build fix; covered by repo's existing CI build of docker/Dockerfile.
  • [N/A] Documentation — no user-facing API change.
  • [N/A] Accuracy/speed benchmarks (see above).
  • Pre-commit: change is whitespace-sensitive Dockerfile shell continuation; no formatter-relevant changes.

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

…bins on aarch64

The framework_final RUN block that downloads sgl-kernel cubins ends
with `... || true` (intended for the trailing find cleanup), but
bash's left associativity makes that swallow the entire chain --
including the `[ "$success" = "1" ]` check from the 3-attempt retry
loop. When `kernels download python` fails permanently (e.g. no
aarch64 variant in kernels-community/sgl-flash-attn3), the RUN reports
DONE 0 anyway and the subsequent `mkdir -p /root/.cache/sglang &&
mv python/kernels.lock` steps are skipped. The runtime stage's
  COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
then fails with a misleading "not found" two stages later.

This has been silently breaking aarch64 nightly builds since sgl-project#22160
(2026-04-09) appended the `find ... 2>/dev/null || true` cleanup to
the same `&&` chain as the retry loop that sgl-project#22322 added the same day.
kernels-community/sgl-flash-attn3 publishes only x86_64-linux variants
for the pinned revision, so all 3 retries always fail on arm; the
silent-failure bug then hides that inside a DONE 0 layer.

Fix:
  - Scope the trailing `|| true` to only the cleanup `find` (subshell)
  - Skip the cubin download on aarch64 with an explicit log message
    naming the upstream HF repo; kernels JIT-compile at runtime
  - Defensively `mkdir -p /root/.cache/huggingface /root/.cache/sglang`.
    `/root/.cache/huggingface` was previously created as a side effect
    of `kernels download python` (HF Hub cache). With that call skipped
    on aarch64, the subsequent `COPY --from=framework_final
    /root/.cache/huggingface ...` in the runtime stage would fail the
    same way `/root/.cache/sglang` did. Explicitly `mkdir -p` both
    paths to eliminate this class of bug entirely.
  - Guard the kernels.lock move with `if [ -f ... ]; then mv ...; fi`.
    This loud-fails on a real `mv` error (filesystem, permissions)
    rather than masking it with `|| true` -- consistent with the
    silent-failure lesson of this very fix.

Repro pre-fix: any aarch64 build of this Dockerfile fails with
`COPY ... /root/.cache/sglang: not found`; the actual error
("Cannot find a build variant for this system ... only x86_64-linux
available") is buried ~200 lines earlier in the build log.

Repro post-fix: aarch64 builds succeed; the build log clearly states
the cubin download is being skipped and why.

Signed-off-by: misunp <misunp@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the Dockerfile to support aarch64 architectures by skipping the sgl-kernel cubin download, which is currently unavailable for ARM, and allowing for JIT compilation at runtime. It also adds the creation of the /root/.cache/huggingface directory and includes a conditional check before moving the kernels.lock file to prevent potential errors. I have no feedback to provide.

@mispa-ms
Contributor Author

mispa-ms commented May 2, 2026

E2E verified on aarch64 against this PR’s HEAD (aa454fb): docker build --target runtime completes successfully end-to-end, and running sglang.launch_server in the resulting image serves the meta-llama/Llama-3.1-8B-Instruct benchmark (50 requests) without issues.

Ready to merge — thanks for the review!

@nvpohanh
Collaborator

nvpohanh commented May 4, 2026

@Fridge003 could you review this? Thanks!

@ishandhanani
Collaborator

LGTM

@ishandhanani ishandhanani merged commit 62a4df0 into sgl-project:main May 4, 2026
41 checks passed
ishandhanani added a commit to ai-dynamo/dynamo that referenced this pull request May 7, 2026
Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` →
`lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant)
across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub:
  digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313

These were published by sgl-project/sglang's `Release Docker Runtime
Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234
(workflow_dispatch from main with version=0.5.11). The workflow uses
the dispatched ref's Dockerfile, so main's already-fixed Dockerfile
(sgl-project/sglang#24234) builds against v0.5.11 source — no need
to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from
container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image
bundles libjsoncpp inside the mooncake wheel itself
(/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/
libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import
TransferEngine` now succeeds without the system-level package. Verified
via `docker run --rm --runtime=runc --entrypoint bash
lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine
import TransferEngine"'`.