[v0.5.11] Cherry-pick #24234: fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 #24567

Closed
ishandhanani wants to merge 1 commit into release/v0.5.11 from idhanani/cherry-pick-24234-v0.5.11
@ishandhanani
Collaborator

Motivation

Cherry-picks #24234 (62a4df006 on main) onto release/v0.5.11 so the Release Docker Runtime Images workflow can produce working lmsysorg/sglang:v0.5.11-runtime and v0.5.11-cu130-runtime images.

The runtime build for v0.5.11 has been failing on every attempt — most recently observed in https://github.com/sgl-project/sglang/actions/runs/25362406194 (build-arm64 failed → x86 + manifests cancelled by fail-fast). The arm64 failure traces back to:

Cannot find a build variant for this system in
kernels-community/sgl-flash-attn3 (revision: 9ce63346175212727a7ed4a53a62b5faf2a4fe83).
Available variants: torch29-cxx11-cu130-x86_64-linux, ...
sgl-kernel cubin download failed, retrying in 30s...
#70 DONE 113.1s         ← RUN reports success despite all 3 retries failing
...
#80 [runtime 10/15] COPY --from=framework_final /root/.cache/sglang /root/.cache/sglang
#80 ERROR: failed to calculate checksum ... "/root/.cache/sglang": not found

This is the same root cause that #24234 already fixed on main. The trailing || true in the editable-install RUN block (intended only for the find cleanup) silently swallows the entire && chain, including the [ "$success" = "1" ] check. Kernel-download failures on aarch64 therefore leave /root/.cache/sglang missing, and the runtime stage's COPY fails two stages later with a misleading "not found".
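The masking behavior can be reproduced outside Docker. The following is a minimal sketch, not the actual Dockerfile contents: `download_cubins`, `try_download`, `masked`, and `scoped` are hypothetical names, and the scoped form is just one way to limit `|| true` to the cleanup step (the real diff in #24234 may differ).

```shell
# Hypothetical stand-in for the cubin download; always fails, as on aarch64.
download_cubins() { return 1; }

# Retry up to 3 times and record the outcome in a flag (sleep omitted for brevity).
try_download() {
    success=0
    for attempt in 1 2 3; do
        if download_cubins; then success=1; break; fi
        echo "download failed (attempt $attempt)" >&2
    done
}

# Masked form: `|| true` is the last link of the whole chain, so a failed
# success check is swallowed and the function returns 0.
masked() { try_download && [ "$success" = "1" ] && echo "cleanup" || true; }

# Scoped form: only the cleanup command is allowed to fail; the success
# check still decides the overall exit status.
scoped() { try_download && [ "$success" = "1" ] && { echo "cleanup" || true; }; }

masked_status=0; masked || masked_status=$?
scoped_status=0; scoped || scoped_status=$?
echo "masked=$masked_status scoped=$scoped_status"
```

With the download always failing, the masked form reports exit status 0 while the scoped form reports 1, which is the difference between a RUN step that is marked DONE and one that fails the build at the right place.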

The fix isn't in v0.5.11 because #24234 was merged to main after the v0.5.11 tag was cut (git merge-base --is-ancestor 62a4df006 v0.5.11 → false). Only the nixl-stub cherry-pick (#24369, #24382) made it into the tag.
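For readers unfamiliar with the ancestry check, `git merge-base --is-ancestor A B` exits with status 0 when A is reachable from B and 1 when it is not. The sketch below demonstrates this in a throwaway repository; the tag name `v-demo` and the commit messages are made up for illustration and do not correspond to the sglang history.

```shell
# Build a throwaway repo: one commit that gets tagged, one commit after the tag.
repo=$(mktemp -d)
git -C "$repo" init -q
commit() {
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}
commit "release"
git -C "$repo" tag v-demo
commit "fix landed after the tag"
fix=$(git -C "$repo" rev-parse HEAD)

# Exit status 0 means "is an ancestor of the tag"; 1 means "is not".
if git -C "$repo" merge-base --is-ancestor "$fix" v-demo; then
    fix_in_tag=true
else
    fix_in_tag=false
fi
echo "fix_in_tag=$fix_in_tag"
rm -rf "$repo"
```

Because the second commit was created after the tag, the check reports that the fix is not contained in the tag, mirroring the `false` result quoted above.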

Downstream impact: ai-dynamo/dynamo can't bump its container references to lmsysorg/sglang:v0.5.11-runtime until the runtime workflow can produce them, which gates on this cherry-pick.

Modifications

git cherry-pick 62a4df006 onto release/v0.5.11. No conflicts; the only Dockerfile commit between v0.5.11 and the cherry-pick on release/v0.5.11 was the existing nixl-stub pick (#24382). Diff matches the upstream PR exactly (1 file, +18 / -9).

```
e48812a [docker] Fix silently-masked cubin download failure; skip prebuilt cubins on aarch64 (#24234)
```

Verification

After this lands, do one of the following:

  • Move the v0.5.11 tag to the new HEAD of release/v0.5.11 and re-trigger Release Docker Runtime Images via the tag push, or
  • Cut a v0.5.11.post1 tag from release/v0.5.11 (cleaner; auto-triggers the workflow), or
  • Run Release Docker Runtime Images via workflow_dispatch from main with version=0.5.11. This builds main's already-fixed Dockerfile against the v0.5.11 Python source and works without retagging, since actions/checkout honors github.ref and the Dockerfile only uses git clone --branch v${SGL_VERSION} to fetch the source.

I'm taking option 3 to unblock immediately, but cherry-picking here is still the right thing so future tag-triggered builds and any v0.5.11.post1 cut from this branch include the fix.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process.
  2. Approval from CODEOWNERS for docker/.
  3. Merge into release/v0.5.11.
  4. Cut v0.5.11.post1 (or move the v0.5.11 tag) to trigger the runtime workflow.

…bins on aarch64 (#24234)

Signed-off-by: misunp <misunp@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@ishandhanani ishandhanani marked this pull request as ready for review May 7, 2026 01:14