[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError by Kangyan-Zhou · Pull Request #25958 · sgl-project/sglang

Kangyan-Zhou · 2026-05-21T07:13:52Z

Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras — installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | `-libs-base` | `-libs-cu13` |
| --- | --- | --- |
| `cutlass/_mlir/dialects/_gpu_ops_gen.py` | calls `super().__init__(self.build_generic(...))` (new-style single object) | calls `super().__init__(OPERATION_NAME, REGIONS, ...)` (old-style positional) |
| `cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so` | pybind11 binding only accepts `(operation: object)` | pybind11 binding only accepts positional args |
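One way to confirm that two installed wheels claim the same paths is to intersect the file lists recorded in their metadata. The following is a minimal sketch using only the standard library; the helper names `recorded_files` and `overlapping_paths` are hypothetical and not part of the PR. It reports which paths both distributions record, not which wheel's copy is currently on disk.

```python
from importlib import metadata


def recorded_files(dist_name: str) -> set[str]:
    """Return the file paths an installed distribution records, or an empty set."""
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return set()
    # metadata.files() returns None when the RECORD file is missing.
    return {str(path) for path in (files or [])}


def overlapping_paths(files_a: set[str], files_b: set[str]) -> list[str]:
    """Paths claimed by both distributions, ignoring each wheel's metadata directory."""
    shared = files_a & files_b
    return sorted(p for p in shared if ".dist-info/" not in p)


if __name__ == "__main__":
    base = recorded_files("nvidia-cutlass-dsl-libs-base")
    cu13 = recorded_files("nvidia-cutlass-dsl-libs-cu13")
    for path in overlapping_paths(base, cu13):
        print(path)
```

On a machine where neither distribution is installed, the script prints nothing.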

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

File ".../cutlass/_mlir/dialects/_gpu_ops_gen.py", line 1357, in __init__
    super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, ...)
TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute → cute.compile.
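The failure mode can be illustrated with a pure-Python analogy. This is only a sketch: the classes below are hypothetical stand-ins, not the real cutlass wrappers or pybind11 bindings, and Python's own TypeError message differs from the pybind11 one quoted above. It shows an old-style wrapper (positional arguments) sitting on top of a base class that accepts only a single operation object.

```python
class NewStyleBinding:
    """Stand-in for a binding whose __init__ accepts only one operation object."""

    def __init__(self, operation: object) -> None:
        self.operation = operation


class OldStyleWrapper(NewStyleBinding):
    """Stand-in for a generated wrapper that still passes positional arguments."""

    OPERATION_NAME = "gpu.module"
    _ODS_REGIONS = (1, True)

    def __init__(self, sym_name: str) -> None:
        # Old-style call: several positional arguments, which the base rejects.
        super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, sym_name)


class NewStyleWrapper(NewStyleBinding):
    """Stand-in for a wrapper that matches the single-object signature."""

    def __init__(self, sym_name: str) -> None:
        # New-style call: build one object first, then hand it over.
        super().__init__({"name": "gpu.module", "sym_name": sym_name})


def try_construct(wrapper_cls: type) -> str:
    """Return "PASS" if the wrapper can be constructed, "FAIL" on TypeError."""
    try:
        wrapper_cls("kernels")
    except TypeError:
        return "FAIL"
    return "PASS"
```

Here `try_construct(OldStyleWrapper)` fails while `try_construct(NewStyleWrapper)` succeeds, mirroring the mismatched and matched pairings in spirit.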

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| `.py` from | `.so` from | Smoke test (`gpu.GPUModuleOp(StringAttr, loc=loc)`) |
| --- | --- | --- |
| `-libs-base` | `-libs-base` | ✅ PASS |
| `-libs-cu13` | `-libs-cu13` | ✅ PASS |
| `-libs-cu13` | `-libs-base` | ❌ FAIL: exact CI TypeError, byte-for-byte |
| `-libs-base` | `-libs-cu13` | ✅ PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state):

$PIP_CMD install --force-reinstall --no-deps \
  "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
  $PIP_INSTALL_SUFFIX

The version is parsed from pyproject.toml so it stays in sync with the pinned nvidia-cutlass-dsl version. The step is skipped on non-CU13 runners, where only -libs-base is installed and no conflict is possible.
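The version probe and the reinstall arguments can be sketched in Python as follows. This is an illustrative translation, not the CI script itself: the function names are hypothetical, the regular expression is a Python rendering of the `grep -Po` pattern used in the PR, and the sample pyproject text in the usage note is an assumption.

```python
import re
from typing import Optional

# Python rendering of the grep pattern: an optional [extra] followed by "==<version>".
PIN_PATTERN = re.compile(r"nvidia-cutlass-dsl(?:\[[^\]]+\])?==([0-9A-Za-z.\-]+)")


def parse_cutlass_dsl_version(pyproject_text: str) -> Optional[str]:
    """Return the first pinned nvidia-cutlass-dsl version, or None if there is no pin."""
    match = PIN_PATTERN.search(pyproject_text)
    return match.group(1) if match else None


def force_reinstall_args(version: str) -> list[str]:
    """Build pip-style arguments that reinstall only the cu13 libs wheel."""
    return [
        "install",
        "--force-reinstall",
        "--no-deps",
        f"nvidia-cutlass-dsl-libs-cu13=={version}",
    ]
```

For example, given text containing `"nvidia-cutlass-dsl[cu13]==4.5.0"`, `parse_cutlass_dsl_version` returns `"4.5.0"`, and the resulting argument list ends with `nvidia-cutlass-dsl-libs-cu13==4.5.0`. Returning `None` when no pin is found lets the caller skip the reinstall rather than install an unpinned version.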

Validation on devbox

TypeError fix: forced BAD state on H200 devbox with UV_LINK_MODE=copy (matches CI), ran force_reinstall_cutlass_dsl_libs_cu13 — smoke test went FAIL → PASS, .so md5 changed from base's to cu13's.
LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox — both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743.
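The md5 check mentioned in the validation can be approximated with the standard library. This is a hedged sketch: it assumes the two wheel files are available locally, and the helper names and the `wheels` mapping are hypothetical. It compares the digest of an installed file with the same member inside each wheel to tell which wheel the on-disk copy came from.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Optional


def md5_of_bytes(data: bytes) -> str:
    """Hex md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def wheel_member_md5(wheel_path: Path, member: str) -> str:
    """Digest of one archive member; a wheel is an ordinary zip file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return md5_of_bytes(wheel.read(member))


def owning_wheel(
    installed_file: Path, member: str, wheels: dict[str, Path]
) -> Optional[str]:
    """Return the label of the wheel whose member matches the installed file."""
    installed_digest = md5_of_bytes(installed_file.read_bytes())
    for label, wheel_path in wheels.items():
        if wheel_member_md5(wheel_path, member) == installed_digest:
            return label
    return None
```

Running `owning_wheel` on both the `.py` and the `.so` before and after the force-reinstall would show whether the two files come from the same wheel.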

Related PRs / supersedes

[Revert] nvidia-cutlass-dsl[cu13] 4.5.1 -> 4.5.0 #25938 (revert-only attempt) — superseded; the version bump wasn't the root cause
[Fix] Try to fix error caused by latest cutedsl packages #25690 / [Fix] Fix extra uninstall of cutlass packages #25756 / Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743 / [Deps] Use cu13 extra for nvidia cutlass dsl #25576 — context for the wheel-mix history

🤖 Generated with Claude Code

CI States

Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321

…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and -libs-cu13 are installed and they ship intentionally-different content for the same site-packages paths:

cutlass/_mlir/dialects/_gpu_ops_gen.py
cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs use different MLIR Op constructor styles:

-libs-base: super().__init__(self.build_generic(...)) (new-style)
-libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the other (reproducible by mixing the wheel contents), the wrapper's super().__init__ call signature does not match what the loaded .so accepts and the runtime raises:

TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.

Tested all 4 (.py, .so) combinations on an H200 devbox: only the mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last so both .py and .so come from the same wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so this stays in sync with whatever nvidia-cutlass-dsl version the project pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).

Verified on an H200 devbox:

1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13 -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from sgl-project#25743.

gemini-code-assist

Code Review

This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.

gemini-code-assist · 2026-05-21T07:16:23Z

+        return
+    fi
+
+    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")


Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.

Suggested change

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")

Kangyan-Zhou · 2026-05-21T08:52:43Z

@mmangkad I think your suggestion is correct, thanks for sharing it!

Kangyan-Zhou · 2026-05-21T08:52:55Z

/tag-and-rerun-ci

- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the version probe doesn't depend on the working directory the script is launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the install-order bug.

mmangkad · 2026-05-21T08:56:08Z

> @mmangkad I think your suggestion is correct, thanks for sharing it!

Yeah, that was the issue: the order of install matters, not the version. ~~Could we include the upgrade back to 4.5.1 here?~~ I just saw it.

…-mix TypeError (sgl-project#25958) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

#25958's wheel-mix fix (force_reinstall_cutlass_dsl_libs_cu13) is what solves the install-time TypeError; the accompanying 4.5.0->4.5.1 bump isn't required for the fix and reintroduces a runtime regression.

py-spy on a hanging b200 test (DeepSeek-V3.2-NVFP4 + DSA + EAGLE) shows the scheduler stuck in fp4_gemm autotune at:

cutlass/cute/nvgpu/tcgen05/mma.py:557
-> findsource (inspect.py:997)
-> [flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py loop body]

The per-kernel-emission inspect.findsource walk is O(N) over loaded modules and never finishes in 4.5.1 within the 30-min step budget. 4.5.0 doesn't hit this path (per main running this test cleanly).

Holding at 4.5.0 keeps us aligned with the prior team-wide revert (#25938) while keeping the install-order safety helper from #25958.


PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure compatible GPUModuleOp API during CUDA graph capture. Upstream SGLang applies the same fix (sgl-project/sglang#25958).

* add sglang amzn2023 autorelease
* update sglang to cuda 13
* update build script
* fix tagging
* fix dockerfile path
* add cuda ref
* update cron
* fix quoting
* fix linking
* clean tags
* add sglang amzn2023 allowlist
* fix telemetry
* fix telemetry
* update model throughput threshold
* update sglang port
* mode port killing mechanism
* kill port 30000
* add port randomization
* revert port to 8000
* revert port
* debug port logs
* check ports
* fix debug
* fix debug
* revert temp changes
* revert debug statements
* fix: move FP8 and large models to H100 runners, allowlist CVE-2026-42504

  FP8 models (qwen3.5-35b-a3b-fp8, qwen3-coder-next-fp8) require fp8e4nv which is only supported on Hopper (sm_90+). The gpu-l40s-4gpu-runners label doesn't exist, causing fallback to gpu-efa-runners (A100 sm_80). LLaMA 3.3 70B OOMs on A100 runners. Move all three to gpu-h100-8gpu-runners with tp=8 and appropriate memory settings.

  Add CVE-2026-42504 to security allowlist: go/stdlib MIME header CPU exhaustion in mooncake libetcd_wrapper.so, same root cause as existing Go stdlib entries.
* fix: pin nvidia-cutlass-dsl>=4.5.2 to fix FlashInfer CUTE rmsnorm crash

  PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure compatible GPUModuleOp API during CUDA graph capture. Upstream SGLang applies the same fix (sgl-project/sglang#25958).
* fix: revert FP8 MoE models to tp=4 and move qwen3-32b to dedicated pod

  Benchmark run 27228675384 surfaced three distinct failures:
  - qwen3.5-35b-a3b-fp8 / qwen3-coder-next-fp8: tp=8 shards the FP8 MoE gate/up output_size to 64, which is not divisible by block_n=128 ("output_size ... not divisible by weight quantization block_n=128"). Revert to tp=4, the intended sharding for these FP8 models.
  - qwen3-32b: shared gpu-efa-runners pod had a leftover process holding port 8000 ("address already in use" -> warmup timeout). Move to a dedicated gpu-h100-8gpu-runners pod to avoid the collision.

  llama-3.3-70b stays at tp=8 (dense model, no block-quant constraint, needs the memory headroom).
* fix: use default SGLang port for GPU benchmarks and disable piecewise CUDA graph

  All gpu-h100-8gpu-runners benchmark jobs failed at server startup with '[Errno 98] address already in use' on port 8000; port 8000 is occupied on those pods. Remove the SGLANG_PORT=8000 override from the five GPU models so they use the SGLang default (30000), matching the x86 jobs that already pass.

  Also add --disable-piecewise-cuda-graph to qwen3-32b: it crashed during warmup_compile with 'FusedAddRMSNorm ... illegal memory access' while capturing the experimental piecewise CUDA graph (same workaround as llama-3.3-70b).

---------

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
Co-authored-by: Jyothirmai Kottu <jkottu@amazon.com>

Kangyan-Zhou requested review from Fridge003, ispobock and merrymercy as code owners May 21, 2026 07:13

github-actions Bot added the dependencies Pull requests that update a dependency file label May 21, 2026

Kangyan-Zhou force-pushed the fix_cutlass_libs_install_order branch from de055b3 to 1a0dbf2 Compare May 21, 2026 07:14

gemini-code-assist Bot reviewed May 21, 2026


github-actions Bot added the run-ci label May 21, 2026

Kangyan-Zhou added the bypass-fastfail label May 21, 2026

Merge branch 'main' into fix_cutlass_libs_install_order

13f8cf2

Kangyan-Zhou merged commit caa9f08 into sgl-project:main May 21, 2026
253 of 332 checks passed

nvpohanh mentioned this pull request May 22, 2026

[NVIDIA] [GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell) #23273

Merged

amd-bot mentioned this pull request May 24, 2026

[CI Monitor] Daily Report - 2026-05-24 bingxche/sglang-ci-bot#82

Open

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel…

7ac6cfe

…-mix TypeError (sgl-project#25958) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

mqhc2020 pushed a commit to mqhc2020/sglang that referenced this pull request Jun 2, 2026

[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel…

ff0e7f3

…-mix TypeError (sgl-project#25958) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026

[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel…

ce59b1c

…-mix TypeError (sgl-project#25958) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

Jyothirmaikottu mentioned this pull request Jun 9, 2026

[Docker] Force-reinstall nvidia-cutlass-dsl-libs-cu13 after torch for CUDA 13 #27707

Closed

4 tasks

Kangyan-Zhou merged 3 commits into sgl-project:main from Kangyan-Zhou:fix_cutlass_libs_install_order

3 participants
