[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError #25958

Merged
Kangyan-Zhou merged 3 commits into sgl-project:main from Kangyan-Zhou:fix_cutlass_libs_install_order
May 21, 2026

@Kangyan-Zhou Kangyan-Zhou commented May 21, 2026

Collaborator

Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras — installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | -libs-base | -libs-cu13 |
| --- | --- | --- |
| `cutlass/_mlir/dialects/_gpu_ops_gen.py` | calls `super().__init__(self.build_generic(...))` (new-style single object) | calls `super().__init__(OPERATION_NAME, REGIONS, ...)` (old-style positional) |
| `cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so` | pybind11 binding only accepts `(operation: object)` | pybind11 binding only accepts positional args |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

File ".../cutlass/_mlir/dialects/_gpu_ops_gen.py", line 1357, in __init__
    super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, ...)
TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
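
The mismatch described above can be imitated in plain Python. The following is only a sketch: the class names, the `gpu.module` string, and the region tuple are hypothetical stand-ins, not the real cutlass or MLIR classes. It shows how a wrapper written for a positional constructor fails with a `TypeError` when its base accepts a single operation object.

```python
# Illustrative stand-ins only; these are NOT the real cutlass/MLIR classes.


class SingleObjectBinding:
    """Mimics a native binding whose constructor accepts one operation object."""

    def __init__(self, operation):
        self.operation = operation


class SingleObjectWrapper(SingleObjectBinding):
    """Mimics a generated wrapper written for the single-object API."""

    def __init__(self, sym_name):
        # Builds one object first, then hands it to the base constructor.
        super().__init__({"name": "gpu.module", "sym_name": sym_name})


class PositionalStyleWrapper(SingleObjectBinding):
    """Mimics a generated wrapper written for the positional-argument API."""

    OPERATION_NAME = "gpu.module"  # hypothetical value
    _ODS_REGIONS = (1, True)       # hypothetical value

    def __init__(self, sym_name):
        # Passes several positional arguments to a base that takes only one.
        super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, sym_name)


def try_construct(wrapper_cls, sym_name="kernels"):
    """Return 'PASS' if construction works, else 'FAIL: <exception type>'."""
    try:
        wrapper_cls(sym_name)
    except TypeError as exc:
        return f"FAIL: {type(exc).__name__}"
    return "PASS"


if __name__ == "__main__":
    print("matched pair:   ", try_construct(SingleObjectWrapper))
    print("mismatched pair:", try_construct(PositionalStyleWrapper))
```

The real failure comes from a pybind11 binding rather than a Python base class, so the exact message differs, but the shape of the problem is the same: the caller and the callee disagree on the constructor signature.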

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| .py from | .so from | Smoke test (`gpu.GPUModuleOp(StringAttr, loc=loc)`) |
| --- | --- | --- |
| -libs-base | -libs-base | ✅ PASS |
| -libs-cu13 | -libs-cu13 | ✅ PASS |
| -libs-cu13 | -libs-base | ❌ FAIL (exact CI TypeError, byte-for-byte) |
| -libs-base | -libs-cu13 | ✅ PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state):

$PIP_CMD install --force-reinstall --no-deps \
  "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
  $PIP_INSTALL_SUFFIX

The version is parsed from pyproject.toml so it stays in sync with the pinned nvidia-cutlass-dsl version. The step is skipped on non-CU13 runners (only -libs-base is installed there, so no conflict is possible).
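
For readers who prefer to see the version probe outside of shell, here is a minimal Python sketch of the same idea. It assumes the pin appears in pyproject.toml as `nvidia-cutlass-dsl[...]==X.Y.Z`; the function names are hypothetical, and the `pip_cmd` default is a placeholder for whatever `$PIP_CMD` expands to in the CI script.

```python
import re
import shlex

# Mirrors the grep pattern from the CI script, using a capture group
# where the shell version relies on PCRE's \K.
_PIN_RE = re.compile(r"nvidia-cutlass-dsl(?:\[[^\]]+\])?==([0-9A-Za-z.\-]+)")


def parse_cutlass_dsl_version(pyproject_text):
    """Return the first pinned nvidia-cutlass-dsl version, or None if absent."""
    match = _PIN_RE.search(pyproject_text)
    return match.group(1) if match else None


def build_reinstall_command(version, pip_cmd="pip"):
    """Build the force-reinstall command as an argument list (hypothetical helper)."""
    return shlex.split(pip_cmd) + [
        "install",
        "--force-reinstall",
        "--no-deps",
        f"nvidia-cutlass-dsl-libs-cu13=={version}",
    ]


if __name__ == "__main__":
    sample = 'dependencies = [\n  "nvidia-cutlass-dsl[cu13]==4.5.1",\n]\n'
    version = parse_cutlass_dsl_version(sample)
    if version is None:
        print("no pin found; skipping reinstall")
    else:
        print(" ".join(build_reinstall_command(version)))
```

Returning `None` when no pin is found plays the role of the `|| echo ""` fallback in the shell version, so a caller can skip the reinstall instead of installing an unpinned wheel.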

Validation on devbox

  1. TypeError fix: forced BAD state on H200 devbox with UV_LINK_MODE=copy (matches CI), ran force_reinstall_cutlass_dsl_libs_cu13 — smoke test went FAIL → PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox; both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from #25743 ("Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main").
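
The md5 comparison in step 1 can also be approximated without keeping the wheels around, by comparing the file on disk against the hashes recorded in each installed distribution's RECORD file. The sketch below is an assumption-laden illustration, not part of this PR: it assumes both `-libs-base` and `-libs-cu13` list the shared path in their RECORD with a sha256 entry, and the function names are hypothetical.

```python
import base64
import hashlib
from importlib import metadata


def record_sha256(data):
    """Encode a SHA-256 digest as wheel RECORD files do (urlsafe base64, no padding)."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def wheels_matching_on_disk(relative_path):
    """Return names of installed distributions whose RECORD hash matches the file on disk."""
    matches = []
    for dist in metadata.distributions():
        for entry in dist.files or []:
            if entry.as_posix() != relative_path:
                continue
            if entry.hash is None or entry.hash.mode != "sha256":
                continue
            target = entry.locate()
            if not target.is_file():
                continue
            if record_sha256(target.read_bytes()) == entry.hash.value:
                matches.append(dist.metadata["Name"])
    return matches


if __name__ == "__main__":
    # Path taken from the PR description; the result depends on the local environment.
    print(wheels_matching_on_disk("cutlass/_mlir/dialects/_gpu_ops_gen.py"))
```

If the `.py` and the `.so` each match a different distribution, the environment is in one of the mixed states from the table above.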

Related PRs / supersedes

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321

@github-actions github-actions Bot added the dependencies label May 21, 2026
…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:

  cutlass/_mlir/dialects/_gpu_ops_gen.py
  cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:

  -libs-base: super().__init__(self.build_generic(...))  (new-style)
  -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:

  TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.

Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).

Verified on an H200 devbox:
  1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
     -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
     -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
     The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
     regression from sgl-project#25743.
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_libs_install_order branch from de055b3 to 1a0dbf2 Compare May 21, 2026 07:14

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.

return
fi

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")

Contributor

medium

Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.

Suggested change
- CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
+ CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")

@Kangyan-Zhou

Collaborator Author

@mmangkad I think your suggestion is correct, thanks for sharing it!

@Kangyan-Zhou

Collaborator Author

/tag-and-rerun-ci

- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
  version probe doesn't depend on the working directory the script is
  launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
  TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
  re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
  install-order bug.
@mmangkad

mmangkad commented May 21, 2026

Collaborator

> @mmangkad I think your suggestion is correct, thanks for sharing it!

Yeah that was the issue because the order of install matters, not the version. Could we include the upgrade back to 4.5.1 here? I just saw it

@Kangyan-Zhou Kangyan-Zhou merged commit caa9f08 into sgl-project:main May 21, 2026
253 of 332 checks passed
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
…-mix TypeError (sgl-project#25958)

Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
alisonshao added a commit that referenced this pull request May 27, 2026
#25958's wheel-mix fix (force_reinstall_cutlass_dsl_libs_cu13) is what
solves the install-time TypeError; the accompanying 4.5.0->4.5.1 bump
isn't required for the fix and reintroduces a runtime regression.

py-spy on a hanging b200 test (DeepSeek-V3.2-NVFP4 + DSA + EAGLE) shows
the scheduler stuck in fp4_gemm autotune at:

  cutlass/cute/nvgpu/tcgen05/mma.py:557
    -> findsource (inspect.py:997)
    -> [flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py loop body]

The per-kernel-emission inspect.findsource walk is O(N) over loaded
modules and never finishes in 4.5.1 within the 30-min step budget.
4.5.0 doesn't hit this path (per main running this test cleanly).

Holding at 4.5.0 keeps us aligned with the prior team-wide revert
(#25938) while keeping the install-order safety helper from #25958.
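
To illustrate why a source lookup repeated on every kernel emission adds up, here is a small sketch that memoizes `inspect.findsource` so the lookup runs once per distinct object. This is not what the commit above does (it simply holds the pin at 4.5.0), and the function names are hypothetical; `json.dumps` is used only as a convenient object whose source is available.

```python
import functools
import inspect
import json


@functools.lru_cache(maxsize=None)
def cached_findsource(obj):
    """Memoized inspect.findsource: the real lookup runs once per distinct object."""
    lines, start = inspect.findsource(obj)
    # Return an immutable copy so cached results cannot be mutated by callers.
    return tuple(lines), start


def emit_many(obj, count):
    """Simulate a loop that asks for the same object's source on every iteration."""
    for _ in range(count):
        cached_findsource(obj)
    return cached_findsource.cache_info()


if __name__ == "__main__":
    info = emit_many(json.dumps, 100)
    print(f"real lookups: {info.misses}, served from cache: {info.hits}")
```

Without the cache, each of the 100 iterations would repeat the full lookup; with it, only the first one does.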
Jyothirmaikottu added a commit to aws/deep-learning-containers that referenced this pull request Jun 9, 2026
PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has
incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute
kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure
compatible GPUModuleOp API during CUDA graph capture.

Upstream SGLang applies the same fix (sgl-project/sglang#25958).
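
A post-install check along these lines could confirm that the reinstall produced the intended minimum version. This is a minimal sketch with hypothetical helper names; it compares only the numeric release segment and ignores pre-release or post-release suffixes, which is a simplification of full PEP 440 ordering.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric release segment, e.g. '4.5.2' -> (4, 5, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare release segments, padding with zeros so '4.5' equals '4.5.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_cutlass_dsl_ok(minimum="4.5.2"):
    """Return True if nvidia-cutlass-dsl is installed at or above the minimum."""
    try:
        installed = metadata.version("nvidia-cutlass-dsl")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    print("nvidia-cutlass-dsl >= 4.5.2:", installed_cutlass_dsl_ok())
```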
Jyothirmaikottu added a commit to aws/deep-learning-containers that referenced this pull request Jun 11, 2026
* add sglang amzn2023 autorelease
* update sglang to cuda 13
* update build script
* fix tagging
* fix dockerfile path
* add cuda ref
* update cron
* fix quoting
* fix linking
* clean tags
* add sglang amzn2023 allowlist
* fix telemetry
* fix telemetry
* update model throughput threshold
* update sglang port
* mode port killing mechanism
* kill port 30000
* add port randomization
* revert port to 8000
* revert port
* debug port logs
* check ports
* fix debug
* fix debug
* revert temp changes
* revert debug statements

* fix: move FP8 and large models to H100 runners, allowlist CVE-2026-42504

FP8 models (qwen3.5-35b-a3b-fp8, qwen3-coder-next-fp8) require fp8e4nv
which is only supported on Hopper (sm_90+). The gpu-l40s-4gpu-runners
label doesn't exist, causing fallback to gpu-efa-runners (A100 sm_80).
LLaMA 3.3 70B OOMs on A100 runners. Move all three to gpu-h100-8gpu-runners
with tp=8 and appropriate memory settings.

Add CVE-2026-42504 to security allowlist — go/stdlib MIME header CPU
exhaustion in mooncake libetcd_wrapper.so, same root cause as existing
Go stdlib entries.

* fix: pin nvidia-cutlass-dsl>=4.5.2 to fix FlashInfer CUTE rmsnorm crash

PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has
incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute
kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure
compatible GPUModuleOp API during CUDA graph capture.

Upstream SGLang applies the same fix (sgl-project/sglang#25958).

* fix: revert FP8 MoE models to tp=4 and move qwen3-32b to dedicated pod

Benchmark run 27228675384 surfaced three distinct failures:

- qwen3.5-35b-a3b-fp8 / qwen3-coder-next-fp8: tp=8 shards the FP8 MoE
  gate/up output_size to 64, which is not divisible by block_n=128
  ("output_size ... not divisible by weight quantization block_n=128").
  Revert to tp=4 — the intended sharding for these FP8 models.

- qwen3-32b: shared gpu-efa-runners pod had a leftover process holding
  port 8000 ("address already in use" -> warmup timeout). Move to a
  dedicated gpu-h100-8gpu-runners pod to avoid the collision.

llama-3.3-70b stays at tp=8 (dense model, no block-quant constraint,
needs the memory headroom).

* fix: use default SGLang port for GPU benchmarks and disable piecewise CUDA graph

All gpu-h100-8gpu-runners benchmark jobs failed at server startup with
'[Errno 98] address already in use' on port 8000; port 8000 is occupied
on those pods. Remove the SGLANG_PORT=8000 override from the five GPU
models so they use the SGLang default (30000), matching the x86 jobs
that already pass.

Also add --disable-piecewise-cuda-graph to qwen3-32b: it crashed during
warmup_compile with 'FusedAddRMSNorm ... illegal memory access' while
capturing the experimental piecewise CUDA graph (same workaround as
llama-3.3-70b).

---------

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
Co-authored-by: Jyothirmai Kottu <jkottu@amazon.com>

Labels

bypass-fastfail, dependencies, run-ci
