[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError #25958
…-mix TypeError
nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:
cutlass/_mlir/dialects/_gpu_ops_gen.py
cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so
Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:
-libs-base: super().__init__(self.build_generic(...)) (new-style)
-libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)
If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:
TypeError: __init__(): incompatible function arguments.
1. __init__(self, operation: object) -> None
surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.
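
The mismatch can be reproduced in miniature without CUTLASS. What follows is a pure-Python analogy, not the generated wrapper or the pybind11 binding: NewStyleBinding, OldStyleBinding, make_old_style_wrapper and construct_outcome are hypothetical names, and Python's own TypeError text differs from the pybind11 message quoted above. It only models a wrapper written for one constructor style being paired with a base class that expects the other.

```python
class NewStyleBinding:
    """Hypothetical stand-in for a compiled base that takes one pre-built operation."""

    def __init__(self, operation):
        self.operation = operation


class OldStyleBinding:
    """Hypothetical stand-in for a compiled base that builds the operation itself."""

    def __init__(self, name, regions, *, attributes=None):
        self.operation = (name, regions, dict(attributes or {}))


def make_old_style_wrapper(binding_cls):
    """Return a wrapper class that calls its base with old-style positional arguments."""

    class GPUModuleOp(binding_cls):
        OPERATION_NAME = "gpu.module"  # illustrative value only
        REGIONS = 1

        def __init__(self, sym_name):
            # Old-style call: operation name and region count passed positionally.
            super().__init__(
                self.OPERATION_NAME, self.REGIONS, attributes={"sym_name": sym_name}
            )

    return GPUModuleOp


def construct_outcome(wrapper_cls):
    """Report whether constructing the wrapper succeeds or raises TypeError."""
    try:
        wrapper_cls("kernels")
    except TypeError:
        return "TypeError"
    return "ok"


if __name__ == "__main__":
    print("matched pair:   ", construct_outcome(make_old_style_wrapper(OldStyleBinding)))
    print("mismatched pair:", construct_outcome(make_old_style_wrapper(NewStyleBinding)))
```

The wrapper source is identical in both runs; only the base class it is paired with changes, which is why the failure depends on which files end up installed together rather than on the wrapper code itself.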
Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).
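
A minimal Python sketch of the same idea, under stated assumptions: the real helper is a shell function in the CI install script, so the function names and the cuda_major parameter here are hypothetical, the use of pip with --no-deps is an assumption (the CI may invoke a different installer), and pinning -libs-cu13 to the same version string as nvidia-cutlass-dsl is inferred from the description above rather than confirmed.

```python
import re
import sys

# Mirrors the grep pattern used by the CI script: optional [extras], then ==<version>.
PIN_PATTERN = re.compile(r"nvidia-cutlass-dsl(?:\[[^\]]+\])?==([0-9A-Za-z.\-]+)")


def pinned_cutlass_dsl_version(pyproject_text):
    """Return the pinned nvidia-cutlass-dsl version, or None if no pin is found."""
    match = PIN_PATTERN.search(pyproject_text)
    return match.group(1) if match else None


def force_reinstall_command(pyproject_text, cuda_major):
    """Build an argv list that reinstalls -libs-cu13 last, or None to skip.

    Assumption: the step is skipped on non-CU13 runners and when no pin is found.
    """
    if cuda_major != 13:
        return None
    version = pinned_cutlass_dsl_version(pyproject_text)
    if version is None:
        return None
    return [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall",  # overwrite files even if the package is already installed
        "--no-deps",          # assumption: leave every other package untouched
        f"nvidia-cutlass-dsl-libs-cu13=={version}",
    ]


if __name__ == "__main__":
    sample = 'dependencies = [\n  "nvidia-cutlass-dsl[cu13]==4.5.0",\n]\n'  # made-up excerpt
    print(force_reinstall_command(sample, cuda_major=13))
```

Returning an argv list instead of running the command keeps the sketch free of side effects; a caller could pass the result to subprocess.run once it has confirmed the runner really is CU13.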
Verified on an H200 devbox:
1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
-> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
-> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
regression from sgl-project#25743.
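
The md5 check in step 1 can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the reference files are assumed to have been extracted from the two wheels beforehand.

```python
import hashlib


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_wheel(installed_path, reference_paths):
    """Return the label of the reference file whose digest equals the installed file's.

    reference_paths maps a label (for example "base" or "cu13") to a file
    extracted from that wheel. Returns None when nothing matches.
    """
    installed_digest = md5_of(installed_path)
    for label, reference in reference_paths.items():
        if md5_of(reference) == installed_digest:
            return label
    return None
```

Running matching_wheel on the installed .so before and after the reinstall would show the label moving from "base" to "cu13", which is the transition reported above.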
de055b3 to 1a0dbf2
Code Review
This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.
        return
    fi

    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.
-    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
+    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")
@mmangkad I think your suggestion is correct, thanks for sharing it!

/tag-and-rerun-ci
- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
version probe doesn't depend on the working directory the script is
launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
install-order bug.
Yeah, that was the issue: the install order matters, not the version.
…-mix TypeError (sgl-project#25958) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
#25958's wheel-mix fix (force_reinstall_cutlass_dsl_libs_cu13) is what solves the install-time TypeError; the accompanying 4.5.0 -> 4.5.1 bump isn't required for the fix and reintroduces a runtime regression.

py-spy on a hanging b200 test (DeepSeek-V3.2-NVFP4 + DSA + EAGLE) shows the scheduler stuck in fp4_gemm autotune at:

    cutlass/cute/nvgpu/tcgen05/mma.py:557
    -> findsource (inspect.py:997)
    -> [flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py loop body]

The per-kernel-emission inspect.findsource walk is O(N) over loaded modules and never finishes in 4.5.1 within the 30-min step budget. 4.5.0 doesn't hit this path (per main running this test cleanly).

Holding at 4.5.0 keeps us aligned with the prior team-wide revert (#25938) while keeping the install-order safety helper from #25958.
PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure compatible GPUModuleOp API during CUDA graph capture. Upstream SGLang applies the same fix (sgl-project/sglang#25958).
* add sglang amzn2023 autorelease
* update sglang to cuda 13
* update build script
* fix tagging
* fix dockerfile path
* add cuda ref
* update cron
* fix quoting
* fix linking
* clean tags
* add sglang amzn2023 allowlist
* fix telemetry
* fix telemetry
* update model throughput threshold
* update sglang port
* mode port killing mechanism
* kill port 30000
* add port randomization
* revert port to 8000
* revert port
* debug port logs
* check ports
* fix debug
* fix debug
* revert temp changes
* revert debug statements
* fix: move FP8 and large models to H100 runners, allowlist CVE-2026-42504

  FP8 models (qwen3.5-35b-a3b-fp8, qwen3-coder-next-fp8) require fp8e4nv, which is only supported on Hopper (sm_90+). The gpu-l40s-4gpu-runners label doesn't exist, causing fallback to gpu-efa-runners (A100 sm_80). LLaMA 3.3 70B OOMs on A100 runners. Move all three to gpu-h100-8gpu-runners with tp=8 and appropriate memory settings.

  Add CVE-2026-42504 to the security allowlist: go/stdlib MIME header CPU exhaustion in mooncake libetcd_wrapper.so, same root cause as existing Go stdlib entries.
* fix: pin nvidia-cutlass-dsl>=4.5.2 to fix FlashInfer CUTE rmsnorm crash

  PyTorch 2.11.0+cu130 bundles an older nvidia-cutlass-dsl that has incompatible MLIR bindings with FlashInfer 0.6.11.post1's rmsnorm_cute kernel. Force-reinstall cutlass-dsl>=4.5.2 after torch re-pin to ensure compatible GPUModuleOp API during CUDA graph capture. Upstream SGLang applies the same fix (sgl-project/sglang#25958).
* fix: revert FP8 MoE models to tp=4 and move qwen3-32b to dedicated pod

  Benchmark run 27228675384 surfaced three distinct failures:
  - qwen3.5-35b-a3b-fp8 / qwen3-coder-next-fp8: tp=8 shards the FP8 MoE gate/up output_size to 64, which is not divisible by block_n=128 ("output_size ... not divisible by weight quantization block_n=128"). Revert to tp=4, the intended sharding for these FP8 models.
  - qwen3-32b: shared gpu-efa-runners pod had a leftover process holding port 8000 ("address already in use" -> warmup timeout). Move to a dedicated gpu-h100-8gpu-runners pod to avoid the collision.

  llama-3.3-70b stays at tp=8 (dense model, no block-quant constraint, needs the memory headroom).
* fix: use default SGLang port for GPU benchmarks and disable piecewise CUDA graph

  All gpu-h100-8gpu-runners benchmark jobs failed at server startup with '[Errno 98] address already in use' on port 8000; port 8000 is occupied on those pods. Remove the SGLANG_PORT=8000 override from the five GPU models so they use the SGLang default (30000), matching the x86 jobs that already pass.

  Also add --disable-piecewise-cuda-graph to qwen3-32b: it crashed during warmup_compile with 'FusedAddRMSNorm ... illegal memory access' while capturing the experimental piecewise CUDA graph (same workaround as llama-3.3-70b).

---------
Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
Co-authored-by: Jyothirmai Kottu <jkottu@amazon.com>
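
The tp=8 versus tp=4 failure in the commit above is plain divisibility arithmetic. The sketch below is not SGLang's actual check: shard_is_block_aligned is a hypothetical helper, and the unsharded size of 512 is inferred from the statement that tp=8 yields a per-rank output_size of 64.

```python
def shard_is_block_aligned(output_size, tp_size, block_n=128):
    """Return True when each tensor-parallel shard is a whole number of quantization blocks."""
    if output_size % tp_size != 0:
        return False
    per_rank = output_size // tp_size
    return per_rank % block_n == 0


if __name__ == "__main__":
    total = 512  # inferred: 64 per rank * 8 ranks
    for tp in (8, 4):
        print(f"tp={tp}: per-rank={total // tp}, aligned={shard_is_block_aligned(total, tp)}")
```

With these numbers, tp=8 leaves 64 columns per rank, which is smaller than one 128-wide block, while tp=4 leaves exactly 128.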
Root cause
nvidia-cutlass-dsl[cu13] has additive PyPI extras: installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | -libs-base | -libs-cu13 |
| --- | --- | --- |
| cutlass/_mlir/dialects/_gpu_ops_gen.py | super().__init__(self.build_generic(...)) (new-style single object) | super().__init__(OPERATION_NAME, REGIONS, ...) (old-style positional) |
| cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so | (operation: object) | |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

    TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute → cute.compile.

Empirical evidence
Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| .py from | .so from | gpu.GPUModuleOp(StringAttr, loc=loc) |
| --- | --- | --- |
| -libs-base | -libs-base | passes |
| -libs-cu13 | -libs-cu13 | passes |
| -libs-cu13 | -libs-base | TypeError |
| -libs-base | -libs-cu13 | passes |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix
After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state). The version is parsed from pyproject.toml to stay in sync. The step is skipped on non-CU13 runners (only -libs-base is installed there, so no conflict is possible).

Validation on devbox
1. TypeError fix: forced the bad state with UV_LINK_MODE=copy (matches CI), then ran force_reinstall_cutlass_dsl_libs_cu13. The smoke test went FAIL → PASS, and the .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox. Both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from #25743 (Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main).

Related PRs / supersedes
🤖 Generated with Claude Code
CI States
Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321