Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI#23183
Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI#23183Kangyan-Zhou wants to merge 7 commits intomainfrom
Conversation
Port PR #23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx parameter in CUDA 13, so the dlsym-based call was passing the stream handle at the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2. Use driver_version at runtime to dispatch to either the CUDA 12 or CUDA 13 signature. With the segfault fixed, move the 7 hicache tests that were parked under test/manual in PR #23119 and subsequent cu13 flake sweeps back into test/registered so they run in CI again: - hicache/test_hicache_storage.py - hicache/test_hicache_storage_3fs_backend.py - hicache/test_hicache_storage_file_backend.py - hicache/test_hicache_storage_mooncake_backend.py - hicache/test_hicache_storage_runtime_attach_detach.py - hicache/test_hicache_variants.py - 4-gpu-models/test_qwen35_hicache.py TODO "move back after fixed" docstrings are stripped and the register_cuda_ci call that was dropped from the mooncake backend test on its way to manual is restored. Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a USE_VENV=false runner had flashinfer-cubin installed
("already installed, keeping it"), `uv pip uninstall flashinfer-python`
left the flashinfer/data/ subdirectory behind (cubin files still owned
entries below it). The next `uv pip install -e python[dev,runai,tracing]`
then failed with:
error: Failed to install: flashinfer_python-0.6.7.post3-py3-none-any.whl
Caused by: failed to create directory
`/usr/local/lib/python3.10/dist-packages/flashinfer/data`: File exists
Seen on stage-a-test-1-gpu-small in
https://github.com/sgl-project/sglang/actions/runs/24634237642/job/72027123887
Two-layer fix:
1. ci_install_dependency.sh (in-flight safeguard): right after the
flashinfer uninstall step, if <site-packages>/flashinfer/ still
exists, rm -rf it and force flashinfer-cubin to reinstall.
`uv pip install -e python[...]` then resolves both flashinfer-python
and flashinfer-cubin (both declared in pyproject.toml) and repopulates
flashinfer/data/ cleanly. This makes the PR self-healing on its
first run without depending on a prior job's post-cleanup.
2. ci_cleanup_venv.sh (post-job hygiene): the USE_VENV=false arm used
to `exit 0` immediately. It now uninstalls the flashinfer trio and
purges residual flashinfer/, flashinfer_cubin/, flashinfer_jit_cache/
trees from system site-packages so the next job's runner starts
clean even if the in-flight safeguard ever regresses. Cached wheels
under ~/.cache/flashinfer-wheels/ keep the reinstall fast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the review on #23172: #23172 (comment) cudaMemcpyBatchAsync is a libcudart (runtime) symbol; the ABI of the function dlsym'd into this process is owned by the libcudart that's actually loaded, not by the host's kernel driver. Dispatching on cudaDriverGetVersion() breaks in the common container case where a cu12 runtime is paired with a cu13-capable host driver: driver=13000 steers us to the 8-param v13 call, but the symbol resolves to v12 (9 params with failIdx), so the stream argument lands in a wrong slot and we segfault — the exact crash this fix was supposed to prevent. Reproduced on ion-user-9 with lmsysorg/sglang:dev (cu12.9 runtime): cudaDriverGetVersion() = 13000 cudaRuntimeGetVersion() = 12090 v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0 v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped) Switching the signature-selection to cudaRuntimeGetVersion makes the choice follow the loaded libcudart, which is what actually determines the ABI. The existing cudaDriverGetVersion guard above is kept — it remains the right knob for the capability check since cudaMemcpyBatch requires a 12.8+ driver regardless of the runtime version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the main `uv pip install -e python[...]` step, runners that carried state from the pre-#23119 (cu129) era keep `nvidia-cuda-runtime-cu12` installed as an orphan (Required-by: empty) alongside the cu13 runtime. Its libcudart.so.12 sits under `nvidia/cuda_runtime/lib/` while cu13's lives under `nvidia/cu13/lib/`. Both dirs end up on LD_LIBRARY_PATH, so cudnn_frontend_shim.h's probe for lib in ["libcudart.so.12", "libcudart.so.13"]: dlopen(lib) loads both and throws: RuntimeError: Multiple libcudart libraries found: libcudart.so.12 and libcudart.so.13 Tests hit this during server setUpClass → CUDA graph capture (e.g. test_nvfp4_gemm_sm120.py on stage-b-test-1-gpu-small). The same failure reproduces on main, so this is not PR-specific — it's a leftover cleanup step the cu13 migration missed. Fix: uninstall nvidia-cuda-runtime-cu12 right after the main install. Its install dir is disjoint from cu13's so the uninstall doesn't touch any files shared with cu13 packages (a blunter sweep of all `nvidia-*-cu12` breaks torch because several pairs share dirs under `nvidia/<name>/lib/` and uninstalling one deletes files that the cu13 variant still references through its RECORD). Reproduced and verified on 5090-novita-ci-runner-d (runner-1 container): before: libcudart.so.12 + libcudart.so.13 both loadable after : only libcudart.so.13 loadable, torch.cuda.randn works Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request addresses CUDA 13 migration issues by improving CI environment cleanup and implementing dynamic dispatch for CUDA API calls. CI scripts were updated to purge stale flashinfer and legacy nvidia-cuda-runtime-cu12 packages, preventing installation failures and library conflicts. In the sgl-kernel, the code now detects the CUDA runtime version to correctly call cudaMemcpyBatchAsync, which had its signature changed in CUDA 13. These changes allow for the re-enabling of several tests that were previously failing. Feedback was provided to optimize the runtime version check by making it static to avoid redundant API calls in the hot path.
| int runtime_version = 0; | ||
| cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version); | ||
| if (runtime_version_err != cudaSuccess) { | ||
| fallback_to_page_copy(); | ||
| return; | ||
| } | ||
| const bool use_v13_signature = runtime_version >= 13000; |
There was a problem hiding this comment.
The CUDA runtime version check is performed on every call to transfer_kv_page_first_direct_impl. Since the runtime version is constant for the duration of the process, these variables should be declared static to avoid redundant API calls in the hot path of KV cache transfers.
static int runtime_version = 0;
static cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
if (runtime_version_err != cudaSuccess) {
fallback_to_page_copy();
return;
}
static const bool use_v13_signature = runtime_version >= 13000;
Two PR-local changes: 1. sgl-kernel/csrc/kvcacheio/transfer.cu: address code-review feedback (#23183 (comment)). The runtime version is constant for the process lifetime, so cache the cudaRuntimeGetVersion result and the derived use_v13_signature as static locals (thread-safe static init in C++11+). Keeps the KV-transfer hot path free of a redundant runtime-API call per invocation. 2. .github/workflows/pr-test.yml: local override so this branch exercises the restored hicache suite end-to-end on cu13 without stage-a fast- failing the rest of the run: - SKIP_STAGE_HEALTH_CHECK: hard-coded to 'true' - wait-for-stage-a / wait-for-stage-b: `if:` gated with `false &&` so every stage dispatches in parallel (mimics the scheduled-run path). Downstream stage jobs already accept `wait-for-stage-*.result == 'skipped'`, so nothing else needs to change. REVERT THESE WORKFLOW CHANGES BEFORE MERGE — they are only here to unblock validation on this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions expression language doesn't accept # comments inline;
the previous commit put them on the same line as `false &&`, which
made the whole workflow fail to load ('This run likely failed because
of a workflow file issue', no jobs dispatched). Move the override
context to YAML-level comments above each wait-for-stage block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/tag-and-rerun-ci |
Reframe the three defensive cleanups in ci_install_dependency.sh and ci_cleanup_venv.sh around the 'long-lived runner state' invariant, so future maintainers don't misread them as incident-specific workarounds and delete them. Content is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review feedback on the sibling PR sgl-project#23183: sgl-project#23183 (comment) The runtime version is constant for the process lifetime, so cache the cudaRuntimeGetVersion result and the derived use_v13_signature as static locals (thread-safe static init in C++11+). Keeps the KV-transfer hot path free of a redundant runtime-API call per invocation. Other diff in this commit is clang-format reflowing the v12/v13 dlsym call sites to the repo's column-limit style — no semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes #23172 (closed) — pushed to the upstream repo instead of a fork so that
test_parallel_dispatch=trueandskip_stage_health_check=truecan be set on the pr-test run.Motivation
Three related CI gaps are holding back the cu12 → cu13 migration landed in #23119. This PR fixes all three so the affected tests can run in CI on CUDA 13 without regressions.
1.
cudaMemcpyBatchAsyncsegfault on CUDA 13 (ports #23136 from @yhyang201)CUDA 13.0 removed the
failIdxparameter fromcudaMemcpyBatchAsync(9 params → 8). The dlsym path insgl-kernel/csrc/kvcacheio/transfer.cuwas hard-coded to the CUDA 12.8 signature, so on cu13 the stream argument landed in the wrong slot and the runtime segfaulted insidecuMemcpyBatchAsync_v2. Fix: dispatch between the v12 and v13 signatures at runtime.Importantly, the signature selection must follow the runtime (
cudaRuntimeGetVersion), not the driver (cudaDriverGetVersion). The ABI of the symbol is owned by thelibcudartactually loaded into the process — a cu12 runtime on a cu13-capable host driver (common in containers) still exposes the 9-param v12 variant, and dispatching on the driver would segfault in that case. Reproduced onlmsysorg/sglang:dev(cu12.9) on a cu13 host driver:2. Restore hicache tests to the CI-registered suite
PR #23119 moved seven hicache tests from
test/registered/totest/manual/because they segfaulted on cu13. With (1) fixed, they move back:hicache/test_hicache_storage.pyhicache/test_hicache_storage_3fs_backend.pyhicache/test_hicache_storage_file_backend.pyhicache/test_hicache_storage_mooncake_backend.py(also restores theregister_cuda_ci(est_time=236, suite="stage-b-test-2-gpu-large")call that was dropped on the way to manual)hicache/test_hicache_storage_runtime_attach_detach.pyhicache/test_hicache_variants.py4-gpu-models/test_qwen35_hicache.pyAll tests pass end-to-end on cu13 H200 with the fixed wheel (see validation section).
3. CI install/cleanup hygiene
Two install-path failures uncovered while rebuilding the PR's own CI:
flashinfer/data/EEXIST onuv pip installforUSE_VENV=falsejobs (stage-a-test-1-gpu-small):uv pip uninstall flashinfer-pythonleavesflashinfer/data/behind whenflashinfer-cubinis kept, and the next reinstall hitsFile exists (os error 17). Fix inci_install_dependency.shpurges the residual tree right after the uninstall and forces cubin to reinstall;ci_cleanup_venv.shadds a post-job sweep as a belt-and-braces safety net so the next job's runner also starts clean.Multiple libcudart libraries found: libcudart.so.12 and libcudart.so.13fromcudnn_frontend_shim.hon the SM120 (RTX 5090) runners. An orphannvidia-cuda-runtime-cu12wheel leftover from the pre-[CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13 #23119 cu129 era is still shippinglibcudart.so.12undernvidia/cuda_runtime/lib/next to cu13'snvidia/cu13/lib/libcudart.so.13; both end up onLD_LIBRARY_PATHand cudnn_frontend's dlopen probe throws. This is a pre-existing failure on main (same error on run24635819338commit32b7777f), but carrying the fix here lets CI on this PR go green. Fix:pip uninstall -y nvidia-cuda-runtime-cu12after the main install. A blunter sweep of allnvidia-*-cu12would break torch (several cu12/cu13 wheel pairs sharenvidia/<name>/lib/dirs and uninstalling one wipes files the other's RECORD still references); the cu12 cuda_runtime wheel's install dir is disjoint from cu13's so this is safe.Modifications
sgl-kernel/csrc/kvcacheio/transfer.cu: v12 vs v13cudaMemcpyBatchAsyncsignature dispatch usingcudaRuntimeGetVersion.test/registered/hicache/*andtest/registered/4-gpu-models/test_qwen35_hicache.py: restored fromtest/manual/.scripts/ci/cuda/ci_install_dependency.sh: residualflashinfer/tree purge + orphannvidia-cuda-runtime-cu12uninstall.scripts/ci/cuda/ci_cleanup_venv.sh: post-job flashinfer cleanup forUSE_VENV=falsejobs.Validation on cu13 H200 (ion-user-9,
lmsysorg/sglang:dev-cu13with this PR's wheels)test_hicache_storage.pytest_hicache_storage_runtime_attach_detach.pytest_hicache_storage_file_backend.TestHiCachetest_hicache_storage_file_backend.TestHiCacheStoragePageFirstLayouttest_hicache_storage_file_backend.TestHiCacheStoragePageFirstDirectIOcudaMemcpyBatchAsyncpathtest_hicache_storage_file_backend.TestHiCacheStorageAccuracytest_hicache_storage_file_backend.TestHiCacheStorageMLAtest_hicache_variants.TestHiCacheStandardtest_hicache_variants.TestHiCacheMLAtest_hicache_variants.TestHiCachePagetest_hicache_variants.TestHiCacheEagleSGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1)4-gpu-models/test_qwen35_hicache.py/data)Cross-shout to @yhyang201 — the original fix in #23136 is the foundation of this PR.
Checklist
test/manual/totest/registered/.)cc @Fridge003 @alisonshao