
Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI#23183

Closed
Kangyan-Zhou wants to merge 7 commits into main from cuda13_memcpy_hicache_restore

Conversation

Collaborator

@Kangyan-Zhou commented Apr 19, 2026

Supersedes #23172 (closed) — pushed to the upstream repo instead of a fork so that test_parallel_dispatch=true and skip_stage_health_check=true can be set on the pr-test run.

Motivation

Three related CI gaps are holding back the cu12 → cu13 migration landed in #23119. This PR fixes all three so the affected tests can run in CI on CUDA 13 without regressions.

1. cudaMemcpyBatchAsync segfault on CUDA 13 (ports #23136 from @yhyang201)

CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync (9 params → 8). The dlsym path in sgl-kernel/csrc/kvcacheio/transfer.cu was hard-coded to the CUDA 12.8 signature, so on cu13 the stream argument landed in the wrong slot and the runtime segfaulted inside cuMemcpyBatchAsync_v2. Fix: dispatch between the v12 and v13 signatures at runtime.

Importantly, the signature selection must follow the runtime (cudaRuntimeGetVersion), not the driver (cudaDriverGetVersion). The ABI of the symbol is owned by the libcudart actually loaded into the process — a cu12 runtime on a cu13-capable host driver (common in containers) still exposes the 9-param v12 variant, and dispatching on the driver would segfault in that case. Reproduced on lmsysorg/sglang:dev (cu12.9) on a cu13 host driver:

    cudaDriverGetVersion()  = 13000
    cudaRuntimeGetVersion() = 12090
    v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0
    v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped)
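
For illustration, a minimal sketch of that dispatch. The parameter lists are paraphrased from the 9-vs-8 counts above and the wrapper name `dispatch_memcpy_batch` is invented for this sketch; it is not the actual transfer.cu code.

```cpp
#include <cuda_runtime.h>
#include <dlfcn.h>

// CUDA 12.8 ABI: 9 params, trailing (failIdx, stream).
using BatchFnV12 = cudaError_t (*)(void**, void**, size_t*, size_t,
                                   cudaMemcpyAttributes*, size_t*, size_t,
                                   size_t*, cudaStream_t);
// CUDA 13.0 ABI: failIdx removed, stream moves up one slot (8 params).
using BatchFnV13 = cudaError_t (*)(void**, void**, size_t*, size_t,
                                   cudaMemcpyAttributes*, size_t*, size_t,
                                   cudaStream_t);

// Pick the signature owned by the libcudart actually loaded in-process.
cudaError_t dispatch_memcpy_batch(void** dsts, void** srcs, size_t* sizes,
                                  size_t n, cudaMemcpyAttributes* attrs,
                                  size_t* attr_idxs, size_t n_attrs,
                                  cudaStream_t stream) {
  void* sym = dlsym(RTLD_DEFAULT, "cudaMemcpyBatchAsync");
  if (sym == nullptr) return cudaErrorSymbolNotFound;
  int rt = 0;
  cudaError_t err = cudaRuntimeGetVersion(&rt);  // runtime, NOT driver
  if (err != cudaSuccess) return err;
  if (rt >= 13000) {
    return reinterpret_cast<BatchFnV13>(sym)(
        dsts, srcs, sizes, n, attrs, attr_idxs, n_attrs, stream);
  }
  size_t fail_idx = 0;  // the v12 variant still takes a failIdx out-param
  return reinterpret_cast<BatchFnV12>(sym)(
      dsts, srcs, sizes, n, attrs, attr_idxs, n_attrs, &fail_idx, stream);
}
```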

2. Restore hicache tests to the CI-registered suite

PR #23119 moved seven hicache tests from test/registered/ to test/manual/ because they segfaulted on cu13. With (1) fixed, they move back:

  • hicache/test_hicache_storage.py
  • hicache/test_hicache_storage_3fs_backend.py
  • hicache/test_hicache_storage_file_backend.py
  • hicache/test_hicache_storage_mooncake_backend.py (also restores the register_cuda_ci(est_time=236, suite="stage-b-test-2-gpu-large") call that was dropped on the way to manual)
  • hicache/test_hicache_storage_runtime_attach_detach.py
  • hicache/test_hicache_variants.py
  • 4-gpu-models/test_qwen35_hicache.py

All tests pass end-to-end on cu13 H200 with the fixed wheel (see validation section).

3. CI install/cleanup hygiene

Two install-path failures uncovered while rebuilding the PR's own CI:

  • flashinfer/data/ EEXIST on uv pip install for USE_VENV=false jobs (stage-a-test-1-gpu-small): uv pip uninstall flashinfer-python leaves flashinfer/data/ behind when flashinfer-cubin is kept, and the next reinstall hits File exists (os error 17). The fix in ci_install_dependency.sh purges the residual tree right after the uninstall and forces flashinfer-cubin to reinstall; ci_cleanup_venv.sh adds a post-job sweep as a belt-and-braces safety net so the next job's runner also starts clean.

  • Multiple libcudart libraries found: libcudart.so.12 and libcudart.so.13, raised from cudnn_frontend_shim.h on the SM120 (RTX 5090) runners. An orphan nvidia-cuda-runtime-cu12 wheel left over from the pre-#23119 cu129 era (before "[CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13") still ships libcudart.so.12 under nvidia/cuda_runtime/lib/ next to cu13's nvidia/cu13/lib/libcudart.so.13; both end up on LD_LIBRARY_PATH, and cudnn_frontend's dlopen probe throws (see the repro sketch after this list). This failure pre-exists on main (same error on run 24635819338, commit 32b7777f), but carrying the fix here lets CI on this PR go green. Fix: pip uninstall -y nvidia-cuda-runtime-cu12 after the main install. A blunter sweep of all nvidia-*-cu12 would break torch, because several cu12/cu13 wheel pairs share nvidia/<name>/lib/ dirs and uninstalling one wipes files the other's RECORD still references; the cu12 cuda_runtime wheel's install dir is disjoint from cu13's, so this targeted uninstall is safe.
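
For concreteness, a standalone C++ repro of the probe behavior the second bullet describes; it mirrors the cudnn_frontend_shim.h check from the outside and is not the shim's actual code (build with `g++ probe.cpp -ldl`):

```cpp
#include <dlfcn.h>
#include <cstdio>
#include <initializer_list>

int main() {
  int found = 0;
  // With both nvidia/cuda_runtime/lib/ and nvidia/cu13/lib/ on
  // LD_LIBRARY_PATH, both dlopen calls succeed and the probe trips.
  for (const char* lib : {"libcudart.so.12", "libcudart.so.13"}) {
    if (void* h = dlopen(lib, RTLD_LAZY)) {
      std::printf("loaded %s\n", lib);
      ++found;
      dlclose(h);
    }
  }
  if (found > 1) {
    std::fprintf(stderr, "Multiple libcudart libraries found\n");
    return 1;
  }
  return 0;
}
```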

Modifications

  • sgl-kernel/csrc/kvcacheio/transfer.cu: v12 vs v13 cudaMemcpyBatchAsync signature dispatch using cudaRuntimeGetVersion.
  • test/registered/hicache/* and test/registered/4-gpu-models/test_qwen35_hicache.py: restored from test/manual/.
  • scripts/ci/cuda/ci_install_dependency.sh: residual flashinfer/ tree purge + orphan nvidia-cuda-runtime-cu12 uninstall.
  • scripts/ci/cuda/ci_cleanup_venv.sh: post-job flashinfer cleanup for USE_VENV=false jobs.

Validation on cu13 H200 (ion-user-9, lmsysorg/sglang:dev-cu13 with this PR's wheels)

| Test | Status | Time / Notes |
|---|---|---|
| test_hicache_storage.py | pass | 150s (MMLU 0.734) |
| test_hicache_storage_runtime_attach_detach.py | pass | 158s |
| test_hicache_storage_file_backend.TestHiCache | pass | |
| test_hicache_storage_file_backend.TestHiCacheStoragePageFirstLayout | pass | |
| test_hicache_storage_file_backend.TestHiCacheStoragePageFirstDirectIO | pass | 116s; direct validation of the cudaMemcpyBatchAsync path |
| test_hicache_storage_file_backend.TestHiCacheStorageAccuracy | pass | acc diff 0.0000 |
| test_hicache_storage_file_backend.TestHiCacheStorageMLA | pass | 97s (standalone) |
| test_hicache_variants.TestHiCacheStandard | pass | MMLU 0.703 |
| test_hicache_variants.TestHiCacheMLA | pass | MMLU 0.578 |
| test_hicache_variants.TestHiCachePage | pass | MMLU 0.75 |
| test_hicache_variants.TestHiCacheEagle | pass | 111s (needs SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1) |
| 4-gpu-models/test_qwen35_hicache.py | pass | TP=4 on H200, storage on /data |

Shout-out to @yhyang201: the original fix in #23136 is the foundation of this PR.

Checklist

cc @Fridge003 @alisonshao

Kangyan-Zhou and others added 4 commits April 19, 2026 09:53
Port PR #23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx
parameter in CUDA 13, so the dlsym-based call was passing the stream
handle at the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2.
Use driver_version at runtime to dispatch to either the CUDA 12 or
CUDA 13 signature.

With the segfault fixed, move the 7 hicache tests that were parked
under test/manual in PR #23119 and subsequent cu13 flake sweeps back
into test/registered so they run in CI again:

- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

TODO "move back after fixed" docstrings are stripped and the
register_cuda_ci call that was dropped from the mooncake backend test
on its way to manual is restored.

Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a USE_VENV=false runner had flashinfer-cubin installed
("already installed, keeping it"), `uv pip uninstall flashinfer-python`
left the flashinfer/data/ subdirectory behind (cubin files still owned
entries below it). The next `uv pip install -e python[dev,runai,tracing]`
then failed with:

  error: Failed to install: flashinfer_python-0.6.7.post3-py3-none-any.whl
    Caused by: failed to create directory
    `/usr/local/lib/python3.10/dist-packages/flashinfer/data`: File exists

Seen on stage-a-test-1-gpu-small in
https://github.com/sgl-project/sglang/actions/runs/24634237642/job/72027123887

Two-layer fix:

1. ci_install_dependency.sh (in-flight safeguard): right after the
   flashinfer uninstall step, if <site-packages>/flashinfer/ still
   exists, rm -rf it and force flashinfer-cubin to reinstall.
   `uv pip install -e python[...]` then resolves both flashinfer-python
   and flashinfer-cubin (both declared in pyproject.toml) and repopulates
   flashinfer/data/ cleanly. This makes the PR self-healing on its
   first run without depending on a prior job's post-cleanup.

2. ci_cleanup_venv.sh (post-job hygiene): the USE_VENV=false arm used
   to `exit 0` immediately. It now uninstalls the flashinfer trio and
   purges residual flashinfer/, flashinfer_cubin/, flashinfer_jit_cache/
   trees from system site-packages so the next job's runner starts
   clean even if the in-flight safeguard ever regresses. Cached wheels
   under ~/.cache/flashinfer-wheels/ keep the reinstall fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the review on #23172:
#23172 (comment)

cudaMemcpyBatchAsync is a libcudart (runtime) symbol; the ABI of the
function dlsym'd into this process is owned by the libcudart that's
actually loaded, not by the host's kernel driver. Dispatching on
cudaDriverGetVersion() breaks in the common container case where a
cu12 runtime is paired with a cu13-capable host driver: driver=13000
steers us to the 8-param v13 call, but the symbol resolves to v12
(9 params with failIdx), so the stream argument lands in the wrong slot
and we segfault — the exact crash this fix was supposed to prevent.

Reproduced on ion-user-9 with lmsysorg/sglang:dev (cu12.9 runtime):

    cudaDriverGetVersion()  = 13000
    cudaRuntimeGetVersion() = 12090
    v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0
    v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped)

Switching the signature-selection to cudaRuntimeGetVersion makes the
choice follow the loaded libcudart, which is what actually determines
the ABI. The existing cudaDriverGetVersion guard above is kept — it
remains the right knob for the capability check since cudaMemcpyBatch
requires a 12.8+ driver regardless of the runtime version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
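
A small sketch separating the two checks this commit distinguishes (the helper name is invented for the sketch; 12080 encodes 12.8 via CUDA's major*1000 + minor*10 version scheme):

```cpp
#include <cuda_runtime.h>

// Capability vs. ABI: two different questions, two different version APIs.
bool select_v13_signature(bool* batch_supported) {
  int drv = 0, rt = 0;
  cudaDriverGetVersion(&drv);  // capability gate: batch memcpy needs a 12.8+ driver
  cudaRuntimeGetVersion(&rt);  // ABI gate: which variant the loaded libcudart exports
  *batch_supported = drv >= 12080;
  return rt >= 13000;          // CUDA 13 dropped the failIdx parameter
}
```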
After the main `uv pip install -e python[...]` step, runners that carried
state from the pre-#23119 (cu129) era keep `nvidia-cuda-runtime-cu12`
installed as an orphan (Required-by: empty) alongside the cu13 runtime.
Its libcudart.so.12 sits under `nvidia/cuda_runtime/lib/` while cu13's
lives under `nvidia/cu13/lib/`. Both dirs end up on LD_LIBRARY_PATH, so
cudnn_frontend_shim.h's probe

    for lib in ["libcudart.so.12", "libcudart.so.13"]:
        dlopen(lib)

loads both and throws:

    RuntimeError: Multiple libcudart libraries found:
    libcudart.so.12 and libcudart.so.13

Tests hit this during server setUpClass → CUDA graph capture (e.g.
test_nvfp4_gemm_sm120.py on stage-b-test-1-gpu-small). The same failure
reproduces on main, so this is not PR-specific — it's a leftover cleanup
step the cu13 migration missed.

Fix: uninstall nvidia-cuda-runtime-cu12 right after the main install.
Its install dir is disjoint from cu13's so the uninstall doesn't touch
any files shared with cu13 packages (a blunter sweep of all
`nvidia-*-cu12` breaks torch because several pairs share dirs under
`nvidia/<name>/lib/` and uninstalling one deletes files that the cu13
variant still references through its RECORD).

Reproduced and verified on 5090-novita-ci-runner-d (runner-1 container):

    before: libcudart.so.12 + libcudart.so.13 both loadable
    after : only libcudart.so.13 loadable, torch.cuda.randn works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request addresses CUDA 13 migration issues by improving CI environment cleanup and implementing dynamic dispatch for CUDA API calls. CI scripts were updated to purge stale flashinfer and legacy nvidia-cuda-runtime-cu12 packages, preventing installation failures and library conflicts. In the sgl-kernel, the code now detects the CUDA runtime version to correctly call cudaMemcpyBatchAsync, which had its signature changed in CUDA 13. These changes allow for the re-enabling of several tests that were previously failing. Feedback was provided to optimize the runtime version check by making it static to avoid redundant API calls in the hot path.

Comment thread on sgl-kernel/csrc/kvcacheio/transfer.cu, lines +821 to +827 (outdated):
    int runtime_version = 0;
    cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
    if (runtime_version_err != cudaSuccess) {
      fallback_to_page_copy();
      return;
    }
    const bool use_v13_signature = runtime_version >= 13000;

Severity: medium

The CUDA runtime version check is performed on every call to transfer_kv_page_first_direct_impl. Since the runtime version is constant for the duration of the process, these variables should be declared static to avoid redundant API calls in the hot path of KV cache transfers.

  static int runtime_version = 0;
  static cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
  if (runtime_version_err != cudaSuccess) {
    fallback_to_page_copy();
    return;
  }
  static const bool use_v13_signature = runtime_version >= 13000;

Two PR-local changes:

1. sgl-kernel/csrc/kvcacheio/transfer.cu: address code-review feedback
   (#23183 (comment)).
   The runtime version is constant for the process lifetime, so cache the
   cudaRuntimeGetVersion result and the derived use_v13_signature as
   static locals (thread-safe static init in C++11+). Keeps the KV-transfer
   hot path free of a redundant runtime-API call per invocation.

2. .github/workflows/pr-test.yml: local override so this branch exercises
   the restored hicache suite end-to-end on cu13 without stage-a fast-
   failing the rest of the run:

   - SKIP_STAGE_HEALTH_CHECK: hard-coded to 'true'
   - wait-for-stage-a / wait-for-stage-b: `if:` gated with `false &&` so
     every stage dispatches in parallel (mimics the scheduled-run path).
     Downstream stage jobs already accept `wait-for-stage-*.result ==
     'skipped'`, so nothing else needs to change.

   REVERT THESE WORKFLOW CHANGES BEFORE MERGE — they are only here to
   unblock validation on this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
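
One possible shape for the static-local caching that item 1 of this commit message describes, assuming the fallback_to_page_copy() helper from the review snippet above; a sketch, not the literal transfer.cu diff:

```cpp
#include <cuda_runtime.h>
#include <utility>

void transfer_hot_path_sketch() {
  // C++11 guarantees this initializer runs exactly once, thread-safely,
  // so the version probe drops out of the per-call hot path.
  static const std::pair<cudaError_t, int> rt_probe = [] {
    int v = 0;
    cudaError_t err = cudaRuntimeGetVersion(&v);
    return std::make_pair(err, v);
  }();
  if (rt_probe.first != cudaSuccess) {
    // fallback_to_page_copy();  // as in the review snippet
    return;
  }
  const bool use_v13_signature = rt_probe.second >= 13000;
  (void)use_v13_signature;  // signature selection proceeds as in section 1
}
```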
GitHub Actions expression language doesn't accept # comments inline;
the previous commit put them on the same line as `false &&`, which
made the whole workflow fail to load ('This run likely failed because
of a workflow file issue', no jobs dispatched). Move the override
context to YAML-level comments above each wait-for-stage block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou
Collaborator Author

/tag-and-rerun-ci

Reframe the three defensive cleanups in ci_install_dependency.sh and
ci_cleanup_venv.sh around the 'long-lived runner state' invariant, so
future maintainers don't misread them as incident-specific workarounds
and delete them. Content is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kangyan-Zhou added a commit to yhyang201/sglang that referenced this pull request Apr 20, 2026
Addresses code-review feedback on the sibling PR sgl-project#23183:
sgl-project#23183 (comment)

The runtime version is constant for the process lifetime, so cache the
cudaRuntimeGetVersion result and the derived use_v13_signature as
static locals (thread-safe static init in C++11+). Keeps the
KV-transfer hot path free of a redundant runtime-API call per
invocation.

Other diff in this commit is clang-format reflowing the v12/v13 dlsym
call sites to the repo's column-limit style — no semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

hicache (Hierarchical Caching for SGLang), run-ci, sgl-kernel
