
Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI#23183

Closed
Kangyan-Zhou wants to merge 7 commits into main from cuda13_memcpy_hicache_restore

Conversation

Collaborator

@Kangyan-Zhou commented Apr 19, 2026

Supersedes #23172 (closed) — pushed to the upstream repo instead of a fork so that test_parallel_dispatch=true and skip_stage_health_check=true can be set on the pr-test run.

Motivation

Three related CI gaps are holding back the cu12 → cu13 migration landed in #23119. This PR fixes all three so the affected tests can run in CI on CUDA 13 without regressions.

1. cudaMemcpyBatchAsync segfault on CUDA 13 (ports #23136 from @yhyang201)

CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync (9 params → 8). The dlsym path in sgl-kernel/csrc/kvcacheio/transfer.cu was hard-coded to the CUDA 12.8 signature, so on cu13 the stream argument landed in the wrong slot and the runtime segfaulted inside cuMemcpyBatchAsync_v2. Fix: dispatch between the v12 and v13 signatures at runtime.

Importantly, the signature selection must follow the runtime (cudaRuntimeGetVersion), not the driver (cudaDriverGetVersion). The ABI of the symbol is owned by the libcudart actually loaded into the process — a cu12 runtime on a cu13-capable host driver (common in containers) still exposes the 9-param v12 variant, and dispatching on the driver would segfault in that case. Reproduced on lmsysorg/sglang:dev (cu12.9) on a cu13 host driver:

    cudaDriverGetVersion()  = 13000
    cudaRuntimeGetVersion() = 12090
    v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0
    v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped)
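
For illustration, a minimal sketch of that dispatch. The parameter lists are paraphrased from the 9-vs-8 counts above and the wrapper name `dispatch_memcpy_batch` is invented for this sketch; it is not the actual transfer.cu code.

```cpp
#include <cuda_runtime.h>
#include <dlfcn.h>

// CUDA 12.8 ABI: 9 params, trailing (failIdx, stream).
using BatchFnV12 = cudaError_t (*)(void**, void**, size_t*, size_t,
                                   cudaMemcpyAttributes*, size_t*, size_t,
                                   size_t*, cudaStream_t);
// CUDA 13.0 ABI: failIdx removed, stream moves up one slot (8 params).
using BatchFnV13 = cudaError_t (*)(void**, void**, size_t*, size_t,
                                   cudaMemcpyAttributes*, size_t*, size_t,
                                   cudaStream_t);

// Pick the signature owned by the libcudart actually loaded in-process.
cudaError_t dispatch_memcpy_batch(void** dsts, void** srcs, size_t* sizes,
                                  size_t n, cudaMemcpyAttributes* attrs,
                                  size_t* attr_idxs, size_t n_attrs,
                                  cudaStream_t stream) {
  void* sym = dlsym(RTLD_DEFAULT, "cudaMemcpyBatchAsync");
  if (sym == nullptr) return cudaErrorSymbolNotFound;
  int rt = 0;
  cudaError_t err = cudaRuntimeGetVersion(&rt);  // runtime, NOT driver
  if (err != cudaSuccess) return err;
  if (rt >= 13000) {
    return reinterpret_cast<BatchFnV13>(sym)(
        dsts, srcs, sizes, n, attrs, attr_idxs, n_attrs, stream);
  }
  size_t fail_idx = 0;  // the v12 variant still takes a failIdx out-param
  return reinterpret_cast<BatchFnV12>(sym)(
      dsts, srcs, sizes, n, attrs, attr_idxs, n_attrs, &fail_idx, stream);
}
```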

2. Restore hicache tests to the CI-registered suite

PR #23119 moved seven hicache tests from test/registered/ to test/manual/ because they segfaulted on cu13. With (1) fixed, they move back:

  • hicache/test_hicache_storage.py
  • hicache/test_hicache_storage_3fs_backend.py
  • hicache/test_hicache_storage_file_backend.py
  • hicache/test_hicache_storage_mooncake_backend.py (also restores the register_cuda_ci(est_time=236, suite="stage-b-test-2-gpu-large") call that was dropped on the way to manual)
  • hicache/test_hicache_storage_runtime_attach_detach.py
  • hicache/test_hicache_variants.py
  • 4-gpu-models/test_qwen35_hicache.py

All tests pass end-to-end on cu13 H200 with the fixed wheel (see validation section).

3. CI install/cleanup hygiene

Two install-path failures uncovered while rebuilding the PR's own CI:

  • flashinfer/data/ EEXIST on uv pip install for USE_VENV=false jobs (stage-a-test-1-gpu-small): uv pip uninstall flashinfer-python leaves flashinfer/data/ behind when flashinfer-cubin is kept, and the next reinstall hits File exists (os error 17). The fix in ci_install_dependency.sh purges the residual tree right after the uninstall and forces flashinfer-cubin to reinstall; ci_cleanup_venv.sh adds a post-job sweep as a belt-and-braces safety net so the next job's runner also starts clean.

  • Multiple libcudart libraries found: libcudart.so.12 and libcudart.so.13, raised from cudnn_frontend_shim.h on the SM120 (RTX 5090) runners. An orphan nvidia-cuda-runtime-cu12 wheel left over from the pre-#23119 cu129 era (before "[CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13") still ships libcudart.so.12 under nvidia/cuda_runtime/lib/ next to cu13's nvidia/cu13/lib/libcudart.so.13; both end up on LD_LIBRARY_PATH, and cudnn_frontend's dlopen probe throws (see the repro sketch after this list). This failure pre-exists on main (same error on run 24635819338, commit 32b7777f), but carrying the fix here lets CI on this PR go green. Fix: pip uninstall -y nvidia-cuda-runtime-cu12 after the main install. A blunter sweep of all nvidia-*-cu12 would break torch, because several cu12/cu13 wheel pairs share nvidia/<name>/lib/ dirs and uninstalling one wipes files the other's RECORD still references; the cu12 cuda_runtime wheel's install dir is disjoint from cu13's, so this targeted uninstall is safe.
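
For concreteness, a standalone C++ repro of the probe behavior the second bullet describes; it mirrors the cudnn_frontend_shim.h check from the outside and is not the shim's actual code (build with `g++ probe.cpp -ldl`):

```cpp
#include <dlfcn.h>
#include <cstdio>
#include <initializer_list>

int main() {
  int found = 0;
  // With both nvidia/cuda_runtime/lib/ and nvidia/cu13/lib/ on
  // LD_LIBRARY_PATH, both dlopen calls succeed and the probe trips.
  for (const char* lib : {"libcudart.so.12", "libcudart.so.13"}) {
    if (void* h = dlopen(lib, RTLD_LAZY)) {
      std::printf("loaded %s\n", lib);
      ++found;
      dlclose(h);
    }
  }
  if (found > 1) {
    std::fprintf(stderr, "Multiple libcudart libraries found\n");
    return 1;
  }
  return 0;
}
```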

Modifications

  • sgl-kernel/csrc/kvcacheio/transfer.cu: v12 vs v13 cudaMemcpyBatchAsync signature dispatch using cudaRuntimeGetVersion.
  • test/registered/hicache/* and test/registered/4-gpu-models/test_qwen35_hicache.py: restored from test/manual/.
  • scripts/ci/cuda/ci_install_dependency.sh: residual flashinfer/ tree purge + orphan nvidia-cuda-runtime-cu12 uninstall.
  • scripts/ci/cuda/ci_cleanup_venv.sh: post-job flashinfer cleanup for USE_VENV=false jobs.

Validation on cu13 H200 (ion-user-9, lmsysorg/sglang:dev-cu13 with this PR's wheels)

| Test | Status | Time / Notes |
|---|---|---|
| test_hicache_storage.py | pass | 150s (MMLU 0.734) |
| test_hicache_storage_runtime_attach_detach.py | pass | 158s |
| test_hicache_storage_file_backend.TestHiCache | pass | |
| test_hicache_storage_file_backend.TestHiCacheStoragePageFirstLayout | pass | |
| test_hicache_storage_file_backend.TestHiCacheStoragePageFirstDirectIO | pass | 116s; direct validation of the cudaMemcpyBatchAsync path |
| test_hicache_storage_file_backend.TestHiCacheStorageAccuracy | pass | acc diff 0.0000 |
| test_hicache_storage_file_backend.TestHiCacheStorageMLA | pass | 97s (standalone) |
| test_hicache_variants.TestHiCacheStandard | pass | MMLU 0.703 |
| test_hicache_variants.TestHiCacheMLA | pass | MMLU 0.578 |
| test_hicache_variants.TestHiCachePage | pass | MMLU 0.75 |
| test_hicache_variants.TestHiCacheEagle | pass | 111s (needs SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1) |
| 4-gpu-models/test_qwen35_hicache.py | pass | TP=4 on H200, storage on /data |

Shout-out to @yhyang201: the original fix in #23136 is the foundation of this PR.

Checklist

cc @Fridge003 @alisonshao

Kangyan-Zhou and others added 4 commits April 19, 2026 09:53
Port PR #23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx
parameter in CUDA 13, so the dlsym-based call was passing the stream
handle at the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2.
Use driver_version at runtime to dispatch to either the CUDA 12 or
CUDA 13 signature.

With the segfault fixed, move the 7 hicache tests that were parked
under test/manual in PR #23119 and subsequent cu13 flake sweeps back
into test/registered so they run in CI again:

- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

TODO "move back after fixed" docstrings are stripped and the
register_cuda_ci call that was dropped from the mooncake backend test
on its way to manual is restored.

Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a USE_VENV=false runner had flashinfer-cubin installed
("already installed, keeping it"), `uv pip uninstall flashinfer-python`
left the flashinfer/data/ subdirectory behind (cubin files still owned
entries below it). The next `uv pip install -e python[dev,runai,tracing]`
then failed with:

  error: Failed to install: flashinfer_python-0.6.7.post3-py3-none-any.whl
    Caused by: failed to create directory
    `/usr/local/lib/python3.10/dist-packages/flashinfer/data`: File exists

Seen on stage-a-test-1-gpu-small in
https://github.com/sgl-project/sglang/actions/runs/24634237642/job/72027123887

Two-layer fix:

1. ci_install_dependency.sh (in-flight safeguard): right after the
   flashinfer uninstall step, if <site-packages>/flashinfer/ still
   exists, rm -rf it and force flashinfer-cubin to reinstall.
   `uv pip install -e python[...]` then resolves both flashinfer-python
   and flashinfer-cubin (both declared in pyproject.toml) and repopulates
   flashinfer/data/ cleanly. This makes the PR self-healing on its
   first run without depending on a prior job's post-cleanup.

2. ci_cleanup_venv.sh (post-job hygiene): the USE_VENV=false arm used
   to `exit 0` immediately. It now uninstalls the flashinfer trio and
   purges residual flashinfer/, flashinfer_cubin/, flashinfer_jit_cache/
   trees from system site-packages so the next job's runner starts
   clean even if the in-flight safeguard ever regresses. Cached wheels
   under ~/.cache/flashinfer-wheels/ keep the reinstall fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the review on #23172:
#23172 (comment)

cudaMemcpyBatchAsync is a libcudart (runtime) symbol; the ABI of the
function dlsym'd into this process is owned by the libcudart that's
actually loaded, not by the host's kernel driver. Dispatching on
cudaDriverGetVersion() breaks in the common container case where a
cu12 runtime is paired with a cu13-capable host driver: driver=13000
steers us to the 8-param v13 call, but the symbol resolves to v12
(9 params with failIdx), so the stream argument lands in the wrong slot
and we segfault — the exact crash this fix was supposed to prevent.

Reproduced on ion-user-9 with lmsysorg/sglang:dev (cu12.9 runtime):

    cudaDriverGetVersion()  = 13000
    cudaRuntimeGetVersion() = 12090
    v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0
    v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped)

Switching the signature-selection to cudaRuntimeGetVersion makes the
choice follow the loaded libcudart, which is what actually determines
the ABI. The existing cudaDriverGetVersion guard above is kept — it
remains the right knob for the capability check since cudaMemcpyBatch
requires a 12.8+ driver regardless of the runtime version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
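
A small sketch separating the two checks this commit distinguishes (the helper name is invented for the sketch; 12080 encodes 12.8 via CUDA's major*1000 + minor*10 version scheme):

```cpp
#include <cuda_runtime.h>

// Capability vs. ABI: two different questions, two different version APIs.
bool select_v13_signature(bool* batch_supported) {
  int drv = 0, rt = 0;
  cudaDriverGetVersion(&drv);  // capability gate: batch memcpy needs a 12.8+ driver
  cudaRuntimeGetVersion(&rt);  // ABI gate: which variant the loaded libcudart exports
  *batch_supported = drv >= 12080;
  return rt >= 13000;          // CUDA 13 dropped the failIdx parameter
}
```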
After the main `uv pip install -e python[...]` step, runners that carried
state from the pre-#23119 (cu129) era keep `nvidia-cuda-runtime-cu12`
installed as an orphan (Required-by: empty) alongside the cu13 runtime.
Its libcudart.so.12 sits under `nvidia/cuda_runtime/lib/` while cu13's
lives under `nvidia/cu13/lib/`. Both dirs end up on LD_LIBRARY_PATH, so
cudnn_frontend_shim.h's probe

    for lib in ["libcudart.so.12", "libcudart.so.13"]:
        dlopen(lib)

loads both and throws:

    RuntimeError: Multiple libcudart libraries found:
    libcudart.so.12 and libcudart.so.13

Tests hit this during server setUpClass → CUDA graph capture (e.g.
test_nvfp4_gemm_sm120.py on stage-b-test-1-gpu-small). The same failure
reproduces on main, so this is not PR-specific — it's a leftover cleanup
step the cu13 migration missed.

Fix: uninstall nvidia-cuda-runtime-cu12 right after the main install.
Its install dir is disjoint from cu13's so the uninstall doesn't touch
any files shared with cu13 packages (a blunter sweep of all
`nvidia-*-cu12` breaks torch because several pairs share dirs under
`nvidia/<name>/lib/` and uninstalling one deletes files that the cu13
variant still references through its RECORD).

Reproduced and verified on 5090-novita-ci-runner-d (runner-1 container):

    before: libcudart.so.12 + libcudart.so.13 both loadable
    after : only libcudart.so.13 loadable, torch.cuda.randn works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request addresses CUDA 13 migration issues by improving CI environment cleanup and implementing dynamic dispatch for CUDA API calls. CI scripts were updated to purge stale flashinfer and legacy nvidia-cuda-runtime-cu12 packages, preventing installation failures and library conflicts. In the sgl-kernel, the code now detects the CUDA runtime version to correctly call cudaMemcpyBatchAsync, which had its signature changed in CUDA 13. These changes allow for the re-enabling of several tests that were previously failing. Feedback was provided to optimize the runtime version check by making it static to avoid redundant API calls in the hot path.

Comment thread on sgl-kernel/csrc/kvcacheio/transfer.cu, lines +821 to +827 (outdated):
    int runtime_version = 0;
    cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
    if (runtime_version_err != cudaSuccess) {
      fallback_to_page_copy();
      return;
    }
    const bool use_v13_signature = runtime_version >= 13000;

Severity: medium

The CUDA runtime version check is performed on every call to transfer_kv_page_first_direct_impl. Since the runtime version is constant for the duration of the process, these variables should be declared static to avoid redundant API calls in the hot path of KV cache transfers.

  static int runtime_version = 0;
  static cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
  if (runtime_version_err != cudaSuccess) {
    fallback_to_page_copy();
    return;
  }
  static const bool use_v13_signature = runtime_version >= 13000;

Two PR-local changes:

1. sgl-kernel/csrc/kvcacheio/transfer.cu: address code-review feedback
   (#23183 (comment)).
   The runtime version is constant for the process lifetime, so cache the
   cudaRuntimeGetVersion result and the derived use_v13_signature as
   static locals (thread-safe static init in C++11+). Keeps the KV-transfer
   hot path free of a redundant runtime-API call per invocation.

2. .github/workflows/pr-test.yml: local override so this branch exercises
   the restored hicache suite end-to-end on cu13 without stage-a fast-
   failing the rest of the run:

   - SKIP_STAGE_HEALTH_CHECK: hard-coded to 'true'
   - wait-for-stage-a / wait-for-stage-b: `if:` gated with `false &&` so
     every stage dispatches in parallel (mimics the scheduled-run path).
     Downstream stage jobs already accept `wait-for-stage-*.result ==
     'skipped'`, so nothing else needs to change.

   REVERT THESE WORKFLOW CHANGES BEFORE MERGE — they are only here to
   unblock validation on this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
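
One possible shape for the static-local caching that item 1 of this commit message describes, assuming the fallback_to_page_copy() helper from the review snippet above; a sketch, not the literal transfer.cu diff:

```cpp
#include <cuda_runtime.h>
#include <utility>

void transfer_hot_path_sketch() {
  // C++11 guarantees this initializer runs exactly once, thread-safely,
  // so the version probe drops out of the per-call hot path.
  static const std::pair<cudaError_t, int> rt_probe = [] {
    int v = 0;
    cudaError_t err = cudaRuntimeGetVersion(&v);
    return std::make_pair(err, v);
  }();
  if (rt_probe.first != cudaSuccess) {
    // fallback_to_page_copy();  // as in the review snippet
    return;
  }
  const bool use_v13_signature = rt_probe.second >= 13000;
  (void)use_v13_signature;  // signature selection proceeds as in section 1
}
```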
GitHub Actions expression language doesn't accept # comments inline;
the previous commit put them on the same line as `false &&`, which
made the whole workflow fail to load ('This run likely failed because
of a workflow file issue', no jobs dispatched). Move the override
context to YAML-level comments above each wait-for-stage block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou
Collaborator Author

/tag-and-rerun-ci

Reframe the three defensive cleanups in ci_install_dependency.sh and
ci_cleanup_venv.sh around the 'long-lived runner state' invariant, so
future maintainers don't misread them as incident-specific workarounds
and delete them. Content is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kangyan-Zhou added a commit to yhyang201/sglang that referenced this pull request Apr 20, 2026
Addresses code-review feedback on the sibling PR sgl-project#23183:
sgl-project#23183 (comment)

The runtime version is constant for the process lifetime, so cache the
cudaRuntimeGetVersion result and the derived use_v13_signature as
static locals (thread-safe static init in C++11+). Keeps the
KV-transfer hot path free of a redundant runtime-API call per
invocation.

Other diff in this commit is clang-format reflowing the v12/v13 dlsym
call sites to the repo's column-limit style — no semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

hicache (Hierarchical Caching for SGLang), run-ci, sgl-kernel
