Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 by yhyang201 · Pull Request #23136 · sgl-project/sglang

yhyang201 · 2026-04-18T15:59:11Z

Motivation

CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync (8 params), but the code was using the CUDA 12.8 signature (9 params) via dlsym. This caused the stream argument to be misaligned — the runtime received a stack pointer as the CUDA stream handle, resulting in a segfault inside cuMemcpyBatchAsync_v2.

Fixed by using runtime driver_version detection to select the correct function signature, ensuring binary portability across CUDA 12.8 and 13.0 environments.

This may not be the optimal approach — the main intent of this PR is to identify the root cause of the segfault and provide a working fix.
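To make the mismatch concrete, here is a minimal sketch of dispatching an opaque symbol through two function-pointer types. It is not the code from this PR: `Stream`, `Attrs`, `Status`, the two fakes, and `call_batch_copy` are hypothetical stand-ins so the example builds without the CUDA toolkit, and the parameter lists only mirror the 9-parameter and 8-parameter shapes described above (const qualifiers are not modeled).

```cpp
#include <cassert>
#include <cstddef>

// Stand-in types so the sketch builds without the CUDA toolkit (assumptions).
using Stream = void*;  // models cudaStream_t
struct Attrs {};       // models cudaMemcpyAttributes
using Status = int;    // models cudaError_t; 0 means success

// 9-parameter form described for CUDA 12.8: failIdx sits before the stream.
using BatchCopyV12 = Status (*)(void**, void**, std::size_t*, std::size_t, Attrs*,
                                std::size_t*, std::size_t, std::size_t* fail_idx, Stream);
// 8-parameter form described for CUDA 13.0: failIdx is gone.
using BatchCopyV13 = Status (*)(void**, void**, std::size_t*, std::size_t, Attrs*,
                                std::size_t*, std::size_t, Stream);

// Hypothetical fakes that record which stream value they received.
static Stream g_seen_stream = nullptr;
Status fake_v12(void**, void**, std::size_t*, std::size_t, Attrs*, std::size_t*,
                std::size_t, std::size_t* fail_idx, Stream stream) {
  g_seen_stream = stream;
  return fail_idx != nullptr ? 0 : 1;
}
Status fake_v13(void**, void**, std::size_t*, std::size_t, Attrs*, std::size_t*,
                std::size_t, Stream stream) {
  g_seen_stream = stream;
  return 0;
}

// Hypothetical dispatcher: `symbol` plays the role of the pointer returned by
// dlsym, and `version` must describe the library that symbol came from.
Status call_batch_copy(void* symbol, int version, Stream stream) {
  assert(symbol != nullptr);
  void** none = nullptr;  // empty batch: no copies are described
  if (version >= 13000) {
    auto fn = reinterpret_cast<BatchCopyV13>(symbol);
    return fn(none, none, nullptr, 0, nullptr, nullptr, 0, stream);
  }
  std::size_t fail_idx = 0;
  auto fn = reinterpret_cast<BatchCopyV12>(symbol);
  return fn(none, none, nullptr, 0, nullptr, nullptr, 0, &fail_idx, stream);
}
```

When the version matches the symbol, the stream arrives in the right slot. Calling a 13.0-style symbol through `BatchCopyV12` (or the reverse) is undefined behavior, which is the failure mode described in this section.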

cc @alisonshao @Fridge003

Modifications

Accuracy Tests

Environment

| Item | Value |
| --- | --- |
| GPU | NVIDIA H200 x8 |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| PyTorch | 2.9.1+cu130 |
| nvcc | 13.0, V13.0.88 |

Result

All 192 tests in sgl-kernel/tests/test_kvcacheio.py pass. Previously crashed at test_transfer_kv_pf_direct (~37%).

Before (segfault at the first test_transfer_kv_pf_direct case, ~37%):

sgl-kernel/tests/test_kvcacheio.py::test_transfer_kv_pf_direct[False-False-20480-256-16-128-dtype0]
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemcpyBatchAsync_v2
  File "<unknown>", line 0, in cudaMemcpyBatchAsync
  File "<unknown>", line 0, in void transfer_kv_page_first_direct_impl<false>(...)

After (192/192 passed):

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0

sgl-kernel/tests/test_kvcacheio.py::test_transfer_kv[False-False-10240-256-1-1-dtype0] PASSED [  0%]
...
sgl-kernel/tests/test_kvcacheio.py::test_transfer_kv_pf_direct[False-False-20480-256-16-128-dtype0] PASSED [ 38%]  <-- previously segfaulted here
...
sgl-kernel/tests/test_kvcacheio.py::test_transfer_kv_page_head[True-4096-16-1024-128-1024-dtype1] PASSED [100%]

======================= 192 passed in 239.86s (0:03:59) ========================

gemini-code-assist

Code Review

This pull request introduces support for the updated cudaMemcpyBatchAsync signature in CUDA 13.0, which removed the failIdx parameter. The implementation uses conditional compilation to select the appropriate function signature and call site. Feedback indicates that using compile-time macros for runtime symbol loading via dlsym breaks binary portability between CUDA versions; instead, the runtime version should be checked. Additionally, a potential memory safety issue was identified where the attrs_idxs array size may not match the number of copies, leading to undefined behavior.

Replace compile-time #if CUDA_VERSION with runtime driver_version check to select the correct function signature. This ensures binary portability across CUDA 12.8 and 13.0 environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fridge003 · 2026-04-18T21:09:00Z

Thanks @yhyang201 !

Port PR sgl-project#23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx parameter in CUDA 13, so the dlsym-based call was passing the stream handle at the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2. Use driver_version at runtime to dispatch to either the CUDA 12 or CUDA 13 signature.

With the segfault fixed, move the 7 hicache tests that were parked under test/manual in PR sgl-project#23119 and subsequent cu13 flake sweeps back into test/registered so they run in CI again:

- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

TODO "move back after fixed" docstrings are stripped and the register_cuda_ci call that was dropped from the mooncake backend test on its way to manual is restored.

Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Kangyan-Zhou · 2026-04-19T18:27:34Z


+  // CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync.
+  // Use runtime version to select the correct signature for binary portability.
+  const bool use_v13_signature = driver_version >= 13000;


cudaMemcpyBatchAsync is a libcudart (runtime) symbol, so the ABI of the function dlsym'd into the process is owned by whichever libcudart is actually loaded — not by the host's kernel driver. cudaDriverGetVersion() reports the driver version, which in containerized setups routinely diverges from the runtime: a cu12 runtime (e.g. lmsysorg/sglang:dev, cu12.9) paired with a cu13-capable host driver is common. In that case driver_version = 13000 steers us to the 8-param v13 call, but the dlsym'd symbol is the 9-param v12 variant — the stream argument lands in a wrong slot and we segfault. Same class of crash this PR is trying to fix.

Reproduced on a cu13 host / cu12.9 container:

cudaDriverGetVersion()  = 13000
cudaRuntimeGetVersion() = 12090
call with v12 dispatch -> cudaSuccess, exit 0
call with v13 dispatch -> Segmentation fault (core dumped), exit 139

The existing cudaDriverGetVersion gate on the capability check (< 12080 -> fallback) is fine — that's the right knob for "is the driver new enough to support this at all". It's just the signature selection that needs to follow the runtime.

Suggested fix:

Suggested change

-  const bool use_v13_signature = driver_version >= 13000;
+  // CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync. The ABI
+  // of the dlsym'd symbol is determined by the libcudart loaded in this process,
+  // not the host driver: a cu12 runtime on a cu13 driver host (common in
+  // containers) still exposes the 9-param v12 signature. Dispatching on the
+  // driver version here would segfault in that case (verified empirically).
+  // Use cudaRuntimeGetVersion so the signature follows the runtime.
+  int runtime_version = 0;
+  cudaError_t runtime_version_err = cudaRuntimeGetVersion(&runtime_version);
+  if (runtime_version_err != cudaSuccess) {
+    fallback_to_page_copy();
+    return;
+  }
+  const bool use_v13_signature = runtime_version >= 13000;

FYI I've already applied this fix on the port of your PR in #23172 (3d3428e4f) if you want to cherry-pick. Thanks for the original fix!
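The split between the two version checks can be summarized as a small decision function. This is a sketch under stated assumptions, not the code in transfer.cu: `BatchCopyPath` and `select_batch_copy_path` are hypothetical names, the thresholds 12080 and 13000 are the ones quoted in this comment, and the case where the loaded runtime lacks the symbol entirely is out of scope.

```cpp
// Hypothetical enum naming the three outcomes discussed in the review.
enum class BatchCopyPath { Fallback, V12Signature, V13Signature };

// Hypothetical helper: the capability gate reads the driver version, while the
// signature choice reads the runtime (libcudart) version.
constexpr BatchCopyPath select_batch_copy_path(int driver_version, int runtime_version) {
  return driver_version < 12080     ? BatchCopyPath::Fallback       // driver too old
         : runtime_version >= 13000 ? BatchCopyPath::V13Signature   // 8-param call
                                    : BatchCopyPath::V12Signature;  // 9-param call
}

// The container case from the reproduction: cu13 host driver, cu12.9 runtime.
static_assert(select_batch_copy_path(13000, 12090) == BatchCopyPath::V12Signature,
              "a cu12 runtime must keep the 9-param signature");
```

Keeping the two inputs separate makes the container case explicit: a newer driver alone never selects the 8-parameter call.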

Addresses the review on sgl-project#23172: sgl-project#23172 (comment)

cudaMemcpyBatchAsync is a libcudart (runtime) symbol; the ABI of the function dlsym'd into this process is owned by the libcudart that's actually loaded, not by the host's kernel driver. Dispatching on cudaDriverGetVersion() breaks in the common container case where a cu12 runtime is paired with a cu13-capable host driver: driver=13000 steers us to the 8-param v13 call, but the symbol resolves to v12 (9 params with failIdx), so the stream argument lands in a wrong slot and we segfault, which is the exact crash this fix was supposed to prevent.

Reproduced on ion-user-9 with lmsysorg/sglang:dev (cu12.9 runtime):

cudaDriverGetVersion()  = 13000
cudaRuntimeGetVersion() = 12090
v12 dispatch of dlsym'd symbol: cudaSuccess, exit 0
v13 dispatch of dlsym'd symbol: Segmentation fault (core dumped)

Switching the signature-selection to cudaRuntimeGetVersion makes the choice follow the loaded libcudart, which is what actually determines the ABI. The existing cudaDriverGetVersion guard above is kept; it remains the right knob for the capability check since cudaMemcpyBatch requires a 12.8+ driver regardless of the runtime version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yhyang201 · 2026-04-20T04:09:53Z

~~Move to #23183~~

yhyang201 · 2026-04-20T04:32:03Z

/tag-and-rerun-ci

Addresses code-review feedback on the sibling PR sgl-project#23183: sgl-project#23183 (comment)

The runtime version is constant for the process lifetime, so cache the cudaRuntimeGetVersion result and the derived use_v13_signature as static locals (thread-safe static init in C++11+). Keeps the KV-transfer hot path free of a redundant runtime-API call per invocation.

Other diff in this commit is clang-format reflowing the v12/v13 dlsym call sites to the repo's column-limit style; no semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
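The caching idea in this commit message can be sketched with function-local statics. This is an illustration only: `query_runtime_version` is a hypothetical stand-in for cudaRuntimeGetVersion so the example runs without CUDA, the counter exists just to show how often the query happens, and the error-handling fallback from the real change is left out.

```cpp
#include <atomic>

// Counts how many times the (stand-in) version query actually runs.
static std::atomic<int> g_query_count{0};

// Hypothetical stand-in for cudaRuntimeGetVersion; always reports 13.0 here.
int query_runtime_version() {
  ++g_query_count;
  return 13000;
}

bool use_v13_signature() {
  // Function-local statics are initialized once, on the first call, and that
  // initialization is thread-safe since C++11.
  static const int runtime_version = query_runtime_version();
  static const bool use_v13 = runtime_version >= 13000;
  return use_v13;
}
```

Every call after the first reads the cached flag, so the hot path pays for the version query only once per process.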

Kangyan-Zhou · 2026-04-20T19:20:03Z

Tested in https://github.com/sgl-project/sglang/actions/runs/24640937123

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

fix: adapt cudaMemcpyBatchAsync for CUDA 13.0 API change

88e4d07

CUDA 13.0 removed the failIdx parameter from cudaMemcpyBatchAsync, causing a segfault due to argument mismatch when called via dlsym. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yhyang201 requested review from BBuf, FlamingoPg, HaiShaw, ispobock, merrymercy and yizhang2077 as code owners April 18, 2026 15:59

github-actions Bot added the sgl-kernel label Apr 18, 2026

gemini-code-assist Bot reviewed Apr 18, 2026

Comment thread sgl-kernel/csrc/kvcacheio/transfer.cu Outdated

Comment thread sgl-kernel/csrc/kvcacheio/transfer.cu Outdated

yhyang201 and others added 2 commits April 18, 2026 16:10

fix: preserve failIdx in error message for v12 path

3f274fd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into fix/cudaMemcpyBatchAsync-cuda13

83c1e78

Kangyan-Zhou mentioned this pull request Apr 19, 2026

Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI #23172

Closed

5 tasks

Kangyan-Zhou reviewed Apr 19, 2026

Kangyan-Zhou mentioned this pull request Apr 19, 2026

Fix CUDA 13 cudaMemcpyBatchAsync segfault and restore hicache CI #23183

Closed

5 tasks

yhyang201 closed this Apr 20, 2026

yhyang201 reopened this Apr 20, 2026

github-actions Bot added the run-ci label Apr 20, 2026

Fridge003 added the high priority label Apr 20, 2026

Fridge003 approved these changes Apr 20, 2026

Kangyan-Zhou added 2 commits April 19, 2026 23:45

Merge branch 'main' into fix/cudaMemcpyBatchAsync-cuda13

0e5e7a2

Merge branch 'main' into fix/cudaMemcpyBatchAsync-cuda13

fd91dc8

Kangyan-Zhou merged commit fe9b9b2 into sgl-project:main Apr 20, 2026
46 of 75 checks passed

zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026

Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 (sgl-project#23136)

fd3b358

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026

Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 (sgl-project#23136)

c07f11a

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

llc-kc mentioned this pull request May 9, 2026

fix(sgl-kernel): CUDA 13 cudaMemcpyBatchAsync API compatibility #22120

Open

4 tasks

Kangyan-Zhou merged 8 commits into sgl-project:main from yhyang201:fix/cudaMemcpyBatchAsync-cuda13

3 participants
