Conversation
Summary of Changes
Hello @ByronHsu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enables the use of custom allreduce operations within piecewise CUDA graphs in SGLang. This change addresses previous compatibility issues that required disabling custom allreduce during graph capture, leading to a more unified and potentially more efficient execution flow for distributed models. The modifications simplify the codebase by removing explicit context managers for allreduce communication and refine the CUDA graph capture process to properly account for these operations.
Code Review
This pull request refactors the handling of custom allreduce communication within the piecewise CUDA graph capture mechanism. The previous temporary context managers (disable_ca_comm, use_original_ca_comm) have been removed and replaced with a more integrated approach using nullcontext and ca_comm.capture(). This simplifies the code and improves maintainability. Additionally, an unnecessary warmup run during CUDA graph capture has been removed, enhancing efficiency. The changes also include removing unused imports and adjusting the mm_utils.py file to reflect the removal of get_tp_group and use_original_ca_comm.
CUDA CoreDump Repro
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
CUDA_COREDUMP_SHOW_PROGRESS=1 \
CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' \
CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t" CUDA_LAUNCH_BLOCKING=1 python -m sglang.launch_server --model Qwen/Qwen2.5-VL-7B-Instruct --tp 4 --enable-piecewise-cuda-graph --disable-radix-cache
import requests

url = "http://127.0.0.1:30000/generate"
seq_lens = [1025] * 2
for seq_len in seq_lens:
    data = {
        "input_ids": [0] * seq_len,
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
        },
    }
    response = requests.post(url, json=data)
    print(response.json())
[2025-12-09 18:59:27 TP0] Prefill batch, #new-seq: 1, #new-token: 1025, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[18:59:28.674952] coredump: Starting GPU coredump generation
[18:59:28.706925] coredump: Starting GPU coredump generation
[18:59:28.707040] coredump: SM 1/132 is not used by any context
[18:59:28.707046] coredump: SM 2/132 is not used by any context
[18:59:28.707049] coredump: SM 3/132 is not used by any context
[18:59:28.707053] coredump: SM 4/132 is not used by any context
[18:59:28.707057] coredump: SM 5/132 is not used by any context
[18:59:28.707061] coredump: SM 6/132 is not used by any context
[18:59:28.707065] coredump: SM 7/132 is not used by any context
[18:59:28.707068] coredump: SM 8/132 is not used by any context
[18:59:28.707072] coredump: SM 9/132 is not used by any context
[18:59:28.707075] coredump: SM 10/132 is not used by any context
[18:59:28.707079] coredump: SM 11/132 is not used by any context
[18:59:28.707083] coredump: SM 12/132 is not used by any context
[18:59:28.707087] coredump: SM 13/132 is not used by any context
[18:59:28.707090] coredump: SM 14/132 is not used by any context
[18:59:28.707094] coredump: SM 15/132 is not used by any context
[18:59:28.707098] coredump: SM 16/132 is not used by any context
[18:59:28.707102] coredump: SM 17/132 is not used by any context
[18:59:28.707105] coredump: SM 18/132 is not used by any context
[18:59:28.707109] coredump: SM 19/132 is not used by any context
[18:59:28.707112] coredump: SM 20/132 is not used by any context
[18:59:28.707116] coredump: SM 21/132 is not used by any context
[18:59:28.707120] coredump: SM 22/132 is not used by any context
[18:59:28.707123] coredump: SM 23/132 is not used by any context
[18:59:28.707127] coredump: SM 24/132 is not used by any context
[18:59:28.707131] coredump: SM 25/132 is not used by any context
[18:59:28.707135] coredump: SM 26/132 is not used by any context
[18:59:28.707138] coredump: SM 27/132 is not used by any context
[18:59:28.707142] coredump: SM 28/132 is not used by any context
[18:59:28.707146] coredump: SM 29/132 is not used by any context
[18:59:28.707149] coredump: SM 30/132 is not used by any context
[18:59:28.707153] coredump: SM 31/132 is not used by any context
[18:59:28.707157] coredump: SM 32/132 is not used by any context
*[byron/pcg-ca][~/sglang]$ cuda-gdb
NVIDIA (R) cuda-gdb 12.9
Portions Copyright (C) 2007-2025 NVIDIA Corporation
Based on GNU gdb 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This CUDA-GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://forums.developer.nvidia.com/c/developer-tools/cuda-developer-tools/cuda-gdb>.
Find the CUDA-GDB manual and other documentation resources online at:
<https://docs.nvidia.com/cuda/cuda-gdb/index.html>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(cuda-gdb) target cudacore /tmp/cuda_coredump_memx-cla-24-sr1.xpop.twttr.net.2161772.1765306609
Opening GPU coredump: /tmp/cuda_coredump_memx-cla-24-sr1.xpop.twttr.net.2161772.1765306609
[Current focus set to CUDA kernel 0, grid 155886, block (20,0,0), thread (288,0,0), device 3, sm 0, warp 9, lane 0]
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7f0cf1bcec70 void sglang::cross_device_reduce_2stage<__nv_bfloat16, 4>(sglang::RankData*, sglang::RankSignals, sglang::Signal*, __nv_bfloat16*, int, int)
#0 0x00007f0cf1bcecd0 in void sglang::cross_device_reduce_2stage<__nv_bfloat16, 4>(sglang::RankData*, sglang::RankSignals, sglang::Signal*, __nv_bfloat16*, int, int)
<<<(36,1,1),(512,1,1)>>> ()
(cuda-gdb)
Summary
Fix custom all-reduce support in piecewise CUDA graph by properly registering IPC buffers during graph capture.
Motivation
Custom all-reduce fails with piecewise CUDA graph:
Root Cause
1. Missing IPC buffer registration
In the regular CudaGraphRunner, the graph_capture() context internally calls ca_comm.capture(), which sets _IS_CAPTURING = True during capture and calls register_graph_buffers() after capture to register the IPC addresses. During capture, while _IS_CAPTURING = True, the buffer addresses used by allreduce are collected. After capture, register_graph_buffers() exchanges IPC handles between ranks and opens peer pointers. Without this, allreduce accesses invalid pointers → illegal memory access. (Details on how IPC handles work in custom all-reduce + CUDA graph)
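A minimal sketch of that flow, assuming the behavior described above (the names ca_comm.capture(), register_graph_buffers(), and _IS_CAPTURING come from this description; the class and its fields are illustrative, not the actual SGLang implementation):

```python
from contextlib import contextmanager

class CustomAllreduceSketch:
    """Illustrative stand-in for the custom all-reduce communicator (ca_comm)."""

    def __init__(self):
        self._IS_CAPTURING = False
        self.graph_buffer_addrs = []  # input addresses recorded during capture

    @contextmanager
    def capture(self):
        # While capturing, allreduce calls record their input buffer addresses
        # instead of assuming the buffers are already IPC-registered.
        self._IS_CAPTURING = True
        try:
            yield
        finally:
            self._IS_CAPTURING = False
            # After the graph is captured, exchange IPC handles across ranks and
            # open peer pointers for every recorded buffer.
            self.register_graph_buffers()

    def register_graph_buffers(self):
        # Placeholder: in the real communicator this exchanges IPC handles and
        # opens peer pointers so captured kernels see valid remote addresses.
        pass
```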
2. Incorrect warmup iterations (3 → 2)
With warmup_torch_compile() already running the model once, 3 additional iterations means a replay during capture tries to use IPC addresses before register_graph_buffers() is called (it is only called after the capture context exits).
Modifications
- Use the ca_comm.capture() context for proper IPC buffer registration.
- Remove the disable_ca_comm and use_original_ca_comm workaround that previously disabled custom allreduce entirely during piecewise operations. ([piecewise] Refactor VLM to support input embed buffer and remove external embedder hack #14155)
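A minimal sketch of the resulting capture path (the function and variable names here are hypothetical; only ca_comm.capture(), nullcontext, and register_graph_buffers() come from the change itself):

```python
from contextlib import nullcontext

def capture_with_custom_allreduce(ca_comm, capture_model):
    # capture_model() is a placeholder for the piecewise CUDA graph capture;
    # ca_comm is the custom all-reduce communicator, or None if it is disabled.
    ctx = ca_comm.capture() if ca_comm is not None else nullcontext()
    with ctx:
        graphs = capture_model()  # allreduce buffer addresses are recorded here
    # On context exit, register_graph_buffers() has exchanged the IPC handles,
    # so replaying `graphs` dereferences valid peer pointers.
    return graphs
```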
Debugging Notes
The error message points to an illegal memory access. I also used a CUDA coredump (see the repro and cuda-gdb session above) to pinpoint that the error is in custom all-reduce.