[b200] fix piecewise cuda graph launch bug #12067
Conversation
Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves critical issues affecting the stability and performance of the system on NVIDIA B200 GPUs when piecewise CUDA graphs are used. It addresses a metadata initialization bug during the warmup phase and tackles compatibility problems with the CUTLASS backend, ensuring robust operation and preventing crashes. The changes improve the reliability and accuracy of model execution under these specific hardware and configuration settings.

Highlights
Code Review
This pull request introduces two important fixes for running on NVIDIA B200 GPUs with piecewise CUDA graph enabled. The first fix correctly initializes attention metadata during the warmup phase, and the second fix disables the CUTLASS backend to avoid TMA descriptor issues, falling back to a stable alternative.
While reviewing the changes, I noticed the removal of init_forward_metadata from the capture_one_batch_size function. This appears to be a potential regression, as this call is necessary to prepare the correct attention metadata for each graph capture. I've added a critical comment with a suggestion to restore this call. Other than that, the changes look good and address the described issues effectively.
@@ -375,9 +378,6 @@ def capture_one_batch_size(self, num_tokens: int):
        if lora_ids is not None:
            self.model_runner.lora_manager.prepare_lora_batch(forward_batch)
The call to self.model_runner.attn_backend.init_forward_metadata(forward_batch) was removed from this function. This seems incorrect as the CUDA graph capture process in run_once relies on the metadata being correctly initialized for the current forward_batch. Without this call, the capture will use stale metadata from the warmup_and_capture phase, which is for a different batch configuration and will likely lead to errors or incorrect behavior. Please consider re-adding this call.
self.model_runner.attn_backend.init_forward_metadata(forward_batch)
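For context, here is a minimal sketch of where the suggested call would sit inside capture_one_batch_size, based only on the diff excerpt and the review comment above; the abbreviated method body and the run_once helper are illustrative assumptions, not the merged code.

```python
# Illustrative sketch only -- the surrounding body is abbreviated and `run_once`
# is an assumption based on the review comment, not verified source code.
def capture_one_batch_size(self, num_tokens: int):
    ...  # build forward_batch for this capture size (elided)
    if lora_ids is not None:
        self.model_runner.lora_manager.prepare_lora_batch(forward_batch)

    # Reviewer's suggestion: rebuild attention metadata for *this* forward_batch
    # so the graph captured below does not reuse stale warmup metadata.
    self.model_runner.attn_backend.init_forward_metadata(forward_batch)

    run_once()  # captures the CUDA graph using the metadata prepared above
```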
Motivation
This PR fixes two bugs when running on NVIDIA B200 (Blackwell) GPUs with piecewise CUDA graph enabled:
Bug 1: Missing forward metadata initialization

Problem: `AttributeError: 'DecodeMetadata' object has no attribute 'prefill_wrappers'` occurs during the warmup phase in the piecewise CUDA graph runner.

Root Cause: The `warmup_and_capture()` method in `piecewise_cuda_graph_runner.py` calls `model.forward()` directly without initializing the attention backend's forward metadata, leaving the metadata uninitialized or in an incorrect state.

Fix: Add `self.model_runner.attn_backend.init_forward_metadata(forward_batch)` before calling `model.forward()` in the warmup phase.
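A minimal sketch of the warmup-phase fix, assuming the rough structure of `warmup_and_capture()`; the method body and the exact `model.forward()` argument list are illustrative, and only the added `init_forward_metadata` call comes from this PR.

```python
# Sketch of warmup_and_capture() in piecewise_cuda_graph_runner.py (abbreviated).
def warmup_and_capture(self, forward_batch):
    # Bug 1 fix: initialize the attention backend's forward metadata before the
    # direct model.forward() call, so the warmup pass does not see uninitialized
    # or stale metadata (the source of the 'prefill_wrappers' AttributeError).
    self.model_runner.attn_backend.init_forward_metadata(forward_batch)

    # Existing warmup forward pass (argument list illustrative).
    self.model_runner.model.forward(
        forward_batch.input_ids, forward_batch.positions, forward_batch
    )
```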
Bug 2: CUTLASS backend TMA descriptor initialization failures on B200

Problem: When using the CUTLASS backend with piecewise CUDA graph on B200, the server crashes with `Error: Failed to initialize the TMA descriptor` errors showing invalid tensor dimensions (e.g., `globalDim (128,0,8,1,4)` with a zero dimension).

Root Cause: FlashInfer's CUTLASS backend has compatibility issues with B200's TMA (Tensor Memory Accelerator) features when combined with piecewise CUDA graph capture.
Fix: Disable the CUTLASS backend and fall back to FlashAttention-2 only when both B200 (SM100) and piecewise CUDA graph are enabled; a clear warning message informs users of this backend selection. When piecewise CUDA graph is disabled, B200 can still use CUTLASS for optimal performance.
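A minimal sketch of the backend-selection guard described above; the helper name and the backend strings are hypothetical, and only the condition (SM100 plus piecewise CUDA graph) and the FlashAttention-2 fallback come from this PR.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def choose_flashinfer_backend(enable_piecewise_cuda_graph: bool) -> str:
    """Hypothetical helper: avoid CUTLASS on SM100 when piecewise CUDA graph is on."""
    major, _minor = torch.cuda.get_device_capability()
    is_sm100 = major == 10  # B200 (Blackwell) reports compute capability 10.x

    if is_sm100 and enable_piecewise_cuda_graph:
        # Bug 2 fix: CUTLASS TMA descriptor initialization fails under piecewise
        # CUDA graph capture on B200, so fall back to FlashAttention-2 and tell
        # the user why this backend was chosen.
        logger.warning(
            "Piecewise CUDA graph + SM100 detected: disabling the CUTLASS backend "
            "and falling back to FlashAttention-2."
        )
        return "fa2"

    # Without piecewise CUDA graph, B200 can keep using CUTLASS for best performance.
    return "cutlass"
```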
Result:

Without the fix for bug 2:
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist