
[b200] fix piecewise cuda graph launch bug #12067

Merged
ispobock merged 1 commit into main from piecewise_cuda_graph_support_in_b200
Oct 24, 2025

Conversation

@BBuf (Collaborator) commented Oct 24, 2025

Motivation

This PR fixes two bugs when running on NVIDIA B200 (Blackwell) GPUs with piecewise CUDA graph enabled:

Bug 1: Missing forward metadata initialization

Problem: AttributeError: 'DecodeMetadata' object has no attribute 'prefill_wrappers' occurs during warmup phase in piecewise CUDA graph runner.

Root Cause: The warmup_and_capture() method in piecewise_cuda_graph_runner.py calls model.forward() directly without first initializing the attention backend's forward metadata, leaving the metadata uninitialized or in a stale state.

Fix: Add self.model_runner.attn_backend.init_forward_metadata(forward_batch) before calling model.forward() in the warmup phase.
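The ordering bug above can be illustrated with a minimal runnable sketch. All class and attribute names here are stand-ins mirroring the PR description, not the actual SGLang source:

```python
# Minimal sketch of bug 1: the attention backend's forward() depends on
# per-batch metadata that must be built before the warmup forward pass.
# AttnBackend and warmup_and_capture are illustrative stand-ins.

class AttnBackend:
    """Stand-in attention backend whose forward() requires metadata."""

    def __init__(self):
        self.forward_metadata = None

    def init_forward_metadata(self, forward_batch):
        # In the real backend this builds per-batch state such as the
        # prefill wrappers named in the AttributeError.
        self.forward_metadata = {"batch_size": forward_batch["batch_size"]}

    def forward(self, forward_batch):
        if self.forward_metadata is None:
            # The failure mode the PR fixes: metadata never initialized.
            raise AttributeError("forward metadata was never initialized")
        return f"ran batch of size {self.forward_metadata['batch_size']}"


def warmup_and_capture(attn_backend, forward_batch):
    # The fix: initialize metadata BEFORE the warmup forward pass.
    attn_backend.init_forward_metadata(forward_batch)
    return attn_backend.forward(forward_batch)


print(warmup_and_capture(AttnBackend(), {"batch_size": 8}))
# prints: ran batch of size 8
```

Calling `attn_backend.forward()` without the `init_forward_metadata` line reproduces the AttributeError described above.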

Bug 2: CUTLASS backend TMA descriptor initialization failures on B200

Problem: When using CUTLASS backend with piecewise CUDA graph on B200, the server crashes with Error: Failed to initialize the TMA descriptor errors showing invalid tensor dimensions (e.g., globalDim (128,0,8,1,4) with zero dimension).

Root Cause: FlashInfer's CUTLASS backend has compatibility issues with B200's TMA (Tensor Memory Accelerator) features when combined with piecewise CUDA graph capture.

Fix: Disable the CUTLASS backend and fall back to FlashAttention-2 only when both B200 (SM100) and piecewise CUDA graph are enabled; a warning message informs users of this backend selection. When piecewise CUDA graph is disabled, B200 still uses CUTLASS for optimal performance.
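The selection rule can be sketched as a small predicate. The helper name and flag names below are assumptions for illustration, not the actual SGLang API:

```python
# Hedged sketch of the backend-selection rule described above.
# choose_prefill_backend and its parameters are illustrative names.
import logging

logger = logging.getLogger(__name__)


def choose_prefill_backend(is_sm100: bool, piecewise_cuda_graph: bool) -> str:
    # CUTLASS hits TMA descriptor initialization failures on B200 (SM100)
    # only in combination with piecewise CUDA graph capture, so fall back
    # to FlashAttention-2 in exactly that case and keep CUTLASS otherwise.
    if is_sm100 and piecewise_cuda_graph:
        logger.warning(
            "Disabling CUTLASS backend on B200 with piecewise CUDA graph; "
            "falling back to FlashAttention-2."
        )
        return "fa2"
    return "cutlass"
```

Note that the fallback is scoped to the problematic combination only, so a B200 without piecewise CUDA graph, or a non-B200 GPU with it, keeps CUTLASS.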

Launch command:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct --tp 4 --host 0.0.0.0 --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager

Result:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
100%|████████████████████████████████████████████████████████████| 1319/1319 [00:22<00:00, 59.25it/s]
Accuracy: 0.929
Invalid: 0.001
Latency: 22.355 s
Output throughput: 7932.862 token/s

Without the fix for bug 2:

100%|████████████████████████████████████████████████████████████| 1319/1319 [00:55<00:00, 23.56it/s]
Accuracy: 0.174
Invalid: 0.817
Latency: 56.065 s
Output throughput: 10410.864 token/s


@gemini-code-assist (Contributor)
Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical issues affecting the stability and performance of the system on NVIDIA B200 GPUs when utilizing piecewise CUDA graphs. It addresses a metadata initialization bug during the warmup phase and tackles compatibility problems with the CUTLASS backend, ensuring robust operation and preventing crashes. The changes improve the reliability and accuracy of model execution under these specific hardware and configuration settings.

Highlights

  • Fix AttributeError during warmup with piecewise CUDA graph: Resolved an AttributeError: 'DecodeMetadata' object has no attribute 'prefill_wrappers' that occurred during the warmup phase of the piecewise CUDA graph runner. The root cause was the model.forward() call in warmup_and_capture() missing the necessary attn_backend.init_forward_metadata() initialization. The fix adds this initialization before the forward pass.
  • Address CUTLASS backend incompatibility on B200 with piecewise CUDA graph: Fixed server crashes on NVIDIA B200 GPUs when using the CUTLASS backend with piecewise CUDA graphs, which were caused by TMA descriptor initialization failures. The solution intelligently disables the CUTLASS backend and falls back to FlashAttention-2 when both B200 (SM100) and piecewise CUDA graph are enabled, while allowing CUTLASS for optimal performance when the graph is disabled. A warning message is now displayed to inform users of this backend selection.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces two important fixes for running on NVIDIA B200 GPUs with piecewise CUDA graph enabled. The first fix correctly initializes attention metadata during the warmup phase, and the second fix disables the CUTLASS backend to avoid TMA descriptor issues, falling back to a stable alternative.

While reviewing the changes, I noticed the removal of init_forward_metadata from the capture_one_batch_size function. This appears to be a potential regression, as this call is necessary to prepare the correct attention metadata for each graph capture. I've added a critical comment with a suggestion to restore this call. Other than that, the changes look good and address the described issues effectively.

@@ -375,9 +378,6 @@ def capture_one_batch_size(self, num_tokens: int):
        if lora_ids is not None:
            self.model_runner.lora_manager.prepare_lora_batch(forward_batch)


critical

The call to self.model_runner.attn_backend.init_forward_metadata(forward_batch) was removed from this function. This seems incorrect, as the CUDA graph capture in run_once relies on the metadata being initialized for the current forward_batch. Without this call, the capture will use stale metadata from the warmup_and_capture phase, which is for a different batch configuration, and will likely produce errors or incorrect behavior. Please consider re-adding this call:

        self.model_runner.attn_backend.init_forward_metadata(forward_batch)

@ispobock ispobock merged commit 8470133 into main Oct 24, 2025
44 of 71 checks passed
@ispobock ispobock deleted the piecewise_cuda_graph_support_in_b200 branch October 24, 2025 14:36
