[b200] fix piecewise cuda graph launch bug #12067
Conversation
Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves critical issues affecting the stability and performance of the system on NVIDIA B200 GPUs when piecewise CUDA graphs are used. It addresses a metadata initialization bug during the warmup phase and tackles compatibility problems with the CUTLASS backend, ensuring robust operation and preventing crashes. The changes improve the reliability and accuracy of model execution under these specific hardware and configuration settings.

Highlights
Code Review
This pull request introduces two important fixes for running on NVIDIA B200 GPUs with piecewise CUDA graph enabled. The first fix correctly initializes attention metadata during the warmup phase, and the second fix disables the CUTLASS backend to avoid TMA descriptor issues, falling back to a stable alternative.
While reviewing the changes, I noticed the removal of init_forward_metadata from the capture_one_batch_size function. This appears to be a potential regression, as this call is necessary to prepare the correct attention metadata for each graph capture. I've added a critical comment with a suggestion to restore this call. Other than that, the changes look good and address the described issues effectively.
@@ -375,9 +378,6 @@ def capture_one_batch_size(self, num_tokens: int):
        if lora_ids is not None:
            self.model_runner.lora_manager.prepare_lora_batch(forward_batch)
The call to self.model_runner.attn_backend.init_forward_metadata(forward_batch) was removed from this function. This seems incorrect as the CUDA graph capture process in run_once relies on the metadata being correctly initialized for the current forward_batch. Without this call, the capture will use stale metadata from the warmup_and_capture phase, which is for a different batch configuration and will likely lead to errors or incorrect behavior. Please consider re-adding this call.
self.model_runner.attn_backend.init_forward_metadata(forward_batch)
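For context, here is a minimal sketch of where the suggested call would sit inside capture_one_batch_size, based only on the diff excerpt and the review comment above; the abbreviated method body and the run_once helper are illustrative assumptions, not the merged code.

```python
# Illustrative sketch only -- the surrounding body is abbreviated and `run_once`
# is an assumption based on the review comment, not verified source code.
def capture_one_batch_size(self, num_tokens: int):
    ...  # build forward_batch for this capture size (elided)
    if lora_ids is not None:
        self.model_runner.lora_manager.prepare_lora_batch(forward_batch)

    # Reviewer's suggestion: rebuild attention metadata for *this* forward_batch
    # so the graph captured below does not reuse stale warmup metadata.
    self.model_runner.attn_backend.init_forward_metadata(forward_batch)

    run_once()  # captures the CUDA graph using the metadata prepared above
```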
Motivation
This PR fixes two bugs when running on NVIDIA B200 (Blackwell) GPUs with piecewise CUDA graph enabled:
Bug 1: Missing forward metadata initialization

Problem: `AttributeError: 'DecodeMetadata' object has no attribute 'prefill_wrappers'` occurs during the warmup phase in the piecewise CUDA graph runner.

Root Cause: The `warmup_and_capture()` method in `piecewise_cuda_graph_runner.py` calls `model.forward()` directly without initializing the attention backend's forward metadata, leaving the metadata uninitialized or in an incorrect state.

Fix: Add `self.model_runner.attn_backend.init_forward_metadata(forward_batch)` before calling `model.forward()` in the warmup phase.
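A minimal sketch of the warmup-phase fix, assuming the rough structure of `warmup_and_capture()`; the method body and the exact `model.forward()` argument list are illustrative, and only the added `init_forward_metadata` call comes from this PR.

```python
# Sketch of warmup_and_capture() in piecewise_cuda_graph_runner.py (abbreviated).
def warmup_and_capture(self, forward_batch):
    # Bug 1 fix: initialize the attention backend's forward metadata before the
    # direct model.forward() call, so the warmup pass does not see uninitialized
    # or stale metadata (the source of the 'prefill_wrappers' AttributeError).
    self.model_runner.attn_backend.init_forward_metadata(forward_batch)

    # Existing warmup forward pass (argument list illustrative).
    self.model_runner.model.forward(
        forward_batch.input_ids, forward_batch.positions, forward_batch
    )
```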
Bug 2: CUTLASS backend TMA descriptor initialization failures on B200

Problem: When using the CUTLASS backend with piecewise CUDA graph on B200, the server crashes with `Error: Failed to initialize the TMA descriptor` errors showing invalid tensor dimensions (e.g., `globalDim (128,0,8,1,4)` with a zero dimension).

Root Cause: FlashInfer's CUTLASS backend has compatibility issues with B200's TMA (Tensor Memory Accelerator) features when combined with piecewise CUDA graph capture.
Fix: Disable the CUTLASS backend and fall back to FlashAttention-2 only when both B200 (SM100) and piecewise CUDA graph are enabled; a clear warning message informs users of this backend selection. When piecewise CUDA graph is disabled, B200 can still use CUTLASS for optimal performance.
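A minimal sketch of the backend-selection guard described above; the helper name and the backend strings are hypothetical, and only the condition (SM100 plus piecewise CUDA graph) and the FlashAttention-2 fallback come from this PR.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def choose_flashinfer_backend(enable_piecewise_cuda_graph: bool) -> str:
    """Hypothetical helper: avoid CUTLASS on SM100 when piecewise CUDA graph is on."""
    major, _minor = torch.cuda.get_device_capability()
    is_sm100 = major == 10  # B200 (Blackwell) reports compute capability 10.x

    if is_sm100 and enable_piecewise_cuda_graph:
        # Bug 2 fix: CUTLASS TMA descriptor initialization fails under piecewise
        # CUDA graph capture on B200, so fall back to FlashAttention-2 and tell
        # the user why this backend was chosen.
        logger.warning(
            "Piecewise CUDA graph + SM100 detected: disabling the CUTLASS backend "
            "and falling back to FlashAttention-2."
        )
        return "fa2"

    # Without piecewise CUDA graph, B200 can keep using CUTLASS for best performance.
    return "cutlass"
```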
Result:

Without the fix for bug 2:
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist