misc: Add XQA decode to microbenchmark for sm90 and sm120 (#2055)
yzh119 merged 3 commits into flashinfer-ai:main
Conversation
Walkthrough

This PR renames and normalizes a deprecated backend identifier: it replaces occurrences of `trtllm-gen-native` with the canonical `trtllm-native` across the benchmark routines, utilities, and README, and adds argument-parsing normalization that maps the deprecated name to the new one while emitting a deprecation warning.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant CLI as parse_attention_args
    participant Norm as normalize_backends
    participant Runner as BenchmarkRunner
    User->>CLI: invoke CLI with --backends [user list]
    CLI->>Norm: normalize_backends(backends)
    Note right of Norm: replace deprecated<br/>"trtllm-gen-native" → "trtllm-native"<br/>and emit warning
    Norm-->>CLI: normalized backends
    CLI->>Runner: start benchmark with normalized backends
    Runner-->>User: run results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request extends the microbenchmarking framework to support performance evaluation of XQA decode kernels via the `trtllm-gen-native` backend on SM90 and SM120 architectures.
Code Review
This pull request adds support for benchmarking the XQA decode kernel (trtllm-gen-native backend) on SM90 and SM120 architectures. The changes correctly update the routine_cc_to_supported_backends dictionary to include this new backend for the specified compute capabilities. The addition of comments explaining which underlying functions are called by the trtllm-gen-native backend is a good step towards improving code clarity. I have one suggestion to enhance maintainability by applying this commenting practice consistently for other backends as well.
```python
routine_cc_to_supported_backends = {
    # ATTENTION
    "BatchDecodeWithPagedKVCacheWrapper": {
        # NOTE: trtllm-gen-native calls trtllm_batch_decode_with_kv_cache
```
These new comments explaining which function the trtllm-gen-native backend calls are very helpful for understanding the code. However, this information is only provided for the trtllm-gen-native backend, while other backends like fa2, cudnn, and trtllm-gen lack similar explanations.
For consistency and better maintainability, I suggest adding similar comments for the other backends across all routines in this dictionary. This will make it easier for future contributors to understand the purpose of each backend.
For example, for BatchDecodeWithPagedKVCacheWrapper:
"BatchDecodeWithPagedKVCacheWrapper": {
# NOTE: 'fa2' uses FlashAttention-2, 'cudnn' uses cuDNN SDPA, etc.
# NOTE: 'trtllm-gen-native' calls trtllm_batch_decode_with_kv_cache
"7.5": ["fa2"],
# ...
},Alternatively, if this dictionary is expected to grow, consider refactoring this information into a more structured format, like a separate dictionary mapping routines and backends to the functions they call. This would be more scalable.
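A minimal sketch of what such a structured mapping could look like. The dictionary name, the helper, and the per-backend descriptions are illustrative assumptions, not code from the repository:

```python
# Hypothetical sketch: keep the (routine, backend) -> implementation notes in one
# place so the support matrix and the explanatory comments do not drift apart.
# Names and entries here are illustrative only.
ROUTINE_BACKEND_IMPLEMENTATIONS = {
    "BatchDecodeWithPagedKVCacheWrapper": {
        "fa2": "FlashAttention-2 decode path",
        "cudnn": "cuDNN paged-attention decode",
        "trtllm-native": "trtllm_batch_decode_with_kv_cache",
    },
}


def describe_backend(routine: str, backend: str) -> str:
    """Return a human-readable note for a routine/backend pair, if one is recorded."""
    return ROUTINE_BACKEND_IMPLEMENTATIONS.get(routine, {}).get(
        backend, "no implementation note recorded"
    )


print(describe_backend("BatchDecodeWithPagedKVCacheWrapper", "trtllm-native"))
```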
| "8.6": ["fa2", "fa2_tc", "cudnn"], | ||
| "8.9": ["fa2", "fa2_tc", "cudnn"], | ||
| "9.0": ["fa2", "fa2_tc", "cudnn"], | ||
| "9.0": ["fa2", "fa2_tc", "cudnn", "trtllm-gen-native"], |
I prefer to remove "gen" from trtllm: trtllm-gen is the codegen framework designed specifically for sm_100 and sm_103, and the other backends do not go through trtllm-gen.
Agree with your sentiment for the trtllm-gen-native backends that call the trtllm_batch_... prefill/decode/MLA APIs. For the trtllm kernels called through the wrappers, I would still like to call them trtllm-gen, since that is the actual backend for them as stated in our documentation. For example:
> backend (str) – The implementation backend, could be auto/fa2 or trtllm-gen. Defaults to auto. If set to auto, the wrapper will automatically choose the backend based on the device architecture and kernel availability.
In the latest commit, I updated so that:

- The microbenchmark backend `trtllm-gen-native` is renamed to `trtllm-native`.
- Added a check that replaces a user-provided `trtllm-gen-native` with `trtllm-native` and prints a warning message about future deprecation (a sketch of what such a check could look like is shown after this list).
- Replaced pretty much all instances of `trtllm-gen-native` with `trtllm-native`.
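A minimal sketch of how such alias normalization could be wired into the benchmark's argument parsing. The helper name `normalize_backends` follows the CodeRabbit diagram above; the backend list, flag defaults, and warning wording are assumptions for illustration, not the repository's actual code:

```python
import argparse
import warnings

# Illustrative only: the real benchmark exposes a longer choices list.
_CANONICAL_BACKENDS = ["fa2", "fa2_tc", "cudnn", "trtllm-gen", "trtllm-native"]
_DEPRECATED_ALIASES = {"trtllm-gen-native": "trtllm-native"}


def normalize_backends(backends):
    """Map deprecated backend names to their canonical form, warning on each hit."""
    normalized = []
    for backend in backends:
        if backend in _DEPRECATED_ALIASES:
            canonical = _DEPRECATED_ALIASES[backend]
            warnings.warn(
                f"Backend '{backend}' is deprecated; use '{canonical}' instead.",
                DeprecationWarning,
            )
            backend = canonical
        normalized.append(backend)
    return normalized


def parse_attention_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--backends",
        nargs="+",
        # Accept both canonical names and deprecated aliases so old commands keep working.
        choices=_CANONICAL_BACKENDS + list(_DEPRECATED_ALIASES),
        default=["fa2"],
    )
    args = parser.parse_args(argv)
    args.backends = normalize_backends(args.backends)
    return args


if __name__ == "__main__":
    # Deprecated alias on the command line is accepted, warned about, and canonicalized.
    print(parse_attention_args(["--backends", "fa2", "trtllm-gen-native"]).backends)
```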
I see, then I suppose we should update the unified attention wrapper as well (in a future PR), thanks for spotting this issue!
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- benchmarks/README.md (3 hunks)
- benchmarks/routines/attention.py (13 hunks)
- benchmarks/routines/flashinfer_benchmark_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- benchmarks/routines/flashinfer_benchmark_utils.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
benchmarks/routines/attention.py (2)
93-94: LGTM! Well-designed backward compatibility. The approach of accepting both the canonical and deprecated backend names in the `choices` list (lines 93-94), followed by runtime normalization (lines 179-181), provides smooth backward compatibility while guiding users toward the new naming convention.

Also applies to: 179-181
217-217: LGTM! Consistent backend name usage. All references to the TensorRT-LLM native backend have been consistently updated to use `"trtllm-native"` throughout the file, including docstrings, condition checks, and backend validation logic.

Also applies to: 522-522, 646-646, 729-735, 987-987, 1210-1224, 1436-1436, 1570-1570, 1666-1674, 1839-1839
benchmarks/README.md (1)
120-120: LGTM! Documentation properly updated. The documentation has been consistently updated to reflect the canonical backend name `trtllm-native` across the general backends list (line 120), the routine support matrix (lines 220-223), and the backend legend (line 241). This aligns with the code changes and provides clear guidance to users.

Also applies to: 220-223, 241-241
## 📌 Description

In flashinfer-ai#2001, XQA decode kernels became available through `trtllm_batch_decode_with_kv_cache` on SM90 and SM120. The current PR adds the ability to benchmark them through the microbenchmark.

Example microbenchmark command and outputs before and after:

```
### Before current PR:

## SM90 (H200)
$ python3 flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 trtllm-gen-native cudnn --page_size 32 --batch_size 1 --s_qo 1 --s_kv 8192 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --q_dtype bfloat16 --kv_dtype bfloat16 --refcheck --use_cupti
[WARNING] trtllm-gen-native for routine BatchDecodeWithPagedKVCacheWrapper is not supported on compute capability 9.0. Skipping.
[PERF] fa2 :: median time 0.035 ms; std 0.002 ms; achieved tflops 7.721 TFLOPs/sec; achieved tb_per_sec 0.966 TB/sec
[PERF] cudnn :: median time 0.020 ms; std 0.000 ms; achieved tflops 13.519 TFLOPs/sec; achieved tb_per_sec 1.692 TB/sec

## SM120 (RTX 5090)
$ python3 flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 trtllm-gen-native cudnn --page_size 32 --batch_size 1 --s_qo 1 --s_kv 8192 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --q_dtype bfloat16 --kv_dtype bfloat16 --refcheck --use_cupti
[WARNING] trtllm-gen-native for routine BatchDecodeWithPagedKVCacheWrapper is not supported on compute capability 12.0. Skipping.
[PERF] fa2 :: median time 0.033 ms; std 0.001 ms; achieved tflops 8.204 TFLOPs/sec; achieved tb_per_sec 1.027 TB/sec
[PERF] cudnn :: median time 0.030 ms; std 0.000 ms; achieved tflops 8.943 TFLOPs/sec; achieved tb_per_sec 1.119 TB/sec

### After current PR:

## SM90 (H200)
$ python3 flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 trtllm-gen-native cudnn --page_size 32 --batch_size 1 --s_qo 1 --s_kv 8192 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --q_dtype bfloat16 --kv_dtype bfloat16 --refcheck --use_cupti
[PERF] fa2 :: median time 0.035 ms; std 0.002 ms; achieved tflops 7.721 TFLOPs/sec; achieved tb_per_sec 0.966 TB/sec
[PERF] trtllm-gen-nati:: median time 0.019 ms; std 0.002 ms; achieved tflops 13.820 TFLOPs/sec; achieved tb_per_sec 1.729 TB/sec
[PERF] cudnn :: median time 0.020 ms; std 0.000 ms; achieved tflops 13.574 TFLOPs/sec; achieved tb_per_sec 1.698 TB/sec

## SM120 (RTX 5090)
$ python3 flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 trtllm-gen-native cudnn --page_size 32 --batch_size 1 --s_qo 1 --s_kv 8192 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --q_dtype bfloat16 --kv_dtype bfloat16 --refcheck --use_cupti
[PERF] fa2 :: median time 0.033 ms; std 0.001 ms; achieved tflops 8.121 TFLOPs/sec; achieved tb_per_sec 1.016 TB/sec
[PERF] trtllm-gen-nati:: median time 0.034 ms; std 0.001 ms; achieved tflops 7.903 TFLOPs/sec; achieved tb_per_sec 0.989 TB/sec
[PERF] cudnn :: median time 0.030 ms; std 0.001 ms; achieved tflops 9.020 TFLOPs/sec; achieved tb_per_sec 1.129 TB/sec
```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **Chores**
  * Standardized backend identifier to "trtllm-native" and expanded its support across benchmark routines and utilities.
  * Argument parsing now canonicalizes deprecated backend aliases and emits a deprecation warning when encountered.
* **Documentation**
  * README and tool-facing messages updated to use the canonical backend name and include contextual notes about the change.