Fix: several bugs/issues with trtllm-gen attention kernels. #2062
yzh119 merged 5 commits into flashinfer-ai:main from
Conversation
Walkthrough: Updates the TRTLLM FMHA artifact path and checksum constants; extends the FMHA kernel hash encoding to include a new sparseMla flag with an adjusted bit-field layout and stricter head-dimension checks; and adds paged-KV / sparse-related fields to KernelParams with zero-initialization and log2 computation for numTokensPerPage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Runner as Runner / Dispatch
    participant Selector as Kernel Selector
    participant Meta as KernelMeta
    participant Loader as Kernel Loader
    Note over Runner,Selector: Build selection key from runtime params
    Runner->>Selector: hashFromRunnerParams(params, /* sparseMla */ false)
    Selector->>Meta: select candidate KernelMeta
    Note right of Meta: KernelMeta includes mSparseMla
    Selector->>Loader: hashID(kernelMeta, sparseMla=Meta.mSparseMla)
    Loader->>Loader: assemble 64-bit hash (includes sparseMla bit, log2(numTokensPerPage))
    Loader->>Runner: return selected kernel / load artifacts (uses updated artifact checksum)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Code Review
This pull request updates artifact hashes and refines the kernel selection logic for trtllm-gen attention kernels. Key changes include adding a sparseMla parameter to the hashID function, adjusting bit shifts for head dimensions, and enforcing that numTokensPerPage must be a power of 2. New members have been added to the KernelParams struct to support these changes, and the struct is now explicitly zero-initialized using memset for improved safety. These modifications appear to address the reported CUDA launch errors and masking bugs, enhancing the robustness and correctness of the attention kernels.
@PerkzZheng would you mind rebasing to the main branch? It seems there are some merge conflicts.
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Force-pushed from 8dc0a1b to e4d7f46
It was rebased to a wrong remote. It should be good now. Thanks.
nvmbreughe left a comment:
LGTM. Just wondering: for what config did we get failures without this fix? I think it would be good to have a test. I can add it after this PR.
```diff
 )
 TRTLLM_GEN_BMM: str = (
-    "46ccf0492e3ed10135c2861a4f4ef9bb45846610f9a9d2ccaf2d5bf01d2006fd"
+    "1ebace613389a4f2e10b14315da5d522642c5dcaae23f01213d56c59068f148b"
```
Why do we need to update the BMM hash in this PR?
/bot run

[FAILED] Pipeline #38107936: 7/17 passed

/bot run

[FAILED] Pipeline #38135771: 14/17 passed
…er-ai#2062)

## 📌 Description

This MR fixes:
1. unspecified cuda launch errors with 2CTA MLA kernels
2. masking bug of SWA decode kernels.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **New Features**
  * Added Sparse MLA support and propagated its flag through kernel selection and dispatch.
* **Bug Fixes / Improvements**
  * Enforced power-of-two page sizing for paged KV caches and tightened head-dimension limits for broader hardware compatibility.
  * Updated kernel trait encoding and hash construction to include the sparse MLA flag and revised bit-field layout.
* **Chores**
  * Updated runtime kernel artifact identifiers and checksums.
  * Extended kernel parameter fields, zero-initialized params on setup, and populated tokens-per-page log2 for paged KV.

---

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
Co-authored-by: Zihao Ye <expye@outlook.com>