[NVIDIA] upstream FA4 #15182
Conversation
Summary of Changes

This pull request introduces significant enhancements to the FlashAttention v4 implementation, primarily by adding robust support for split Key-Value (KV) computation and more flexible mask handling. These changes are designed to optimize performance and adaptability for various attention patterns and sequence lengths. The update also includes a new heuristic for determining optimal splits, refines the kernel compilation process to accommodate these advanced features, and integrates a new function to combine results from split operations.
Code Review
This pull request brings in upstream changes from FlashAttention 4, notably adding support for split KV computation. This includes a new heuristic for determining the number of splits, new parameters to _flash_attn_fwd, updated logic for windowing and masking, and a new _flash_attn_fwd_combine function to merge partial results. The compile keys and kernel initializations are also updated accordingly. The changes are well-structured and the new functionality is a significant addition. I have one suggestion to improve input validation in the new combine function.
```python
for t, name in [
    (cu_seqlens, "cu_seqlens"),
    (seqused, "seqused"),
    (num_splits_dynamic_ptr, "num_splits_dynamic_ptr"),
]:
```
For completeness and safety, the semaphore_to_reset tensor should also be validated in this loop, similar to the other optional tensor arguments. While it's not used in the current call sites, adding this validation will prevent potential issues if it's used in the future.
Suggested change:

```diff
 for t, name in [
     (cu_seqlens, "cu_seqlens"),
     (seqused, "seqused"),
     (num_splits_dynamic_ptr, "num_splits_dynamic_ptr"),
+    (semaphore_to_reset, "semaphore_to_reset"),
 ]:
```
```python
    device=device,
)
lse_partial = torch.empty(
    num_splits, *lse_shape, dtype=torch.float32, device=device
)
```
Is this cudagraph friendly?
/tag-and-rerun-ci
@Qiaolin-Yu I'm on vacation right now... I updated to the latest FlashAttention commit, thanks to the other fixes. The basics in sgl-kernel are working.
Enjoy your vacation! Don’t worry, I’ll take care of wrapping things up.
Force-pushed from a0f7949 to c8bbf66
Will move directly to the latest upstream.
Force-pushed from 9e274de to 4b24aeb
B200 tests passed with the newest CuTe DSL and tvm-ffi.
The latest FA4 commit changed interface.py a little. I think we can update it.
Test results with #16034
I have updated it.
Thanks, that fixes some bugs on SM100.
- Heuristic to determine the number of splits for split-KV computation (a hedged sketch follows)
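To make the heuristic concrete, here is a minimal sketch in the spirit of FlashAttention's split-KV heuristics: when the unsplit grid would leave SMs idle, pick the split count that maximizes wave-quantization efficiency. The function name, signature, and the 0.8 occupancy threshold are illustrative assumptions, not the upstream code:

```python
import math

def num_splits_heuristic_sketch(total_mblocks: int, num_sms: int,
                                num_n_blocks: int, max_splits: int) -> int:
    # If the grid already keeps most SMs busy, splitting only adds
    # combine overhead.
    if total_mblocks >= 0.8 * num_sms:
        return 1
    # Never split finer than one KV block per split or one split per SM.
    max_splits = min(max_splits, num_sms, num_n_blocks)
    best_splits, best_eff = 1, 0.0
    for s in range(1, max_splits + 1):
        n_waves = (total_mblocks * s) / num_sms
        eff = n_waves / math.ceil(n_waves)  # wave-quantization efficiency
        if eff > best_eff + 1e-4:
            best_eff, best_splits = eff, s
    return best_splits
```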
- New `_flash_attn_fwd` parameters:
  - `num_splits: int = 1` (line 70)
  - `mask_mod: Optional[Callable] = None` (line 74)
- Supports the `mask_mod` parameter
- Handles causal/local-window logic with `mask_mod`
- Computes `is_split_kv` based on `num_splits`
- Creates `out_partial` and `lse_partial` tensors when split KV is enabled
- Automatic split calculation when `num_splits < 1` (see the sketch after this list)
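A minimal sketch of how these pieces could fit together. The helper name `_prepare_split_kv` and the `out_partial` shape are assumptions; only the `lse_partial` allocation and the `num_splits < 1` trigger are taken from the diff, and the heuristic is the sketch above:

```python
import torch

def _prepare_split_kv(q, lse_shape, num_splits, total_mblocks,
                      num_sms, num_n_blocks):
    # num_splits < 1 requests the automatic split calculation.
    if num_splits < 1:
        num_splits = num_splits_heuristic_sketch(
            total_mblocks, num_sms, num_n_blocks, max_splits=128)
    is_split_kv = num_splits > 1
    out_partial = lse_partial = None
    if is_split_kv:
        # Partial results carry a leading num_splits dimension and are
        # accumulated in fp32, then merged by the combine kernel.
        out_partial = torch.empty(num_splits, *q.shape,
                                  dtype=torch.float32, device=q.device)
        lse_partial = torch.empty(num_splits, *lse_shape,
                                  dtype=torch.float32, device=q.device)
    return num_splits, is_split_kv, out_partial, lse_partial
```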
- Compile-key updates (see the sketch after this list):
  - Added `score_mod_hash` and `mask_mod_hash`
  - Added an `is_split_kv` flag
  - Added a `paged_kv_non_tma` flag
  - Added a buffer count in place of the `aux_tensors` count
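For illustration, a hedged sketch of what such a compile key could look like. The field names come from the list above; the container type, the `dtype`/`head_dim` fields, and the example values are assumptions:

```python
from typing import NamedTuple, Optional

class CompileKeySketch(NamedTuple):
    dtype: str                     # assumed base field
    head_dim: int                  # assumed base field
    score_mod_hash: Optional[str]  # hash of the score_mod callable, if any
    mask_mod_hash: Optional[str]   # hash of the mask_mod callable, if any
    is_split_kv: bool              # selects the split-KV kernel variant
    paged_kv_non_tma: bool         # paged-KV path without TMA loads
    num_buffers: int               # buffer count, replacing the aux_tensors count

# Distinct keys map to separately compiled and cached kernels.
key = CompileKeySketch("bfloat16", 128, None, None, True, False, 2)
```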
- Kernel initialization updates (a construction sketch follows):
  - `FlashAttentionForwardSm90` (lines 382-401): added `mask_mod`, `intra_wg_overlap=True`, `mma_pv_is_rs=True`
  - `FlashAttentionForwardSm100` (lines 407-428): added `mask_mod`, `is_split_kv`, `paged_kv_non_tma`, `is_varlen_q`, `m_block_size`, `n_block_size`
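A hedged sketch of the SM100 construction: only the keyword names come from the summary above; the import path is guessed by analogy with the `flash_fwd_combine` import below, and every value shown is a placeholder rather than the upstream call:

```python
# Import path assumed by analogy with the flash_fwd_combine import below.
from flash_attn_origin.cute.flash_fwd import FlashAttentionForwardSm100

fwd_sm100 = FlashAttentionForwardSm100(
    mask_mod=None,           # optional mask callable from the new API
    is_split_kv=True,        # compile the partial-output code path
    paged_kv_non_tma=False,  # paged-KV path without TMA loads
    is_varlen_q=False,       # variable-length query sequences
    m_block_size=128,        # query tile size (assumed value)
    n_block_size=128,        # key/value tile size (assumed value)
)
```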
- Combines partial outputs from the split-KV computation (reference semantics sketched below)
- Includes caching logic
- Calls `_flash_attn_fwd_combine` when `is_split_kv` is True
- New import: `from flash_attn_origin.cute.flash_fwd_combine import FlashAttentionForwardCombine`
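The combine step follows the standard split-KV reduction: each split writes a partial output and a partial log-sum-exp (LSE), and the final output is the LSE-weighted sum across splits. A minimal reference sketch of those semantics, with layout assumptions noted in the docstring; the actual `FlashAttentionForwardCombine` kernel fuses this reduction:

```python
import torch

def combine_splits(out_partial: torch.Tensor, lse_partial: torch.Tensor):
    """Reference semantics only. Assumes out_partial is
    (num_splits, ..., head_dim) and lse_partial is (num_splits, ...),
    with the non-split dimensions of the two tensors aligned."""
    lse = torch.logsumexp(lse_partial, dim=0)            # global LSE
    weights = torch.exp(lse_partial - lse.unsqueeze(0))  # per-split scales
    out = (weights.unsqueeze(-1) * out_partial).sum(dim=0)
    return out, lse
```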