Update FA to latest upstream#28

Merged
Fridge003 merged 18 commits into sgl-kernel from upstream on Jan 10, 2026
Conversation

@Fridge003
Collaborator

No description provided.

tridao and others added 18 commits December 31, 2025 18:15
Previously we signaled per warp group, but that made the code more complicated
for a tiny bit of perf gain.
* improved block sparsity computation

* refactor blocksparsity computation for tvm-ffi

* refactor mask mod definitions and tests

* refactor of block sparsity and mask mod application; eventually allow varlen

* remove fastdivmods from compute block sparsity

* remove unnecessary imports

* revert to 1-phase block sparsity computation

* update bwd kernels to use new AttentionMaskCls api

* fix linter error

* use q_stage=1 for split kv

* determine q_stage via seqlen_q for sm100

* repurpose softmax1 warps for cp.async load

* address comments

* [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py

The paged KV cache tests (test_score_mod_with_paged_kvcache and
test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY
to skip tests on SM90 since paged KV cache is only supported on SM100.
However, the variable was never defined, causing a NameError.

This adds the same definition used in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]
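A minimal sketch of the guard this commit adds. The helper `get_device_capability` below is a stand-in for `torch.cuda.get_device_capability()` (no GPU is assumed here), and `should_skip_paged_kvcache_test` is a hypothetical name for the skip condition described in the commit message, not the actual test code:

```python
def get_device_capability():
    # Stand-in for torch.cuda.get_device_capability(), which returns a
    # (major, minor) tuple, e.g. (9, 0) for SM90 or (10, 0) for SM100.
    return (9, 0)

# The definition the commit adds, mirroring test_mask_mod.py: keep only
# the major compute capability.
COMPUTE_CAPABILITY = get_device_capability()[0]

def should_skip_paged_kvcache_test(compute_capability):
    # Paged KV cache is only supported on SM100, so the paged KV cache
    # tests are skipped on SM90 and earlier.
    return compute_capability < 10
```

Without the `COMPUTE_CAPABILITY` assignment, the skip check itself raised `NameError` before the test could even decide whether to skip.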

* [Cute] Fix missing seqlen_info parameter in mask_mod call

The mask_mod call in apply_mask_sm100_transposed was missing the
seqlen_info parameter. All mask functions expect the signature:
(batch, head, m_idx, n_idx, seqlen_info, aux_tensors)

The other two mask_mod calls in the same file correctly pass all 6
arguments, but this one only passed 5, causing:
TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors'

This fixes test_mask_mod.py::test_mask_mod_ima_partial_block.
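The calling convention at issue can be sketched as below. The 6-parameter signature is taken from the commit message; `causal_mask` and its trivial body are hypothetical illustrations, not the kernel's mask functions:

```python
def causal_mask(batch, head, m_idx, n_idx, seqlen_info, aux_tensors):
    # A mask_mod decides whether query position m_idx may attend to key
    # position n_idx. All mask functions take these six arguments.
    return n_idx <= m_idx

# Correct call site: all six arguments are passed.
ok = causal_mask(0, 0, 4, 2, None, None)

# The buggy call site passed only five, so Python reports the last
# parameter as missing, matching the TypeError quoted above.
try:
    causal_mask(0, 0, 4, 2, None)
    err = ""
except TypeError as e:
    err = str(e)  # mentions the missing 'aux_tensors' argument
```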
* varlen bwd with rounded padded offsets

* fix mha

* change offset mode to round down multiple

* enable varlen bwd tests

* enable deterministic mode

* fix deadlock and switch mha to no postprocess

* reenable tests

* fix lint error

* use head swizzle/spt for deterministic, update tests

* change padding offset based on arch

* rebase and update interface, tests

* add arch dispatch for padded offset q to postprocess

* address comments

* remove tile sizes from seqlen info class vars
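The "round down multiple" offset mode mentioned in the varlen bwd commits can be sketched as follows; the function name, the tile width of 128, and the example offsets are assumptions for illustration, not the kernel's actual constants:

```python
def round_down_multiple(x: int, m: int) -> int:
    # Round x down to the nearest multiple of m, e.g. aligning a varlen
    # sequence offset to a tile boundary before the bwd postprocess.
    return (x // m) * m

# Aligning hypothetical per-sequence offsets to a 128-wide tile:
offsets = [0, 130, 300]
padded = [round_down_multiple(o, 128) for o in offsets]  # [0, 128, 256]
```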
Fridge003 merged commit f866ec3 into sgl-kernel on Jan 10, 2026
3 of 4 checks passed
Fridge003 deleted the upstream branch on January 10, 2026 at 09:12


9 participants