[PyTorch] Enable head dim 256 for FA4 by yaox12 · Pull Request #2932 · NVIDIA/TransformerEngine

yaox12 · 2026-04-27T09:29:44Z

Description

Requires FA4 version 4.0.0b11.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-04-27T09:33:35Z

Greptile Summary

This PR enables head_dim=256 support for FlashAttention 4 on SM100/SM103 GPUs by delegating head-dimension validation to FA4's own _validate_head_dims function instead of maintaining a parallel static guard in TE, and bumps the required FA4 version to 4.0.0b11.

backends.py: _validate_head_dims is imported alongside flash_attn_func/flash_attn_varlen_func in a single grouped import; if absent in an older FA4 install, an uncaught ImportError crashes the entire module load (previously flagged).
utils.py: Replaces the static per-arch head-dim check with a live call to FA4's validator; adds an SM100 cross-attention fallback for hd256 shapes; the MLA misalignment workaround is preserved as independent if checks; v4_installation_steps is correctly updated to 4.0.0b11.
test_attention.py: Adds test_dpa_fa4_hdim256 with an explicit SM100/SM103 skipif guard, and removes stale cuDNN version checks from all FA4 tests.

Confidence Score: 4/5

The core logic in utils.py is sound, but the grouped import in backends.py will crash the entire TE module load for any user who has FA4 installed at a version older than 4.0.0b11.

The import in backends.py bundles _validate_head_dims into the same grouped block as the two core FA4 functions. Any FA4 install older than 4.0.0b11 that lacks this symbol triggers an unhandled ImportError at module load time, making TE unusable for those users.

transformer_engine/pytorch/attention/dot_product_attention/backends.py — the grouped FA4 import is the critical path that warrants a second look before merging.
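
One way to avoid the load-time crash described above is to resolve the validator separately from the two core functions. The following is a minimal sketch, not the code merged in this PR: `optional_attr` is a hypothetical helper, and the module path shown in the usage comment is an assumption about where FA4 exposes `_validate_head_dims`.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module.attr if both exist, otherwise None (hypothetical helper)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package (or this submodule) is not installed at all.
        return None
    # An older install may have the module but lack the newer symbol.
    return getattr(module, attr_name, None)


# Assumed usage; the module path is a guess, not confirmed by this PR:
# v4_validate_head_dims = optional_attr("flash_attn.cute.interface", "_validate_head_dims")
```

With this shape, a missing symbol leaves the validator as `None` instead of raising at import time, which matches the `v4_validate_head_dims is not None` branch in the flowchart below.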

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/attention/dot_product_attention/backends.py | Adds `_validate_head_dims` to the same grouped import as `flash_attn_func`/`flash_attn_varlen_func`; an `ImportError` on older FA4 (pre-4.0.0b11) crashes the entire module load rather than gracefully falling back. |
| transformer_engine/pytorch/attention/dot_product_attention/utils.py | Replaces static head-dim guard with a live call to FA4's `_validate_head_dims`; adds SM100 cross-attention fallback for hd256; MLA workaround restructured as independent `if` checks; `v4_installation_steps` updated to 4.0.0b11. |
| tests/pytorch/attention/test_attention.py | Adds dedicated `test_dpa_fa4_hdim256` with explicit SM100/SM103 skip guard; removes stale cuDNN version checks from all FA4 tests. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_attention_backend called] --> B{use_flash_attention_4 and v4_is_installed and v4_validate_head_dims is not None?}
    B -- No --> Z[Skip FA4 head-dim check]
    B -- Yes --> C[Compute _fa4_alignment]
    C --> D[Call v4_validate_head_dims]
    D -- AssertionError --> E[Disable FA4]
    D -- OK --> F{SM100 AND hd256 AND seqlen_q != seqlen_kv?}
    F -- Yes --> G[Disable FA4 cross-attn hd256]
    F -- No --> H{Training AND MLA AND SM100?}
    H -- Yes --> I[gcd misalignment check]
    I -- Misaligned --> J[Disable FA4 MLA bwd]
    I -- OK --> K[FA4 enabled]
    H -- No --> K

Reviews (8): Last reviewed commit: "Merge branch 'main' into xiny/headdim256..."

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 · 2026-05-06T02:58:57Z

/te-ci pytorch L3

yaox12 · 2026-05-06T03:03:35Z

@vcherepanov-nv @KshitijLakhani Please review.

sudhakarsingh27 · 2026-05-08T15:46:02Z

+        # dV TMEM load atoms. When (tile_hdimv // 2) % dK_reduce_ncol != 0, dV reads are
+        # misaligned. The dedicated (256, 256) kernel uses its own tmem layout so it's
+        # not affected. See: flash_attn/cute/flash_bwd_sm100.py, line ~262 and ~3890.
+        if (


Should this still be checked when FlashAttentionUtils.v4_validate_head_dims == None?

yaox12 · 2026-05-11

I double-checked: this is a bug in FA4. The kernels produce wrong results on these shapes, but v4_validate_head_dims allows them, so we have to filter them out manually.
I raised an issue with FA4: Dao-AILab/flash-attention#2552
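
The condition quoted in the diff comment above can be written as a small predicate. This only illustrates the arithmetic: `dv_reads_misaligned` is a hypothetical name, the inputs are made-up values rather than actual FA4 tile sizes, and how `dK_reduce_ncol` is derived (the flowchart mentions a gcd) is not shown in this thread.

```python
def dv_reads_misaligned(tile_hdimv: int, dk_reduce_ncol: int) -> bool:
    """True when (tile_hdimv // 2) is not a multiple of dk_reduce_ncol."""
    return (tile_hdimv // 2) % dk_reduce_ncol != 0


# Illustrative inputs only, not real kernel parameters.
print(dv_reads_misaligned(128, 32))  # 64 % 32 == 0, so False
print(dv_reads_misaligned(192, 64))  # 96 % 64 == 32, so True
```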

vcherepanov-nv · 2026-05-08T23:10:39Z

LGTM

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 · 2026-05-11T05:36:51Z

/te-ci pytorch L3

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 · 2026-05-12T03:55:10Z

/te-ci pytorch L3

yaox12 · 2026-05-15T08:57:15Z

/te-ci pytorch L3

sudhakarsingh27

LGTM, pending CI

yaox12 · 2026-05-18T05:55:06Z

The B200 test failed with a one-element mismatch. It should be unrelated to this PR, since I have seen similar errors in other pipelines.

sudhakarsingh27 · 2026-05-18T23:50:04Z

~~Need to manually run L1 tests, triggering now~~
Doesn't look like it's needed

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 · 2026-05-24T13:33:49Z

/te-ci pytorch L3

sudhakarsingh27

LGTM

* enable head dim 256 for FA4
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* update CI, fix lint, resolve comments
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* resolve comments
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* update filter
  Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>

* enable head dim 256 for FA4
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* update CI, fix lint, resolve comments
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* resolve comments
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* update filter
  Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: yangfan.bai <yangfan.bai@shopee.com>

yaox12 marked this pull request as draft April 27, 2026 09:31

yaox12 force-pushed the xiny/headdim256_fa branch from bdcc02e to 3b3f7d0 Compare April 27, 2026 09:31

greptile-apps Bot reviewed Apr 27, 2026

Comment thread tests/pytorch/attention/test_attention.py Outdated

enable head dim 256 for FA4

3d0fcd7

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 force-pushed the xiny/headdim256_fa branch from 3b3f7d0 to 9a93156 Compare May 6, 2026 02:44

update CI, fix lint, resolve comments

8aa5242

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 force-pushed the xiny/headdim256_fa branch from ae74e44 to 8aa5242 Compare May 6, 2026 02:55

yaox12 marked this pull request as ready for review May 6, 2026 02:59

KshitijLakhani requested a review from mk-61 May 8, 2026 06:34

sudhakarsingh27 reviewed May 8, 2026

yaox12 added 2 commits May 10, 2026 22:28

resolve comments

ad00e76

Signed-off-by: Xin Yao <xiny@nvidia.com>

Merge branch 'main' into xiny/headdim256_fa

472c9dd

yaox12 added 2 commits May 12, 2026 10:30

Merge branch 'main' into xiny/headdim256_fa

3090b57

update filter

7e9faf1

Signed-off-by: Xin Yao <xiny@nvidia.com>

Merge branch 'main' into xiny/headdim256_fa

8fafa1f

yaox12 requested a review from cyanguwa as a code owner May 13, 2026 10:42

Merge branch 'main' into xiny/headdim256_fa

12806c3

sudhakarsingh27 previously approved these changes May 15, 2026

sudhakarsingh27 reviewed May 15, 2026

sudhakarsingh27 self-requested a review May 15, 2026 20:52

sudhakarsingh27 added the 2.16.0 label May 18, 2026

Merge branch 'main' into xiny/headdim256_fa

a7f66f6

Signed-off-by: Xin Yao <xiny@nvidia.com>

yaox12 dismissed sudhakarsingh27’s stale review via a7f66f6 May 24, 2026 13:33

github-actions Bot added the org-contribution label May 24, 2026

wplf mentioned this pull request May 27, 2026

[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example NVIDIA/Megatron-LM#4751

Merged

4 tasks

sudhakarsingh27 approved these changes May 27, 2026

sudhakarsingh27 merged commit 5f1eaff into NVIDIA:main May 27, 2026
25 of 27 checks passed

3 participants
