Enable blockwise FP8 training kernels on AMD GPUs (MI300/MI350) by brucechanglongxu · Pull Request #3996 · pytorch/ao

brucechanglongxu · 2026-03-04T23:16:30Z

Replace hardcoded FP8 e4m3fn max (448.0) with a parameterized FP8_MAX derived from torch.finfo(dtype).max in all 5 blockwise quantization Triton kernels, their Python wrapper functions, and the 3 PyTorch reference implementations. This allows the kernels to operate with both float8_e4m3fn (NVIDIA, max=448) and float8_e4m3fnuz (AMD MI300, max=240).

kernels.py:

Add FP8_MAX as a tl.constexpr parameter to triton_fp8_blockwise_act_quant_lhs_kernel, triton_fp8_blockwise_act_quant_rhs_kernel, triton_fp8_blockwise_act_quant_transposed_lhs_kernel, triton_fp8_blockwise_weight_quant_rhs_kernel, and triton_fp8_blockwise_weight_quant_transposed_rhs_kernel. Each kernel previously had max_fp8_e4m3 = 448.0 / min_fp8_e4m3 = -448.0 inline; these are replaced with the passed-in FP8_MAX and -FP8_MAX.
In the 5 wrapper functions, compute fp8_max = torch.finfo(dtype).max and forward it to the kernel call. Widen the dtype assertion from [torch.float8_e4m3fn] to {torch.float8_e4m3fn, torch.float8_e4m3fnuz}. Default dtype parameter changed from torch.float8_e4m3fn to the platform-aware e4m3_dtype (from torchao.float8.config).
In the 3 reference implementations (torch_blockwise_scale_act_quant_lhs, torch_blockwise_scale_act_quant_rhs, torch_blockwise_scale_weight_quant), replace hardcoded torch.finfo(torch.float8_e4m3fn) with torch.finfo(dtype) and cast outputs to the passed dtype instead of torch.float8_e4m3fn.
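The idea behind the parameterized FP8_MAX can be sketched without a GPU. The snippet below is a minimal, torch-free illustration and not the PR's code: the lookup table stands in for torch.finfo(dtype).max (using the two maxima quoted above), and quantize_blockwise, block_size, and eps are hypothetical names and values chosen for the example. It scales each block so that its largest magnitude maps to the dtype's maximum, clamps to [-FP8_MAX, FP8_MAX], and returns the reciprocal scales; the final rounding cast to FP8 is omitted.

```python
# Max finite magnitudes quoted in the PR description; the real code
# derives these from torch.finfo(dtype).max instead of a table.
FP8_MAX_BY_DTYPE = {
    "float8_e4m3fn": 448.0,    # NVIDIA
    "float8_e4m3fnuz": 240.0,  # AMD MI300
}


def quantize_blockwise(values, dtype_name, block_size=4, eps=1e-12):
    """Scale each block so its largest magnitude maps to the dtype's max.

    Returns (scaled_values, reciprocal_scales). The rounding cast to FP8
    is intentionally left out; block_size and eps are illustrative only.
    """
    if dtype_name not in FP8_MAX_BY_DTYPE:
        raise ValueError(f"unsupported dtype: {dtype_name}")
    fp8_max = FP8_MAX_BY_DTYPE[dtype_name]

    scaled, reciprocal_scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Guard against division by zero for all-zero blocks.
        scale = fp8_max / max(amax, eps)
        # Clamp to the representable range of the selected dtype.
        scaled.extend(min(max(v * scale, -fp8_max), fp8_max) for v in block)
        reciprocal_scales.append(1.0 / scale)
    return scaled, reciprocal_scales


data = [1.0, -2.0, 0.5, 0.0, 4.0, -1.0, 2.0, 0.25]
q_amd, _ = quantize_blockwise(data, "float8_e4m3fnuz")
q_nv, _ = quantize_blockwise(data, "float8_e4m3fn")
print(q_amd[:4])  # [120.0, -240.0, 60.0, 0.0]
print(q_nv[:4])   # [224.0, -448.0, 112.0, 0.0]
```

The same input lands on a different scaled range per dtype, which is why a hardcoded 448.0 would clamp incorrectly (or overflow) for float8_e4m3fnuz.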

test_blockwise_kernels.py:

Replace the is_sm_at_least_90() capability gate with is_sm_at_least_90() or is_MI300() or is_MI350() across all 7 tests.
Replace hardcoded torch.float8_e4m3fn parametrize values with e4m3_dtype.
Remove @skip_if_rocm decorators from the 5 quantization kernel tests.
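As a rough sketch of the test gating described above: the real tests call is_sm_at_least_90(), is_MI300(), and is_MI350() and import e4m3_dtype from torchao.float8.config. The two functions below are hypothetical stand-ins that take plain booleans so the logic can run without a GPU. The MI300 mapping to float8_e4m3fnuz follows the PR description; what e4m3_dtype resolves to on MI350 is not stated here, so the sketch assumes the default.

```python
def should_run_blockwise_tests(sm90_or_newer: bool, mi300: bool, mi350: bool) -> bool:
    """Hypothetical stand-in for the widened capability gate."""
    return sm90_or_newer or mi300 or mi350


def select_e4m3_dtype_name(mi300: bool) -> str:
    """Hypothetical stand-in for the platform-aware e4m3_dtype.

    Assumption: only MI300 switches to the fnuz variant; every other
    platform keeps float8_e4m3fn.
    """
    return "float8_e4m3fnuz" if mi300 else "float8_e4m3fn"


# An MI300 machine now runs the tests and parametrizes them with fnuz.
print(should_run_blockwise_tests(False, True, False))  # True
print(select_e4m3_dtype_name(mi300=True))              # float8_e4m3fnuz
```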

Benchmark Results (AMD Instinct MI300X)

Environment: PyTorch 2.9.1+rocm7.2.0, Triton 3.5.1+rocm7.2.0, single MI300X GPU

Correctness

All 9 test configurations produce bit-identical results between old (per-expert loop) and new (grouped GEMM kernel) paths (max_diff = 0.0).

GEMM Kernel Only: per-expert loop (old) vs grouped kernel (new)

E	M	K	N	Old (ms)	New (ms)	Speedup	Old TFLOPS	New TFLOPS
8	2048	1024	1024	2.503	0.227	11.03x	13.7	151.4
8	4096	2048	2048	2.798	0.817	3.42x	98.2	336.3
8	4096	4096	4096	7.026	4.139	1.70x	156.5	265.7
8	8192	4096	4096	11.157	9.026	1.24x	197.1	243.6
16	4096	2048	2048	5.149	0.794	6.49x	106.8	692.6
16	8192	4096	4096	13.708	7.461	1.84x	320.8	589.4
8	16384	4096	4096	21.724	18.096	1.20x	202.5	243.0
8	4096	5120	5120	12.629	5.693	2.22x	136.0	301.8
8	16640	5120	8192	55.225	40.689	1.36x	202.2	274.4
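The Speedup and TFLOPS columns can be reproduced from the timings. The sketch below assumes a FLOP count of 2·E·M·K·N per grouped GEMM; that formula is inferred from the numbers in the table rather than stated in the PR, and the function names are illustrative.

```python
def grouped_gemm_tflops(e, m, k, n, time_ms, gemm_count=1):
    """TFLOPS for `gemm_count` grouped GEMMs of shape (E, M, K, N).

    Assumes 2*E*M*K*N floating-point operations per GEMM (inferred
    from the table, not taken from the PR's benchmark script).
    """
    flops = 2 * e * m * k * n * gemm_count
    return flops / (time_ms * 1e-3) / 1e12


def speedup(old_ms, new_ms):
    """How many times faster the new path is than the old one."""
    return old_ms / new_ms


# First row of the table: E=8, M=2048, K=1024, N=1024.
print(round(grouped_gemm_tflops(8, 2048, 1024, 1024, 2.503), 1))  # 13.7
print(round(grouped_gemm_tflops(8, 2048, 1024, 1024, 0.227), 1))  # 151.4
print(round(speedup(2.503, 0.227), 2))                            # 11.03
```

With gemm_count=3 (one forward GEMM plus two backward GEMMs, again an assumption), the same function matches the Forward+Backward TFLOPS figures, for example 24.8 for the 4.149 ms row.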

Full Forward: old Triton vs new Triton vs BF16 baseline

E	M	K	N	Old (ms)	New (ms)	BF16 (ms)	Speedup vs Old	Speedup vs BF16
8	2048	1024	1024	2.150	0.353	0.430	6.09x	1.22x
8	4096	2048	2048	2.968	1.118	0.480	2.66x	0.43x
8	4096	4096	4096	5.093	4.070	0.828	1.25x	0.20x
8	8192	4096	4096	8.105	7.700	2.167	1.05x	0.28x
16	4096	2048	2048	10.398	1.714	1.512	6.07x	0.88x
16	8192	4096	4096	19.715	8.302	1.832	2.37x	0.22x
8	16384	4096	4096	19.117	15.760	5.514	1.21x	0.35x
8	4096	5120	5120	15.924	8.565	2.432	1.86x	0.28x
8	16640	5120	8192	47.771	32.276	4.383	1.48x	0.14x

Forward+Backward (new Triton path, end-to-end)

E	M	K	N	Fwd+Bwd (ms)	TFLOPS
8	2048	1024	1024	4.149	24.8
8	4096	2048	2048	5.932	139.0
8	4096	4096	4096	15.226	216.6
8	8192	4096	4096	29.026	227.3
16	4096	2048	2048	15.334	107.6
16	8192	4096	4096	33.078	398.9

Key takeaways:

The new grouped GEMM kernel provides 1.2x-11x speedup over the old per-expert loop at the kernel level, with the largest gains on workloads with many experts and smaller per-expert M.
End-to-end forward speedup is 1.05x-6.1x over the old path (quantization overhead is now the dominant cost at larger sizes).
The Triton FP8 GEMM forward path is currently slower than BF16 rocBLAS in 8 of the 9 configurations (0.14x-0.88x) because it does not yet use hardware FP8 matrix cores; the benefit comes from reduced memory traffic, which will matter more at scale.

cc: @BowenBao

Replace hardcoded FP8 e4m3fn max (448.0) with a parameterized FP8_MAX derived from torch.finfo(dtype).max in all 5 Triton JIT kernels, their wrapper functions, and the 3 PyTorch reference implementations. This allows the kernels to work with both float8_e4m3fn (NVIDIA, max=448) and float8_e4m3fnuz (AMD MI300, max=240). Update test capability gates from is_sm_at_least_90() to also accept MI300/MI350, and replace hardcoded torch.float8_e4m3fn test parameters with the platform-aware e4m3_dtype from torchao.float8.config.

pytorch-bot · 2026-03-04T23:16:34Z

✅ No Failures

As of commit 66086bd with merge base f04500f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2026-03-05T00:34:07Z

looks good @brucechanglongxu please fix the linter issue though, thanks!

Remove stray blank line between third-party and first-party imports to satisfy ruff's import block formatting rules.

danielvegamyhre · 2026-03-07T06:54:32Z

    EPS: tl.constexpr,
+    FP8_MAX: tl.constexpr,
 ):
-    """


can you add this docstring back, looks like it may have been deleted by accident?

danielvegamyhre

LGTM, just one minor comment to address

brucechanglongxu · 2026-03-10T05:14:41Z

@danielvegamyhre This PR is approved and all CI checks are passing. Could you merge when you get a chance? Thanks!

meta-cla Bot added the CLA Signed label Mar 4, 2026

danielvegamyhre self-requested a review March 4, 2026 23:19

danielvegamyhre added the module: training label Mar 4, 2026

Fix ruff I001 import sorting lint in test_blockwise_kernels.py

dd0c705

danielvegamyhre reviewed Mar 7, 2026

danielvegamyhre approved these changes Mar 7, 2026

Restore deleted docstring on triton_fp8_blockwise_weight_quant_transposed_rhs_kernel

66086bd

brucechanglongxu mentioned this pull request Mar 10, 2026

Enable blockwise FP8 dense training kernels on ROCm #4036

Open

danielvegamyhre merged commit 629e25d into pytorch:main Mar 10, 2026
19 checks passed

brucechanglongxu mentioned this pull request Mar 11, 2026

[ROCm] Fix ROCm CI failures #4061

Merged

danielvegamyhre merged 3 commits into pytorch:main from brucechanglongxu:rocm-blockwise-fp8-enablement
