Commit d667ffe
[ROCm][CI] Fix failing FP8 tests on RDNA4 (pytorch#174873)
## Summary
This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).
## Problem
On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:
- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results
- Only occurs in `max-autotune` mode
**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.
## Solution
- Added GPU-specific compile mode selection for small M values:
  - gfx120x with M < 16: use `compile_mode="default"`
  - All other cases: use `compile_mode="max-autotune"`
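
The selection rule above can be sketched as a small helper. This is only an illustration under stated assumptions, not the code from the PR: the names `SMALL_M_THRESHOLD`, `is_gfx120x`, and `select_compile_mode` are hypothetical, and the architecture string is assumed to look like the `gcnArchName` value reported on ROCm builds (for example `gfx1201`, possibly followed by `:`-separated feature flags).

```python
# Hypothetical sketch of the compile-mode selection described above.
# None of these names come from the PR; they are for illustration only.

SMALL_M_THRESHOLD = 16  # the PR description uses M < 16 as the cutoff


def is_gfx120x(gcn_arch_name: str) -> bool:
    """Return True if the architecture string names a gfx120x (RDNA4) GPU.

    Assumes the string may carry feature flags after a colon,
    e.g. "gfx942:sramecc+:xnack-", so only the base name is checked.
    """
    base = gcn_arch_name.split(":", 1)[0]
    return base.startswith("gfx120")


def select_compile_mode(gcn_arch_name: str, m: int) -> str:
    """Pick the compile mode for an FP8 matmul test with M dimension `m`."""
    if is_gfx120x(gcn_arch_name) and m < SMALL_M_THRESHOLD:
        # Small M on gfx120x: avoid the autotuned Triton kernels.
        return "default"
    # Every other GPU / shape combination keeps autotuning enabled.
    return "max-autotune"
```

In a test, the returned string would presumably be passed as the `mode` argument of `torch.compile`, so that only the affected GPU and shape combination falls back to the default mode while all other cases keep exercising `max-autotune`.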
Pull Request resolved: pytorch#174873
Approved by: https://github.com/jeffdaily
1 parent fc90fdf, commit d667ffe
1 file changed
Lines changed: 24 additions & 2 deletions
Two hunks, each adding 11 lines and replacing 1 line:
- `@@ -1034,9 +1034,20 @@`
- `@@ -1334,9 +1345,20 @@`