[ROCm][CI] Fix failing FP8 tests on RDNA4 #174873
mstankov-amd wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174873
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 3002082 with merge base 08b6f48. UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@mstankov-amd please fix the lint errors.
The lint errors are now fixed.
I don't understand why these cancelled jobs persist even when a rerun is requested. I will try rebasing against viable/strict to see if that clears the logjam.
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased d6bab24 to 3002082.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Cherry-pick of #3090
Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
## Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

## Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:

- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results
- Failures that occur only in `max-autotune` mode

**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.
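The failure rate cited above (92.4% NaN) can be quantified with a small diagnostic helper when comparing eager and compiled outputs. This is an illustrative sketch, not code from the PR; `nan_fraction` is a hypothetical name and operates on a flat list of floats:

```python
import math

def nan_fraction(values):
    """Return the fraction of entries that are NaN.

    Illustrative helper (not from the PR) for quantifying failures such as
    "92.4% NaN outputs" when comparing eager vs. compiled results.
    """
    if not values:
        return 0.0
    nans = sum(1 for v in values if math.isnan(v))
    return nans / len(values)
```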
## Solution

Added GPU-specific compile mode selection for small M values:

- gfx120x with M < 16: use `compile_mode="default"`
- All other cases: use `compile_mode="max-autotune"`

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
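The mode-selection logic described above could look roughly like the following. This is a hedged sketch, not the PR's actual code: `select_compile_mode` is a hypothetical helper, and the architecture string is assumed to come from something like `torch.cuda.get_device_properties(0).gcnArchName` on ROCm builds:

```python
def select_compile_mode(arch_name: str, m: int) -> str:
    """Pick a torch.compile mode for the FP8 scaled-mm tests.

    Hypothetical sketch of the workaround: on gfx120x (RDNA4) with M < 16,
    max-autotune can select Triton configs whose BLOCK_M exceeds M and
    mis-index loads/stores, so fall back to the default compile mode there.
    """
    if arch_name.startswith("gfx120") and m < 16:
        return "default"
    return "max-autotune"
```

For example, `select_compile_mode("gfx1200", 8)` returns `"default"`, while any other GPU/shape combination keeps `"max-autotune"`, so non-RDNA4 coverage is unchanged.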