[release/2.7] Enable mx fp8 support on ROCm #2199

Merged
pruthvistony merged 4 commits into release/2.7 from rel_2.7_mx_f8 on Jun 4, 2025
Conversation

@jagadish-amd commented May 28, 2025

Ported the mx fp8 support from #2046.

Current test stats (counting only the blockwise-scale tests):
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

Ran 225 tests in 8.256s
FAILED (failures=1, skipped=150)
74 tests passed

Sample mx fp8 test case:
test_blockwise_mxfp8_numerics_test_case_name_data_random_scales_one_fast_accum_True_512_128_256_cuda (__main__.TestFP8MatmulCudaCUDA)
hipblaslt-bench --api_method c -m 256 -n 512 -k 128 --lda 128 --ldb 128 --ldc 256 --ldd 256 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0 --alpha 1 --beta 0 --transA T --transB N --batch_count 1 --scaleA 3 --scaleB 3 --a_type f8_r --b_type f8_r --c_type bf16_r --d_type bf16_r --compute_type f32_r --algo_method index --solution_index -2146957310 --rotating 0 --cold_iters 0 --iters 0
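For context on the shapes involved: MX formats (per the OCP Microscaling spec) pair fp8 element data with one shared e8m0 scale per 32-element block along the reduction dimension. A minimal sketch of that scale-tensor bookkeeping for the sample case above (M=512, K=128, N=256), assuming the standard 1x32 block convention; this is illustrative arithmetic only, not code from this PR:

```python
# Sketch (not from this PR): per-block scale-tensor shapes for an MX fp8
# matmul like the sample test above. Assumes the OCP MX convention of one
# shared e8m0 scale per 32-element block along the reduction (K) dimension.

BLOCK = 32  # MX block size: 32 fp8 elements share one e8m0 scale


def mx_scale_shape(rows: int, k: int, block: int = BLOCK) -> tuple:
    """Shape of the per-block scale tensor for a (rows, k) operand."""
    assert k % block == 0, "K must be divisible by the MX block size"
    return (rows, k // block)


# Sample test case above: A is 512x128, B yields a 512x256 output (K=128)
M, K, N = 512, 128, 256
scale_a = mx_scale_shape(M, K)  # scales for A, blocked along K
scale_b = mx_scale_shape(N, K)  # scales for B, also blocked along K
print(scale_a, scale_b)  # (512, 4) (256, 4)
```

With K=128 each row of an operand carries 128/32 = 4 scales, which is why blockwise-scale tests constrain K to multiples of 32.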

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
@jagadish-amd (Author) commented May 28, 2025

release/2.7 does not support mx fp4, so this PR adds the ROCm support needed to enable mx fp8.
cc @pruthvistony

@jagadish-amd changed the title from "Enable mx f8 support on ROCm" to "Enable mx fp8 support on ROCm" on May 28, 2025
@jagadish-amd changed the title from "Enable mx fp8 support on ROCm" to "[release/2.7] Enable mx fp8 support on ROCm" on May 30, 2025
@pruthvistony pruthvistony merged commit d17e222 into release/2.7 Jun 4, 2025
@pruthvistony pruthvistony deleted the rel_2.7_mx_f8 branch June 4, 2025 16:40
pragupta pushed a commit that referenced this pull request Jul 21, 2025
(cherry picked from commit d17e222)
pragupta pushed a commit to pragupta/pytorch that referenced this pull request Jul 21, 2025
(cherry picked from commit d17e222)
pragupta pushed a commit that referenced this pull request Jul 22, 2025
(cherry picked from commit d17e222)
jithunnair-amd pushed a commit that referenced this pull request Jul 22, 2025
(cherry picked from commit d17e222)
pragupta pushed a commit that referenced this pull request Jul 29, 2025
(cherry picked from commit d17e222)