Closed
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Contributor
This pull request was exported from Phabricator. Differential Revision: D79564024
cthi
added a commit
to cthi/pytorch
that referenced
this pull request
Aug 14, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To confirm the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
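The reported TFLOPS figures can be sanity-checked from the shapes in the benchmark command: a grouped GEMM with `groups` problems, each an M×K by K×N matmul, performs `2*M*N*K*groups` floating-point operations. A minimal sketch (the helper name is ours, not part of the benchmark script; small discrepancies against the printed TFLOPS are expected because the printed ms is rounded):

```python
def grouped_gemm_tflops(ms, M=128, N=2048, K=5120, groups=16):
    """Achieved TFLOPS for a grouped GEMM given its runtime in milliseconds.

    Each of the `groups` problems is an (M x K) @ (K x N) matmul, costing
    2*M*N*K FLOPs (one multiply and one add per inner-product term).
    """
    flops = 2 * M * N * K * groups   # total floating-point operations
    seconds = ms / 1e3               # milliseconds -> seconds
    return flops / seconds / 1e12    # scale to tera-FLOP/s

# The 0.167 ms ck_rowwise_grouped run lands near its reported 257.254 TFLOPS.
print(grouped_gemm_tflops(0.167))
```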
cthi
added a commit
to cthi/pytorch
that referenced
this pull request
Aug 14, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To confirm the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
Contributor
This pull request was exported from Phabricator. Differential Revision: D79564024
cthi
added a commit
to cthi/FBGEMM-1
that referenced
this pull request
Aug 25, 2025
Summary:
X-link: pytorch/pytorch#160676
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Differential Revision: D79564024
pytorch-bot bot
pushed a commit
to pytorch/pytorch
that referenced
this pull request
Aug 25, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To confirm the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
Contributor
This pull request was exported from Phabricator. Differential Revision: D79564024
pytorch-bot bot
pushed a commit
to pytorch/pytorch
that referenced
this pull request
Aug 26, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To confirm the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
pytorch-bot bot
pushed a commit
to pytorch/pytorch
that referenced
this pull request
Sep 2, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To confirm the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
Contributor
This pull request has been merged in a56882d.
pytorchmergebot
pushed a commit
to pytorch/pytorch
that referenced
this pull request
Sep 4, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan: Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: #160676
Approved by: https://github.com/drisspg
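The build matrix in the test plan revolves around two environment variables. As a hedged sketch of a from-source ROCm build matching the validated configuration (the env vars are the ones named in the test plan; the surrounding build commands are the standard PyTorch from-source steps, shown only for context):

```
# Build PyTorch with the FBGEMM GenAI kernels enabled, targeting only
# gfx942 (the architecture validated in this diff).
export USE_FBGEMM_GENAI=1
export PYTORCH_ROCM_ARCH=gfx942   # omit or widen to cover the other test-plan configurations
python tools/amd_build/build_amd.py   # hipify step for ROCm builds
python setup.py develop
```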
markc-614
pushed a commit
to markc-614/pytorch
that referenced
this pull request
Sep 17, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan: Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg
mansiag05
pushed a commit
to mansiag05/pytorch
that referenced
this pull request
Sep 22, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan: Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg
dsashidh
pushed a commit
to dsashidh/pytorch
that referenced
this pull request
Sep 26, 2025
Summary:
X-link: pytorch/FBGEMM#4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan: Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728
In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.
Differential Revision: D79564024
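The op being enabled computes, per group, a matmul of FP8 inputs whose result is rescaled by per-row scales of A and per-column scales of B (rowwise scaling). A pure-Python sketch of those semantics follows; the function name, argument layout, and offsets convention here are illustrative, not the exact `torch._scaled_grouped_mm` signature, and quantization is simulated with plain floats:

```python
def scaled_grouped_mm_reference(a, b_groups, scale_a, scale_b_groups, offs):
    """Reference semantics for a rowwise-scaled grouped matmul.

    a:              (total_M x K) matrix, rows of all groups stacked together
    b_groups:       list of G matrices, each (K x N)
    scale_a:        one scale per row of `a`
    scale_b_groups: list of G per-column scale vectors
    offs:           cumulative row offsets; group g owns rows offs[g]:offs[g+1]
    """
    out = []
    for g, b in enumerate(b_groups):
        for i in range(offs[g], offs[g + 1]):
            row = []
            for j in range(len(b[0])):
                acc = sum(a[i][k] * b[k][j] for k in range(len(b)))
                # rowwise scaling: each output element is rescaled by the
                # row scale of A and the column scale of this group's B
                row.append(acc * scale_a[i] * scale_b_groups[g][j])
            out.append(row)
    return out

# Two groups with K=2, N=2; group 0 owns row 0 of `a`, group 1 owns row 1.
a = [[1.0, 2.0], [3.0, 4.0]]
b_groups = [[[1.0, 0.0], [0.0, 1.0]],   # identity for group 0
            [[2.0, 0.0], [0.0, 2.0]]]   # 2x identity for group 1
out = scaled_grouped_mm_reference(a, b_groups,
                                  scale_a=[0.5, 1.0],
                                  scale_b_groups=[[1.0, 1.0], [1.0, 1.0]],
                                  offs=[0, 1, 2])
print(out)  # -> [[0.5, 1.0], [6.0, 8.0]]
```

The real kernel fuses the per-group matmuls and the rescaling into one launch over FP8 data; this sketch only pins down what the output should be.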