
Enable USE_FBGEMM_GENAI#4703

Closed
cthi wants to merge 1 commit into pytorch:main from cthi:export-D79564024

Conversation

@cthi
Contributor

@cthi cthi commented Aug 14, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.
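For readers unfamiliar with the grouped-GEMM layout this op targets, here is a pure-Python sketch (no torch dependency) of the "stacked 2D/3D" convention that kernels like `_scaled_grouped_mm` operate on: the A operand is one `(total_M, K)` matrix with all groups stacked row-wise, B is a `(G, K, N)` batch, and an offsets list gives the cumulative row boundary of each group. The function and variable names here are illustrative, not the kernel's actual API.

```python
def grouped_mm_reference(a, b, offs):
    """Reference grouped matmul.

    a:    total_M x K matrix (list of rows), all groups stacked along M
    b:    G x K x N batch of per-group weight matrices
    offs: cumulative row-end index of each group along M
    """
    out = []
    start = 0
    for g, end in enumerate(offs):
        for i in range(start, end):  # rows belonging to group g
            row = a[i]
            out.append([
                sum(row[k] * b[g][k][n] for k in range(len(row)))
                for n in range(len(b[g][0]))
            ])
        start = end
    return out


# Two groups of 2 rows each, K=2, N=2: rows 0-1 use b[0], rows 2-3 use b[1]
a = [[1, 0], [0, 1], [1, 1], [2, 0]]
b = [[[1, 2], [3, 4]],
     [[1, 0], [0, 1]]]
offs = [2, 4]
print(grouped_mm_reference(a, b, offs))  # [[1, 2], [3, 4], [1, 1], [2, 0]]
```

The real op additionally takes per-row/per-column FP8 scales and dequantizes the accumulator, but the group bookkeeping via cumulative offsets is the same.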

Differential Revision: D79564024

@netlify

netlify bot commented Aug 14, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit f612129
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68acd6f73b7f3200083bb30f
😎 Deploy Preview https://deploy-preview-4703--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Aug 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Test Plan:
Will ensure CI is green internally.

Ensured the op can be called by adding it to the fbgemm testing script:
```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```
```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```
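As a sanity check, the reported TFLOPS figures are consistent with the benchmark's shape arguments (M=128, N=2048, K=5120, groups=16): a grouped GEMM performs 2·M·N·K FLOPs per group, so dividing total FLOPs by the measured time should land near the logged value. The small gap versus the logged 594.5 comes from the log rounding the time to three decimals.

```python
# Recompute TFLOPS from the benchmark shapes and the logged time for
# ck_grouped_stacked_torch_2d3d (0.072 ms).
M, N, K, groups = 128, 2048, 5120, 16
flops = 2 * M * N * K * groups      # ~4.295e10 FLOPs across all groups
ms = 0.072                          # logged kernel time, rounded
tflops = flops / (ms * 1e-3) / 1e12
print(f"{tflops:.1f} TFLOPS")       # ~596.5, close to the logged 594.5
```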

Rollback Plan:

Differential Revision: D79564024
cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
@cthi cthi force-pushed the export-D79564024 branch from c8b2a87 to d7fccba Compare August 25, 2025 21:29
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 25, 2025
@cthi cthi force-pushed the export-D79564024 branch from d7fccba to f612129 Compare August 25, 2025 21:34
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 26, 2025
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Sep 2, 2025
@facebook-github-bot
Contributor

This pull request has been merged in a56882d.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Sep 4, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensured the build succeeds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)
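The build combinations above can be exercised with the corresponding environment variables; the variables are the real knobs from this PR, but the commented build steps below are only an assumed local-build sketch, not the CI recipe.

```shell
# Opt into the FBGEMM GenAI kernels (the flag this PR enables)
export USE_FBGEMM_GENAI=1
# Target arch list; use "gfx942" to build with it, or another arch to
# verify the non-gfx942 path still builds
export PYTORCH_ROCM_ARCH=gfx942
echo "building with USE_FBGEMM_GENAI=$USE_FBGEMM_GENAI for $PYTORCH_ROCM_ARCH"
# Actual build steps (slow; commented out here):
# python tools/amd_build/build_amd.py   # hipify sources for the ROCm build
# python setup.py develop
```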

Pull Request resolved: #160676
Approved by: https://github.com/drisspg
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025