
Enable USE_FBGEMM_GENAI#4703

Closed
cthi wants to merge 1 commit into pytorch:main from cthi:export-D79564024

Conversation

@cthi
Contributor

@cthi cthi commented Aug 14, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.
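For readers unfamiliar with the grouped-GEMM layout this op targets, here is a pure-Python sketch (no torch dependency) of the "stacked 2D/3D" convention that kernels like `_scaled_grouped_mm` operate on: the A operand is one `(total_M, K)` matrix with all groups stacked row-wise, B is a `(G, K, N)` batch, and an offsets list gives the cumulative row boundary of each group. The function and variable names here are illustrative, not the kernel's actual API.

```python
def grouped_mm_reference(a, b, offs):
    """Reference grouped matmul.

    a:    total_M x K matrix (list of rows), all groups stacked along M
    b:    G x K x N batch of per-group weight matrices
    offs: cumulative row-end index of each group along M
    """
    out = []
    start = 0
    for g, end in enumerate(offs):
        for i in range(start, end):  # rows belonging to group g
            row = a[i]
            out.append([
                sum(row[k] * b[g][k][n] for k in range(len(row)))
                for n in range(len(b[g][0]))
            ])
        start = end
    return out


# Two groups of 2 rows each, K=2, N=2: rows 0-1 use b[0], rows 2-3 use b[1]
a = [[1, 0], [0, 1], [1, 1], [2, 0]]
b = [[[1, 2], [3, 4]],
     [[1, 0], [0, 1]]]
offs = [2, 4]
print(grouped_mm_reference(a, b, offs))  # [[1, 2], [3, 4], [1, 1], [2, 0]]
```

The real op additionally takes per-row/per-column FP8 scales and dequantizes the accumulator, but the group bookkeeping via cumulative offsets is the same.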

Differential Revision: D79564024

@netlify

netlify bot commented Aug 14, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit f612129
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68acd6f73b7f3200083bb30f
😎 Deploy Preview https://deploy-preview-4703--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Aug 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Test Plan:
Will ensure CI is green internally.

Ensured the op can be called by adding it to the fbgemm testing script:
```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```
```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```
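As a sanity check, the reported TFLOPS figures are consistent with the benchmark's shape arguments (M=128, N=2048, K=5120, groups=16): a grouped GEMM performs 2·M·N·K FLOPs per group, so dividing total FLOPs by the measured time should land near the logged value. The small gap versus the logged 594.5 comes from the log rounding the time to three decimals.

```python
# Recompute TFLOPS from the benchmark shapes and the logged time for
# ck_grouped_stacked_torch_2d3d (0.072 ms).
M, N, K, groups = 128, 2048, 5120, 16
flops = 2 * M * N * K * groups      # ~4.295e10 FLOPs across all groups
ms = 0.072                          # logged kernel time, rounded
tflops = flops / (ms * 1e-3) / 1e12
print(f"{tflops:.1f} TFLOPS")       # ~596.5, close to the logged 594.5
```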

Rollback Plan:

Differential Revision: D79564024
cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
@cthi cthi force-pushed the export-D79564024 branch from c8b2a87 to d7fccba Compare August 25, 2025 21:29
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 25, 2025
@cthi cthi force-pushed the export-D79564024 branch from d7fccba to f612129 Compare August 25, 2025 21:34
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 26, 2025
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Sep 2, 2025
@facebook-github-bot
Contributor

This pull request has been merged in a56882d.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Sep 4, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensured the build succeeds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)
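The build combinations above can be exercised with the corresponding environment variables; the variables are the real knobs from this PR, but the commented build steps below are only an assumed local-build sketch, not the CI recipe.

```shell
# Opt into the FBGEMM GenAI kernels (the flag this PR enables)
export USE_FBGEMM_GENAI=1
# Target arch list; use "gfx942" to build with it, or another arch to
# verify the non-gfx942 path still builds
export PYTORCH_ROCM_ARCH=gfx942
echo "building with USE_FBGEMM_GENAI=$USE_FBGEMM_GENAI for $PYTORCH_ROCM_ARCH"
# Actual build steps (slow; commented out here):
# python tools/amd_build/build_amd.py   # hipify sources for the ROCm build
# python setup.py develop
```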

Pull Request resolved: #160676
Approved by: https://github.com/drisspg
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025