fix float8 rowwise inference perf with torch.compile by vkuzo · Pull Request #2672 · pytorch/ao

vkuzo · 2025-08-04T14:45:08Z

#2379 added logic that prevented TorchInductor from fusing the activation quantization for float8 inference. These logs show the extra kernels that PR added to float8 inference on NVIDIA GPUs: https://www.internalfb.com/phabricator/paste/view/P1891592748

This PR reverts most of #2379 and adds a test to ensure we see the expected number of GPU kernels for float8 tensorwise and rowwise quantization. We'll have to redo #2379 without breaking this test.
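As an illustration of the kind of check described above, here is a minimal sketch that counts Triton kernel launch sites in a generated-code string. It is not the test added in this PR. The helper names `count_triton_launches` and `assert_kernel_count` are hypothetical, the sample strings are written by hand, and the assumption that launches appear as `triton_<name>.run(...)` reflects an Inductor internal that may change between PyTorch versions. In practice the string would come from Inductor itself, for example via a helper such as `torch._inductor.utils.run_and_get_code`, if available in your PyTorch version.

```python
import re

# Assumption: Inductor-generated wrapper code launches each Triton kernel with
# a call of the form `triton_<name>.run(...)`. This is an internal detail and
# may differ across PyTorch versions.
_TRITON_LAUNCH = re.compile(r"\btriton_\w+\.run\(")


def count_triton_launches(generated_code: str) -> int:
    """Count Triton kernel launch sites in a generated-code string."""
    return len(_TRITON_LAUNCH.findall(generated_code))


def assert_kernel_count(generated_code: str, expected: int) -> None:
    """Raise if the number of launch sites differs from the expected value."""
    actual = count_triton_launches(generated_code)
    if actual != expected:
        raise AssertionError(
            f"expected {expected} Triton launches, found {actual}"
        )


# Hand-written, hypothetical snippets (not real Inductor output).
# `fused`: quantization handled by a single kernel before the matmul.
fused = """
triton_example_quant_0.run(x, x_fp8, x_scale)
# float8 matmul call would follow here
"""
# `unfused`: quantization split across two kernels before the matmul.
unfused = """
triton_example_amax_0.run(x, x_scale)
triton_example_cast_1.run(x, x_scale, x_fp8)
# float8 matmul call would follow here
"""

print(count_triton_launches(fused))    # 1
print(count_triton_launches(unfused))  # 2
assert_kernel_count(fused, expected=1)
```

Pinning an exact count makes a fusion regression fail the test instead of showing up later as a slowdown.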

Perf impact of this PR for M, K, N == 1024, 2048, 4096 on an NVIDIA H100, float8 rowwise inference:

torch.compile default: 50.9us -> 24.2us (2.1x speedup)
torch.compile reduce-overhead: 66.4us -> 41.5us (1.6x speedup)
logs: https://gist.github.com/vkuzo/1d35bea7fa5d38040d4d14d95627ce7f
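The speedup figures above are the ratio of the before and after timings. A short sketch of that arithmetic, using only the numbers reported in this description (the `speedup` helper is a hypothetical name for illustration):

```python
def speedup(before_us: float, after_us: float) -> float:
    """Speedup factor: how many times faster the 'after' timing is."""
    return before_us / after_us


# (before, after) timings in microseconds, as reported above.
timings_us = {
    "default": (50.9, 24.2),
    "reduce-overhead": (66.4, 41.5),
}

for mode, (before, after) in timings_us.items():
    print(f"torch.compile {mode}: {speedup(before, after):.1f}x")
# torch.compile default: 2.1x
# torch.compile reduce-overhead: 1.6x
```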

Note that I added a benchmark to benchmarks/inference/bench_float8_inference.py to reproduce the numbers above, but I ran this benchmark out-of-tree to get the actual numbers, for easier comparison of before-this-PR vs after-this-PR.
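For readers who want to set up a similar before/after comparison, here is a minimal, generic timing sketch. It is not the benchmark script added in this PR, and `bench_us` is a hypothetical helper. It measures host wall-clock time only, so for GPU work the callable itself must block until the device has finished (for example by calling `torch.cuda.synchronize()` inside it), because CUDA kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def bench_us(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Return the median wall-clock time of fn() in microseconds.

    Warmup calls are discarded. With torch.compile this matters because the
    first call triggers compilation and would otherwise dominate the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


# Usage with a CPU-only stand-in workload; swap in the compiled model call
# (plus a device synchronize) to time GPU inference.
elapsed = bench_us(lambda: sum(range(10_000)), warmup=3, iters=20)
print(f"median: {elapsed:.1f}us")
```

Running the same function against two checkouts (before and after a change) and comparing the medians gives the kind of comparison reported above.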

Test Plan:

TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 pytest test/dtypes/test_affine_quantized_float.py -s -k expected_kernels_on_gpu

pytorch-bot · 2025-08-04T14:45:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2672

❌ 2 New Failures, 2 Pending

As of commit 8c23d32 with merge base 7dbc816:

NEW FAILURES - The following jobs have failed:

Run Regression Tests / test-nightly (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/wh... / linux-job (gh)
test/test_low_bit_optim.py::TestFSDP2::test_uneven_shard
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
test/test_low_bit_optim.py::TestFSDP2::test_uneven_shard

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168

Thanks for the fix. I feel this might be related to the fbgemm benchmark regression as well.

meta-cla Bot added the CLA Signed label Aug 4, 2025

vkuzo added the topic: bug fix label Aug 4, 2025

vkuzo force-pushed the 20250804_fix_float8_rowwise_compile branch from 3e9752c to 8c23d32 August 4, 2025 15:14

vkuzo requested review from drisspg and jerryzh168 August 4, 2025 15:15

jerryzh168 approved these changes Aug 4, 2025

vkuzo merged commit 6a74e34 into main Aug 4, 2025
17 of 20 checks passed

jerryzh168 mentioned this pull request Aug 5, 2025

[Inductor][float8] Support qlinear for float8 in inductor #2565

Merged

LevelDownRefine mentioned this pull request Aug 28, 2025

[CPU][FP8][Inductor] How to support fp8 quant for inductor on CPU #2896

Closed

LevelDownRefine mentioned this pull request Sep 9, 2025

[Float8] add non-decomposed version of quantize/dequantize ops for fp8 #2961

Merged
