[Quant] add FP8 support in quantize ops by LevelDownRefine · Pull Request #153601 · pytorch/pytorch

LevelDownRefine · 2025-05-15T08:14:18Z

Quantize ops were previously used only for integer dtypes.
We now want to use them for FP8 as well.

This patch decides whether to round based on the target dtype.
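
As a rough illustration of what "round based on the target dtype" could mean, here is a minimal pure-Python sketch. It is not the code from this PR: `quantize_scalar`, `INT_RANGES`, and `FP8_RANGES` are hypothetical names, and the sketch assumes that integer targets need an explicit round while FP8 targets only need scaling and clamping, because the final cast to FP8 would do the rounding to a representable value.

```python
# Illustrative only: names and structure are hypothetical, not taken from the PR diff.
INT_RANGES = {"int8": (-128, 127), "uint8": (0, 255)}
# Largest finite magnitudes of the two common FP8 formats.
FP8_RANGES = {"float8_e4m3fn": (-448.0, 448.0), "float8_e5m2": (-57344.0, 57344.0)}


def quantize_scalar(x, scale, zero_point, dtype):
    """Scale one value, round it only for integer targets, then clamp it."""
    scaled = x / scale
    if dtype in INT_RANGES:
        lo, hi = INT_RANGES[dtype]
        # Integer targets: round to the nearest integer before clamping.
        return max(lo, min(hi, round(scaled) + zero_point))
    if dtype in FP8_RANGES:
        lo, hi = FP8_RANGES[dtype]
        # FP8 targets: skip the explicit round and only clamp to the finite range.
        # A real implementation would then cast the result to the FP8 dtype.
        return max(lo, min(hi, scaled + zero_point))
    raise ValueError(f"unsupported target dtype: {dtype}")


print(quantize_scalar(2.6, 1.0, 0, "int8"))
print(quantize_scalar(2.6, 1.0, 0, "float8_e4m3fn"))
```

Rounding before an FP8 cast would discard the fractional part that FP8 can still represent, which is presumably why the rounding step has to depend on the dtype.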

cc @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-05-15T08:14:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153601

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a88490a with merge base fe285b9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen

My suggestion on PR title: [Quant] add FP8 support in quantize ops

jerryzh168 · 2025-05-16T17:21:44Z

@@ -3,6 +3,7 @@
 from typing import Optional


We are deprecating these ops. Can you add this to torchao instead? It may already be supported there:

https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_primitives.py

Xia-Weiwen · 2025-05-17

Yeah. Looks like there isn't such an issue with the ops in Torchao.

LevelDownRefine · 2025-05-22T06:22:57Z

> My suggestion on PR title: [Quant] add FP8 support in quantize ops

Done

To support pytorch/ao#2228:

> What we want to do now is to enable FP8 quantization in PyTorch. As with INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However, we ran into problems with these q/dq ops in both PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via #153601. And as you commented, the op is deprecated.
>
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> https://github.com/pytorch/pytorch/blob/100ec0b34aeff2b948dae33937857d0c86cf1646/torch/_inductor/fx_passes/freezing_patterns.py#L69C1-L74C1
>
> After constant_fold(gm), the pattern is folded into fp32_weight -> fp32_op. The original pattern can then no longer be found, and the FP8 semantics are lost because the pattern is now entirely in fp32.
>
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern is not folded because we mark quantized_decomposed.dequantize_per_channel as impure: https://github.com/pytorch/pytorch/blob/100ec0b34aeff2b948dae33937857d0c86cf1646/torch/_inductor/constant_folding.py#L139C1-L149C1 . But for torchao.dequantize_affine_float8, we cannot do this because:
>
> It is an op from Torchao, which is unknown to the constant folder.
>
> It is decomposed into smaller ops, so we cannot put it in the list as a single op.
>
> So we think an easy, short-term solution is to modify the ops in PyTorch core via #153601.
>
> However, if we want to resolve the issue with Torchao, we need to:
>
> Add a method in the constant folder in Inductor to allow registration of impure ops.

Based on [Jansel's reply](pytorch/ao#2228 (comment)), this patch adds a "don't constant fold" flag.

Pull Request resolved: #154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
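
The constant-folding problem described in the commit message can be sketched with a toy graph. This is not Inductor's constant folder: `Node`, `constant_fold`, `build_graph`, and the `dont_fold` field are hypothetical stand-ins. The only assumption is that a folder replaces any node whose inputs are all constants, unless that node has been opted out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str                                   # "const", "input", or an operator name
    inputs: list = field(default_factory=list)
    dont_fold: bool = False                   # hypothetical per-node opt-out flag


def constant_fold(nodes):
    """Return the names of nodes a toy folder would replace with constants.

    Nodes are assumed to be listed in topological order.
    """
    is_const = {}
    folded = set()
    for node in nodes:
        if node.op == "const":
            is_const[node.name] = True
        elif node.op == "input" or node.dont_fold or not node.inputs:
            is_const[node.name] = False
        else:
            is_const[node.name] = all(is_const[i] for i in node.inputs)
            if is_const[node.name]:
                folded.add(node.name)
    return folded


def build_graph(protect_dequant):
    """Build fp8_weight -> dequantize -> fp32_op, with an activation input x."""
    return [
        Node("fp8_weight", "const"),
        Node("x", "input"),
        Node("dequant", "dequantize_affine_float8", ["fp8_weight"],
             dont_fold=protect_dequant),
        Node("linear", "fp32_op", ["x", "dequant"]),
    ]


print(constant_fold(build_graph(protect_dequant=False)))
print(constant_fold(build_graph(protect_dequant=True)))
```

Without the flag, the dequantize node is folded into an fp32 constant, so a later pass can no longer match the FP8 pattern. With the flag set, nothing is folded and the pattern survives for the fusion pass.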

LevelDownRefine · 2025-07-16T08:09:27Z

This is now supported in torchao.

pytorch-bot Bot added the module: inductor, release notes: quantization, and release notes: AO frontend labels May 15, 2025

facebook-github-bot added the fx label May 15, 2025

pytorchbot added the open source label May 15, 2025

LevelDownRefine marked this pull request as draft May 15, 2025 08:16

determine whether to round according to dtype

fa508f3

LevelDownRefine marked this pull request as ready for review May 16, 2025 02:54

LevelDownRefine requested a review from jerryzh168 as a code owner May 16, 2025 02:54

CaoE reviewed May 16, 2025

Comment thread test/inductor/test_cpu_repro.py

CaoE reviewed May 16, 2025

Comment thread torch/_inductor/lowering.py

Xia-Weiwen reviewed May 16, 2025

Comment thread test/inductor/test_cpu_repro.py Outdated

Comment thread test/inductor/test_cpu_repro.py Outdated

Comment thread test/quantization/core/test_quantized_tensor.py Outdated

LevelDownRefine added 2 commits May 16, 2025 09:22

add decomposed test

ebe848b

add lowering ut; improve code style

06456a5

jerryzh168 reviewed May 16, 2025

jerryzh168 added the triaged label May 16, 2025

LevelDownRefine changed the title ~~determine whether to round according to dtype~~ [Quant] add FP8 support in quantize ops May 22, 2025

LevelDownRefine mentioned this pull request May 22, 2025

[Quant] Can quant not be decomposed on inductor? pytorch/ao#2228

Closed

LevelDownRefine requested review from CaoE and Xia-Weiwen May 22, 2025 06:55

LevelDownRefine added 2 commits May 22, 2025 13:59

refine code

8ceabcb

Merge remote-tracking branch 'origin/main' into wengshiy/fp8_quant

a88490a

LevelDownRefine requested a review from jerryzh168 May 23, 2025 01:32

LevelDownRefine mentioned this pull request Jun 3, 2025

Add dont constant fold flag #154945

Closed

LevelDownRefine mentioned this pull request Jun 9, 2025

skip quant/dequant decomposed pytorch/ao#2299

Closed

LevelDownRefine mentioned this pull request Jun 16, 2025

[float8] Prevent quantize_affine_float8/dequantize_affine_float8 decomposed on inductor pytorch/ao#2379

Merged

LevelDownRefine closed this Jul 16, 2025

LevelDownRefine wants to merge 5 commits into pytorch:main from LevelDownRefine:wengshiy/fp8_quant

6 participants
