Skip expanding scales for rowwise fp8 quantize by andrewor14 · Pull Request #2950 · pytorch/ao

andrewor14 · 2025-09-06T00:08:10Z

Summary: #2253 added a step in quantize_affine_float8 to expand the scales for blockwise quantization. The purpose of this step is to make the scales always broadcastable with the input tensor. However, this is unnecessary for rowwise quantization, which already has broadcastable shapes, e.g.

scale = [32, 1]
input = [32, 16]

Today, we will repeat_interleave the above scales to pad the scale tensor until it reaches [32, 16], which adds non-trivial memory and latency overhead. This commit adds a fast path to skip this expanding step if we detect rowwise quantization.
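To put a number on the overhead described above, here is a shape-only sketch in plain Python. It is not the torchao implementation: `expansion_factors` is a hypothetical helper written for this illustration, and it only does arithmetic on shape tuples. It shows how many times each scale dimension would have to be repeated, and how much larger the materialized scale tensor becomes for the example shapes.

```python
from math import prod


def expansion_factors(scale_shape, target_shape):
    """Hypothetical helper: per-dimension repeat counts needed to grow
    scale_shape to target_shape (shape arithmetic only, no tensors)."""
    if len(scale_shape) != len(target_shape):
        raise ValueError("scale and target must have the same rank")
    factors = []
    for s, t in zip(scale_shape, target_shape):
        if t % s != 0:
            raise ValueError(f"dimension {t} is not a multiple of {s}")
        factors.append(t // s)
    return tuple(factors)


scale_shape = (32, 1)    # rowwise scale from the example above
input_shape = (32, 16)   # input from the example above

factors = expansion_factors(scale_shape, input_shape)
elements_before = prod(scale_shape)   # scale elements stored without expansion
elements_after = prod(input_shape)    # scale elements stored after expansion
growth = elements_after // elements_before

print(factors, elements_before, elements_after, growth)
```

For these shapes the expanded scale holds 16 times as many elements as the original, even though the `[32, 1]` scale already broadcasts against the `[32, 16]` input.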

Test Plan:

python test/quantization/test_quant_primitives.py -k test_maybe_expand_scale_to_tensor_shape

Also compared fine-tuning Qwen3-1.7B with fp8-fp8 QAT using batch size 32 on a single H100 GPU:

Before: 25.34 GB peak memory, 3047.25 tok/s
After: 22.53 GB peak memory, 3358.49 tok/s
This PR uses 11.1% less memory and is 10.2% faster
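
The percentages above follow directly from the reported measurements. A minimal arithmetic check (the variable names are illustrative only):

```python
# Measurements reported in the test plan above.
before_mem_gb, after_mem_gb = 25.34, 22.53
before_tok_s, after_tok_s = 3047.25, 3358.49

# Relative reduction in peak memory and relative gain in throughput.
memory_saving_pct = (before_mem_gb - after_mem_gb) / before_mem_gb * 100
speedup_pct = (after_tok_s - before_tok_s) / before_tok_s * 100

print(f"{memory_saving_pct:.1f}% less memory")
print(f"{speedup_pct:.1f}% faster")
```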

pytorch-bot · 2025-09-06T00:10:07Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2950

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4cf5c90 with merge base 4872c4f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg · 2025-09-06T04:14:30Z

        return scale

+    # For rowwise quantization, just return the scale as is
+    if scale.shape[:-1] == target_shape[:-1] and scale.shape[-1] == 1:


you could probably do something fun, like

def is_trivial_expandable(scale, target_shape):
    return all(a == b or a == 1 for a, b in zip(scale.shape, target_shape))
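
To see how the suggested check differs from the rowwise condition in the diff, the sketch below applies both to plain shape tuples. This is an illustration under assumptions, not code from the PR: `is_rowwise` and `is_trivially_expandable` are hypothetical names, they take shapes rather than tensors, and the explicit rank check is an addition here because `zip` silently truncates when the two shapes have different lengths.

```python
def is_rowwise(scale_shape, target_shape):
    # Mirrors the condition in the diff: all leading dims match the target
    # and the last scale dim is 1.
    return scale_shape[:-1] == target_shape[:-1] and scale_shape[-1] == 1


def is_trivially_expandable(scale_shape, target_shape):
    # Reviewer's idea adapted to tuples: every scale dim is either equal to
    # the target dim or 1, so ordinary broadcasting already applies.
    # The rank check is an assumption added for this sketch.
    if len(scale_shape) != len(target_shape):
        return False
    return all(a == b or a == 1 for a, b in zip(scale_shape, target_shape))


target = (32, 16)
cases = {
    "rowwise": (32, 1),
    "columnwise": (1, 16),
    "per-tensor": (1, 1),
    "blockwise": (32, 4),
}
results = {
    name: (is_rowwise(shape, target), is_trivially_expandable(shape, target))
    for name, shape in cases.items()
}
for name, (rowwise, trivial) in results.items():
    print(f"{name}: rowwise={rowwise}, trivially_expandable={trivial}")
```

Both checks accept the rowwise scale and reject the blockwise one. The generalized check additionally accepts the `(1, 16)` and `(1, 1)` shapes, which the rowwise condition does not.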

meta-cla Bot added the CLA Signed label Sep 6, 2025

andrewor14 added the topic: improvement label and removed the CLA Signed label Sep 6, 2025

andrewor14 requested review from drisspg, jerryzh168 and vkuzo September 6, 2025 00:08

andrewor14 force-pushed the reduce-fp8-qat-memory branch from 780003d to 935ac1a Compare September 6, 2025 00:10

meta-cla Bot added the CLA Signed label Sep 6, 2025

drisspg reviewed Sep 6, 2025

Comment thread test/quantization/test_quant_primitives.py

drisspg reviewed Sep 6, 2025

andrewor14 force-pushed the reduce-fp8-qat-memory branch from 935ac1a to 4cf5c90 Compare September 8, 2025 13:14

andrewor14 requested a review from drisspg September 8, 2025 13:14

drisspg approved these changes Sep 8, 2025

andrewor14 merged commit a54417d into main Sep 8, 2025
18 checks passed

andrewor14 merged 1 commit into main from reduce-fp8-qat-memory
