add a_1_128_w_128_128 (DeepSeek) float8 scaling for inference by vkuzo · Pull Request #3257 · pytorch/ao

vkuzo · 2025-10-29T11:05:25Z

Summary:

Basic enablement of the a_1_128_w_128_128 float8 scaling recipe in
torchao inference. In detail:

1. updates the triton kernels for this scaling type to (a) be importable in an env without triton (for CI), and (b) support `torch.compile` for the gemm
2. enables the new granularity in various utility functions
3. wires the new granularity through the float8 inference configs
4. adds a test which checks e2e numerical correctness via SQNR comparison vs a high precision baseline
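
To make the recipe name concrete, here is a minimal pure-Python sketch of how the scales could be laid out: one scale per 1x128 slice of each activation row (`a_1_128`) and one scale per 128x128 weight tile (`w_128_128`). It assumes the convention `scale = amax / 448` (448 being the largest finite `float8_e4m3fn` value) and a small epsilon clamp; the function names are hypothetical and this is not the torchao implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn
EPS = 1e-12           # assumed clamp so an all-zero block does not yield scale 0


def activation_scales(x, block=128):
    """One scale per 1 x `block` slice of each row: shape (M, K // block)."""
    return [
        [
            max(max(abs(v) for v in row[c:c + block]), EPS) / FP8_E4M3_MAX
            for c in range(0, len(row), block)
        ]
        for row in x
    ]


def weight_scales(w, block=128):
    """One scale per `block` x `block` tile: shape (N // block, K // block)."""
    scales = []
    for r in range(0, len(w), block):
        tile_row = []
        for c in range(0, len(w[0]), block):
            amax = max(abs(v) for row in w[r:r + block] for v in row[c:c + block])
            tile_row.append(max(amax, EPS) / FP8_E4M3_MAX)
        scales.append(tile_row)
    return scales
```

With `block=128`, an activation of shape (M, K) gets M x K/128 scales and a weight of shape (N, K) gets N/128 x K/128 scales, which is why the weight side needs far fewer scale values than the activation side.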

For now we only have fallback kernels which require triton and are numerically
correct but may not reach optimal performance. Performance optimization is
left for future PRs:

1. we should map the gemm to `torch._scaled_mm` for CUDA 12.9+
2. we should enable an fbgemm_gpu_genai path, if available in user env
3. we should map to a triton kernel for quantizing the weights, as `torch.compile` is currently known slow for 128x128 block quantization

Further accuracy testing and enablement of more features is left for future PRs, to keep PR size small.

Test Plan:

pytest test/quantization/quantize_/workflows/float8/test_float8_tensor.py -s -x
pytest test/dtypes/test_affine_quantized_float.py -s -x
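
For readers unfamiliar with the SQNR check mentioned in the summary, the following is a minimal sketch of the usual definition, 20 * log10(||ref|| / ||ref - actual||), written in pure Python. The function name is hypothetical, and the thresholds used by the actual tests are not shown here.

```python
import math


def sqnr_db(reference, actual):
    """Signal-to-quantization-noise ratio in dB between two flat sequences."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, actual)))
    if noise == 0.0:
        return math.inf  # identical outputs: no quantization noise at all
    return 20.0 * math.log10(signal / noise)
```

A higher value means the quantized output is closer to the high precision baseline; every additional 20 dB corresponds to a 10x smaller error norm relative to the signal norm.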

vkuzo · 2025-10-29T11:05:26Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-10-29T11:05:29Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3257

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit c4769a6 with merge base 1e473ed:

NEW FAILURES - The following jobs have failed:

Run 1xH100 Tests / test (H100, linux.aws.h100, --pre torch torchvision torchaudio fbgemm-gpu-genai --index-url https... / linux-job (gh)
test_expected_gpu_kernel_fbgemm
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
test/sparsity/test_sparse_api.py::TestQuantSemiSparse::test_sparse_marlin_compile_True

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary:

Basic enablement of the a_1_128_w_128_128 float8 scaling recipe in torchao inference. In detail:

1. bring the 128x128 gemm triton kernel we have out of prototype and wrap it with a custom op for `torch.compile` compatibility
2. enable the new granularity in various utility functions
3. wire the new granularity through the float8 inference configs
4. add a test which tests for e2e numerical correctness via SQNR comparison vs high precision baseline

For now I added a fallback which only requires triton and is numerically correct but may not reach optimal performance. Performance optimization is left for future PRs:

1. we should map the gemm to `torch._scaled_mm` for CUDA 12.9+
2. we should enable an fbgemm_gpu_genai path, if available in user env
3. we should map to a triton kernel for quantizing the weights, as `torch.compile` is currently known slow for 128x128 block quantization

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: db464e1
ghstack-comment-id: 3460951962
Pull-Request: #3257

vkuzo · 2025-10-29T15:26:06Z

-        triton.cdiv(N, meta["BLOCK_SIZE"]),
+from torch.utils._triton import has_triton
+
+if has_triton():


most of the changes in this file are just indentation from adding the `if has_triton()` statement
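
The guard pattern this comment refers to can be sketched with the standard library alone. The snippet below uses `importlib.util.find_spec` as a stand-in for `has_triton()`, and the helper and function names are hypothetical. What the real `else:` branch defines is not shown in this thread, so the stub that raises is an assumption.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("triton"):

    def blockwise_gemm_backend() -> str:
        # In the real file, the triton kernels would be defined in this branch.
        return "triton"

else:

    def blockwise_gemm_backend() -> str:
        # Assumed stub: the module stays importable, and calling it fails loudly.
        raise NotImplementedError("this kernel requires triton")
```

The point of the pattern is that importing the module never fails in an environment without triton (such as CPU-only CI); only calling a kernel does.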

vkuzo · 2025-10-29T15:26:17Z

+        mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
+        tl.store(c_ptrs, c, mask=mask)
+
+    @torch.library.custom_op("ao::blockwise_fp8_gemm", mutates_args=())


non-indent change 1

vkuzo · 2025-10-29T15:26:26Z

+        )
+        return c
+
+    @blockwise_fp8_gemm.register_fake


non-indent change 2

vkuzo · 2025-10-29T15:26:35Z

+        fp8_blockwise_weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
+        return y
+
+else:


non-indent change 3
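
For context on the dequant call shown in the diff above, blockwise weight dequantization can be sketched in pure Python as multiplying each element by the scale of the tile it belongs to. This assumes the `y = x * s` convention; the function name is hypothetical, and it mirrors only the arithmetic, not the triton kernel's memory layout or launch grid.

```python
def blockwise_dequant(q, scales, block=128):
    """Multiply q[i][j] by the scale of its (i // block, j // block) tile."""
    return [
        [q[i][j] * scales[i // block][j // block] for j in range(len(q[0]))]
        for i in range(len(q))
    ]
```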

jerryzh168

looks good, I think we can also keep the 1x128 block to align with what deepseek is calling it now

jerryzh168

oh since we are changing the meaning of PerBlock, please update the doc a bit as well:

ao/torchao/quantization/granularity.py

Line 107 in 1e473ed

class PerBlock(Granularity):

vkuzo · 2025-10-31T19:42:36Z

CI failures exist on main branch, landing

vkuzo added 3 commits October 29, 2025 04:05

Update 990ef89

Update cce08f0

Update 681277a

This was referenced Oct 29, 2025

properly skip float8 inference tests without fbgemm #3255

Merged

move float8 blockwise kernels out of prototype #3256

Merged

meta-cla Bot added the CLA Signed label Oct 29, 2025

vkuzo changed the title ~~Add a_1_128_w_128_128 (DeepSeek style) float8 scaling for inference~~ (wip) Add a_1_128_w_128_128 (DeepSeek style) float8 scaling for inference Oct 29, 2025

vkuzo added the topic: new feature label Oct 29, 2025

vkuzo added 2 commits October 29, 2025 07:07

Update 26ade98

Update f76e10b

Update 6994e20

vkuzo changed the title ~~(wip) Add a_1_128_w_128_128 (DeepSeek style) float8 scaling for inference~~ skeleton of a_1_128_w_128_128 (DeepSeek) float8 scaling for inference Oct 29, 2025

vkuzo changed the title ~~skeleton of a_1_128_w_128_128 (DeepSeek) float8 scaling for inference~~ add a_1_128_w_128_128 (DeepSeek) float8 scaling for inference Oct 29, 2025

Update 1aff468

vkuzo commented Oct 29, 2025

Comment thread test/quantization/quantize_/workflows/float8/test_float8_tensor.py

vkuzo commented Oct 29, 2025

vkuzo mentioned this pull request Oct 29, 2025

add bias handling for a_1_128_w_128_128 float8 scaling #3259

Merged

vkuzo added 3 commits October 30, 2025 04:17

Update 57b8876

Update 1161f7f

Update 00c6bbb

vkuzo changed the base branch from gh/vkuzo/157/head to main October 30, 2025 11:18

Update ce5a8eb

jerryzh168 reviewed Oct 30, 2025

Comment thread test/quantization/quantize_/workflows/float8/test_float8_tensor.py Outdated

jerryzh168 reviewed Oct 30, 2025

Comment thread torchao/float8/inference.py

jerryzh168 reviewed Oct 30, 2025

Comment thread torchao/quantization/utils.py

Update 6a3684b

vkuzo mentioned this pull request Oct 30, 2025

Makes fallback float8 1x128 by 128x128 gemm output bfloat16 #3265

Merged

jerryzh168 approved these changes Oct 30, 2025

jerryzh168 reviewed Oct 30, 2025

vkuzo added 2 commits October 31, 2025 06:26

Update 6c087b4

Update c4769a6

vkuzo mentioned this pull request Oct 31, 2025

support eval of float8_a1x128_w128x128 #3269

Merged

vkuzo merged commit b49178c into main Oct 31, 2025
44 of 50 checks passed

vkuzo merged 14 commits into main from gh/vkuzo/158/head

2 participants
