Introduce new W8A8-FP-CSR quantization API #3258
namgyu-youn wants to merge 7 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3258
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 9 New Failures: as of commit f5f7a17 with merge base 3577306, the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jcaip could you please check this PR?
cc @namgyu-youn Can you split this into two PRs? One for int8 and one for float8. In general I don't think we want to introduce weight-only sparsity configs for int8 and float8, because we don't have mixed-dtype kernel support currently. The only kernels we have are for int8 x int8 2:4 sparse and fp8 x fp8 2:4 sparse. I would like Int8SemiSparseTensor though, but I think it should live in prototype until we have a user for it. Also cc @bbeckca, who has been working on the fp8 x fp8 2:4 sparse tensor subclass migration in #3182.
@jcaip if we want to move int8 2:4 sparse to prototype, then we don't need to migrate the tensor, I think.
Okay, then I'll address only
cc @namgyu-youn I talked to @bbeckca and I think your PR is closer, so let's use it instead.
cc @jcaip to request review, thanks.
jcaip left a comment:
cc @namgyu-youn
I think there's a bit of confusion on what the tensor subclass should be storing and how to do the op overload.
Please take a look at https://github.com/pytorch/ao/pull/3182/files#diff-afc7dd21d2b704181a6fd55be989426c0217a2bbfb694af9eb9746239ec462ed for the appropriate logic / ops to be called.
```python
class Float8SemiSparseTensor(TorchAOBaseTensor):
    """
    W8A8-FP-CSR: float8 quantized tensor with 2:4 semi-structured sparsity layout
    ...
    float8_dtype: float8 dtype variant
    """
```

nit: the comment looks wrong; CSR is compressed sparse row, and that's not the sparse format used here (2:4 sparsity).
```python
tensor_data_names = ["qdata", "qdata_compressed", "scale"]
```

I think `quantized_sparse_data` and `quantized_sparse_metadata` would be better variable names here: `quantized_sparse_data` holds the specified values and `quantized_sparse_metadata` holds the sparsity metadata.
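A sketch of the suggested storage layout, assuming the `TorchAOBaseTensor` conventions the PR already uses:

```python
from torchao.utils import TorchAOBaseTensor

class Float8SemiSparseTensor(TorchAOBaseTensor):
    # quantized_sparse_data: the specified (kept) fp8 values of the 2:4 weight
    # quantized_sparse_metadata: the metadata locating those values in each group
    # scale: the dequantization scale
    tensor_data_names = ["quantized_sparse_data", "quantized_sparse_metadata", "scale"]
```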
```python
    )

    @property
    def qdata_fp8(self):
```
```python
w_sparse.view(-1, 4).scatter_(1, pruning_inds, value=0)

# Check for all-zero (sparsity=1) tensor
if w_sparse.abs().max() == 0:
```

I think this should be supported, actually? I don't see why we should error here.
```python
with torch.no_grad():
    w_sparse = w.clone()

pruning_inds = w_sparse.abs().view(-1, 4).argsort(dim=1)[:, :2]
```

you can use this util:
Line 101 in 315e9b4
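For context, the quoted lines implement magnitude-based 2:4 pruning; a self-contained equivalent is below (the referenced torchao util would replace this hand-rolled version):

```python
import torch

def prune_24_by_magnitude(w: torch.Tensor) -> torch.Tensor:
    # Zero out the two smallest-magnitude entries in each contiguous group
    # of four, which is exactly what the argsort/scatter_ lines above do.
    w_sparse = w.clone()
    pruning_inds = w_sparse.abs().view(-1, 4).argsort(dim=1)[:, :2]
    w_sparse.view(-1, 4).scatter_(1, pruning_inds, value=0)
    return w_sparse
```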
```python
# Store fp8 data in both dense and compressed formats
fp8_data_fp16 = fp8_data.to(torch.float16)

fp8_compressed = to_sparse_semi_structured(fp8_data_fp16)
```

We should use the torchao cutlass packing kernels here, not the default torch ones:
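A sketch of what that could look like, assuming the `torchao.ops.to_sparse_semi_structured_cutlass_sm9x_f8` packing op used by torchao's existing CUTLASS semi-sparse layout (it packs fp8 directly, so no fp16 round trip is needed); `prune_24_by_magnitude` is the helper sketched earlier:

```python
import torch
from torchao.ops import to_sparse_semi_structured_cutlass_sm9x_f8

w = torch.randn(128, 128, device="cuda", dtype=torch.bfloat16)
fp8_data = prune_24_by_magnitude(w).to(torch.float8_e4m3fn)
# pack into specified values + 2:4 metadata with the torchao CUTLASS kernel,
# instead of calling to_sparse_semi_structured() on an fp16 copy
sparse, meta = to_sparse_semi_structured_cutlass_sm9x_f8(fp8_data)
```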
```python
if not (scale > 0).all():
    raise ValueError(f"Scale contains non-positive values: min={scale.min()}")

scale_expanded = scale.unsqueeze(1)
```

Is this different from Float8Tensor? Can we use the same scale calculation logic as we use there?
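For comparison, the dynamic rowwise scale used for Float8Tensor-style quantization is essentially the per-row amax over the fp8 max; a sketch (function name illustrative):

```python
import torch

def rowwise_fp8_scale(w: torch.Tensor, fp8_dtype: torch.dtype = torch.float8_e4m3fn) -> torch.Tensor:
    # per-row amax mapped onto the fp8 representable range; positive by
    # construction, so no explicit (scale > 0) validation is needed
    amax = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    return amax.to(torch.float32) / torch.finfo(fp8_dtype).max
```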
```python
fp8_compressed = to_sparse_semi_structured(fp8_data_fp16)

return cls(
    fp8_data,  # dense for dequantization
```

We shouldn't be storing both the dense data and the compressed data; we should be storing the sparse specified values and the sparse metadata.
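A sketch of a constructor following that advice, meant to slot into the class sketch above; it reuses the hypothetical helpers from earlier snippets (`prune_24_by_magnitude`, `rowwise_fp8_scale`) plus the assumed torchao packing op:

```python
import torch
from torchao.ops import to_sparse_semi_structured_cutlass_sm9x_f8

@classmethod
def from_hp(cls, w: torch.Tensor) -> "Float8SemiSparseTensor":
    scale = rowwise_fp8_scale(w)                       # (out_features, 1)
    w_24 = prune_24_by_magnitude(w)                    # 2:4 pruned, high precision
    fp8_data = (w_24 / scale).to(torch.float8_e4m3fn)  # quantize
    sparse, meta = to_sparse_semi_structured_cutlass_sm9x_f8(fp8_data)
    # store only the specified values + metadata; no dense copy retained
    return cls(sparse, meta, scale.squeeze(-1))
```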
```python
    float8_dtype=float8_dtype,
)

def dequantize(self, output_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
```

we should multiply by an identity matrix to dequantize, like we do here:
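The identity-multiply trick: feeding an identity matrix through the sparse fp8 linear kernel expands the packed weight back to dense, with the scale fused in. A sketch assuming the `torchao.ops.rowwise_scaled_linear_sparse_cutlass_f8f8` op; treat the exact signature and shapes here as illustrative:

```python
import torch
from torchao.ops import rowwise_scaled_linear_sparse_cutlass_f8f8

def dequantize_24_fp8(sparse, meta, scale, in_features, out_dtype=torch.bfloat16):
    # eye(in_features) @ W.T == W.T, so one sparse matmul against the identity
    # recovers the dense (pruned) weight; the kernel applies `scale` in its epilogue
    eye = torch.eye(in_features, device=sparse.device).to(torch.float8_e4m3fn)
    eye_scale = torch.ones(in_features, device=sparse.device)
    w_t = rowwise_scaled_linear_sparse_cutlass_f8f8(eye, eye_scale, sparse, meta, scale)
    return w_t.t().to(out_dtype)
```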
```python
x_vals_fp8 = scaled_x.to(torch.float8_e4m3fn)

# MatMul
x_padded = SparseSemiStructuredTensorCUSPARSELT._pad_dense_input(
```

We should use the torchao cutlass fp8 kernels, which fuse in the scale multiplication here.
See
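A sketch of the fused path, again assuming the `rowwise_scaled_linear_sparse_cutlass_f8f8` op and the field names from the storage sketch above (all illustrative):

```python
import torch
from torchao.ops import rowwise_scaled_linear_sparse_cutlass_f8f8

def fp8_24_linear(x, weight_tensor, bias=None):
    # dynamic rowwise fp8 quantization of the activation
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12).float()
    x_scale = x_scale / torch.finfo(torch.float8_e4m3fn).max
    x_fp8 = (x / x_scale).to(torch.float8_e4m3fn)
    # one fused kernel: fp8 x fp8 2:4 sparse matmul with both scales applied,
    # no padding or fp16 detour through cuSPARSELt needed
    return rowwise_scaled_linear_sparse_cutlass_f8f8(
        x_fp8,
        x_scale,
        weight_tensor.quantized_sparse_data,
        weight_tensor.quantized_sparse_metadata,
        weight_tensor.scale,
        bias,
    )
```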
delete? we don't want this to be in prototype I think
this should be added to the init file without the prototype in the path
also need to add to Float8DynamicActivationFloat8WeightConfig?
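For context, the end-user flow this would plug into; whether the sparse subclass lands behind `Float8DynamicActivationFloat8WeightConfig` or a dedicated config is exactly the open question, so the wiring below is hypothetical:

```python
import torch
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

model = torch.nn.Sequential(torch.nn.Linear(256, 256)).cuda().bfloat16()
# hypothetical: once Float8SemiSparseTensor is wired into this config, a call
# like this would swap eligible Linear weights for the fp8 2:4 sparse subclass
quantize_(model, Float8DynamicActivationFloat8WeightConfig())
```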
@jcaip Thanks a lot for the comprehensive review. I didn't know there was an already-opened PR (#3182), and I found my implementation is quite far from it (mostly the ops and kernels). Therefore, the right move seems to be reopening #3182 and letting me update it after the last review. Is it okay to go with this?
@namgyu-youn I think it'll be easier for me to just migrate this over; mind if I take over the PR? #3182 is also quite far from landing.
@pytorchbot label "sparsity"
Summary:
Introduce a new W8A8-FP-CSR quantization API, Float8SemiSparseTensor, which specializes in the 2:4 semi-sparse pattern using cuSPARSELt acceleration (https://docs.nvidia.com/cuda/cusparselt/).

Related Issue/PR: #2752
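A minimal round-trip sketch of the two core operations this PR introduces (the import path and constructor name are assumptions based on torchao's tensor-subclass conventions, not the final API):

```python
import torch
# hypothetical import path; depends on where the subclass finally lands
from torchao.prototype.quantization import Float8SemiSparseTensor

w = torch.randn(128, 128, dtype=torch.bfloat16, device="cuda")
w_q = Float8SemiSparseTensor.from_hp(w)   # prune to 2:4, quantize to fp8, pack
w_dq = w_q.dequantize(torch.bfloat16)     # expand back to a dense bf16 tensor
```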
Future Plan:
This PR only introduces core operations (quantization/dequantization). For better API support, we have to introduce tensor utility operations like indexing and slicing.
Test Plan:
test/prototype/quantization/quantize_/float8/test_float8_semisparse_tensor.py