
[sparse] Migrate Float8SemiSparseTensor off of AQT #3361

Merged
jcaip merged 25 commits into main from jcaip/fp8-semi-sparse-migration
Dec 22, 2025

Conversation

@jcaip
Contributor

@jcaip jcaip commented Nov 20, 2025

This PR migrates Float8DynamicActivationFloat8SemiSparseWeightConfig off of the AQT CutlassSemiSparseLayout subclass.

The old AQT flow can still be used by passing version=1 into the config.

Testing:

pytest test/quantization/quantize_/workflows/float8/test_float8_semi_sparse_tensor.py
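For reviewers trying this out, a minimal usage sketch, assuming the config is importable from torchao.quantization.quant_api (the exact import path is an assumption):

import torch
from torchao.quantization import quantize_
from torchao.quantization.quant_api import (
    Float8DynamicActivationFloat8SemiSparseWeightConfig,
)

model = torch.nn.Linear(1024, 1024, bias=False).cuda().to(torch.bfloat16)

# New flow (default): the weight is swapped for the new Float8SemiSparseTensor subclass.
quantize_(model, Float8DynamicActivationFloat8SemiSparseWeightConfig())

# Old AQT flow, kept for backwards compatibility:
# quantize_(model, Float8DynamicActivationFloat8SemiSparseWeightConfig(version=1))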

@pytorch-bot

pytorch-bot Bot commented Nov 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3361

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 2c7f730 with merge base 7035fb7:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 20, 2025
@jcaip jcaip added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label Nov 20, 2025
@jcaip jcaip requested a review from jerryzh168 November 20, 2025 18:21
@meta-codesync

meta-codesync Bot commented Nov 20, 2025

@jcaip has imported this pull request. If you are a Meta employee, you can view this in D87560869.


"""Use torchao cutlass kernel for fp8 + 2:4 sparse mm, requires building torchao with CUDA
"""
SPARSE_CUTLASS = "sparse_cutlass"
Contributor

my understanding is this is a new packing format, why is this a new kernel preference?

Contributor Author

sparse_cutlass vs sparse_cusparselt/hipsparselt is something we will need for AMD support coming up next half, which sounds like a kernel preference to me (deciding which op to use).

But if this is more of a general thing and packing_format is the more specific way to decide op dispatch, I am fine with using that as well.

Contributor

@jcaip, it would be good to specify whether the data format will be different and the kernels different, or whether the data format is the same and only the kernels differ.
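To make the distinction concrete, a hypothetical sketch of the two axes (illustrative names, not the merged code): a packing format names the byte layout stored on the tensor, while a kernel preference picks between ops that consume the same bytes.

from enum import Enum

class PackingFormat(str, Enum):
    # different bytes: fp8 specified values + a separate metadata tensor
    SPARSE_CUTLASS = "sparse_cutlass"
    # different bytes: one packed tensor with the metadata appended
    SPARSE_CUSPARSELT = "sparse_cusparselt"

class KernelPreference(str, Enum):
    # same stored bytes, different op chosen at dispatch time
    AUTO = "auto"
    FBGEMM = "fbgemm"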

Comment on lines +172 to +178
    kernel_choice = "sparse_cutlass"
elif kernel_preference == KernelPreference.SPARSE_CUTLASS:
    # if the user explicitly chose the SPARSE_CUTLASS kernel preference, use the cutlass sparse kernel
    assert is_sm_at_least_90(), (
        "Specified sparse_cutlass kernel and hardware is not >= SM 9.0 (>= H100)"
    )
    kernel_choice = "sparse_cutlass"
Contributor

if "sparse_cutlass" is the only option, then I don't think we are dealing with a kernel preference here?

from .float8_tensor import QuantizeTensorToFloat8Kwargs


class Float8SemiSparseTensor(TorchAOBaseTensor):
Contributor

is there a more descriptive name, something like Float8With2By4SparsityTensor?

    dtype: Optional[torch.dtype] = None,
):
    super().__init__()
    self.sparse_quantized_data = sparse_quantized_data
Contributor

how about qdata, to match the other tensors?

Contributor Author

We can do sparse_qdata? But I think just qdata is a bit confusing, since the quantized data is split between the specified values and the metadata.
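For background on that split, a rough sketch using PyTorch's public semi-structured sparse API (fp16 here purely for illustration): a 2:4-sparse matrix compresses into a half-size tensor of "specified" values plus a small metadata tensor recording which two of every four positions were kept.

import torch
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured

SparseSemiStructuredTensor._FORCE_CUTLASS = True  # use the CUTLASS backend

w = torch.randn(128, 128, device="cuda", dtype=torch.float16)
# zero two of every four elements along each row to satisfy the 2:4 pattern
w = w * torch.tensor([1, 1, 0, 0], device="cuda").tile(128, 32)
w_sparse = to_sparse_semi_structured(w)
# w_sparse holds the kept values and the position metadata as separate buffers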

"""
Sparse packing formats for 2:4 sparsity + FP8 quantization
"""
SPARSE_CUTLASS = "sparse_cutlass"
Contributor

The intent is for the sparse tensor to use OPAQUE, and you can keep these formats internal to your workflow

Comment on lines +57 to +63
SPARSE_CUTLASS = "sparse_cutlass"

"""
SPARSE_CUSPARSELT will pack the quantized data into a single tensor, sparse_qdata, which contains the specified values with the metadata appended.
This packing format will dispatch to `_cslt_sparse_mm`, which does not fuse per-row scaling into the matmul.
"""
SPARSE_CUSPARSELT = "sparse_cusparselt"
Contributor

should these belong to Float8PackingFormat? we structure these by "dtype" currently

Contributor Author

I think Float8PackingFormat was removed recently, so we can't reuse.

Contributor

it's fine to add again for this I think
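For readers following the thread, a hypothetical sketch of what the unfused scaling implies for the cuSPARSELt path (sparse_mm stands in for _cslt_sparse_mm; all names are illustrative):

def rowwise_scaled_sparse_mm_unfused(a_fp8, w_packed, scale_a, scale_w, sparse_mm):
    # sparse_mm computes only the raw fp8 x fp8 sparse matmul
    out = sparse_mm(a_fp8, w_packed)  # [M, N], no scaling applied
    # the per-row activation and weight scales need a second elementwise pass
    return out * scale_a.view(-1, 1) * scale_w.view(1, -1)

The CUTLASS kernel (rowwise_scaled_linear_sparse_cutlass_f8f8) presumably takes the scales as arguments and applies them inside the matmul, avoiding that extra pass.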

Contributor Author

@jcaip jcaip left a comment

cc @bbeckca. I left some comments, but this should be pretty close to landing.

I think Randy has some scripts for running the ads workflows; we should test the default version bump on those before we land.

One quick heads-up on running this locally: you'll need to build the custom kernels with USE_CPP=1 pip install -e . --no-build-isolation, otherwise you won't have op support.


@jcaip jcaip requested a review from vkuzo December 18, 2025 05:33
from .float8_tensor import QuantizeTensorToFloat8Kwargs


class Sparse2x4Float8Tensor(TorchAOBaseTensor):
Contributor

the way we structure these is actually a one-to-one correspondence between packing formats and tensors, for example:

if int4_choose_qparams_algorithm == Int4ChooseQParamsAlgorithm.HQQ:
    assert int4_packing_format == Int4PackingFormat.TILE_PACKED_TO_4D, (
        f"Int4ChooseQParamsAlgorithm.HQQ is not supported by packing format {int4_packing_format}, "
        f"it's only supported by Int4PackingFormat.TILE_PACKED_TO_4D currently"
    )
if int4_packing_format == Int4PackingFormat.PRESHUFFLED:
    new_weight = Int4PreshuffledTensor.from_hp(
        weight,
        block_size,
        activation_dtype=torch.bfloat16,
    )
    return new_weight
elif int4_packing_format == Int4PackingFormat.PLAIN:
    new_weight = Int4Tensor.from_hp(
        weight,
        block_size,
    )
    return new_weight
elif int4_packing_format == Int4PackingFormat.PLAIN_INT32:
    new_weight = Int4PlainInt32Tensor.from_hp(
        weight,
        block_size,
    )
    return new_weight
elif int4_packing_format == Int4PackingFormat.MARLIN_SPARSE:
    new_weight = Int4MarlinSparseTensor.from_hp(
        weight,
        block_size,
    )
    return new_weight
elif int4_packing_format == Int4PackingFormat.TILE_PACKED_TO_4D:
    new_weight = Int4TilePackedTo4dTensor.from_hp(
        weight,
        block_size,
        int4_choose_qparams_algorithm=int4_choose_qparams_algorithm,
    )
    return new_weight
else:
    raise ValueError(f"Unsupported int4 packing format: {int4_packing_format}")

Contributor Author

OK, I'll just make this the CUTLASS format and open a new PR for cuSPARSELt.

@jerryzh168
Contributor

I think we should split the tensor into 2, one for each packing format
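Following the int4 pattern quoted above, a hypothetical sketch of what that split might look like on the float8 side (tensor and enum names are assumptions, not the merged code):

if packing_format == Float8PackingFormat.SPARSE_CUTLASS:
    # CUTLASS layout: fp8 specified values + a separate metadata tensor
    return Float8SemiSparseTensor.from_hp(weight)
elif packing_format == Float8PackingFormat.SPARSE_CUSPARSELT:
    # cuSPARSELt layout: one packed tensor; left to a follow-up PR
    return Float8SemiSparseCusparseltTensor.from_hp(weight)
else:
    raise ValueError(f"Unsupported float8 packing format: {packing_format}")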

@jcaip jcaip requested a review from jerryzh168 December 19, 2025 19:27
return out


@implements(aten.clone.default)
Contributor

this should be supported already by TorchAOBaseTensor I think?
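For context, TorchAOBaseTensor can derive ops like clone generically for subclasses that declare their components via the tensor_data_names / tensor_attribute_names convention; a rough sketch of the pattern (simplified, not the actual implementation):

def _generic_clone(tensor):
    # rebuild the subclass from cloned component tensors plus the
    # non-tensor attributes, so no per-subclass override is needed
    cloned = [getattr(tensor, name).clone() for name in tensor.tensor_data_names]
    attrs = [getattr(tensor, name) for name in tensor.tensor_attribute_names]
    return tensor.__class__(*cloned, *attrs)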

args[1],
args[2] if len(args) > 2 else None,
)
from torchao.ops import rowwise_scaled_linear_sparse_cutlass_f8f8
Contributor

nit: this is already imported at the top of the file

@jerryzh168
Contributor

The main thing is to move the config into Float8DynamicActivationFloat8WeightConfig, I think; the others are mostly nits.

"Int8Tensor",
"QuantizeTensorToInt8Kwargs",
"Float8Tensor",
"Sparse2x4Float8TensorCUTLASS",
Contributor

nit: Sparse2x4CUTLASSFloat8Tensor

if packing_format == Float8PackingFormat.PLAIN and isinstance(
    weight_granularity, PerRow
):
    assert weight.dtype == torch.bfloat16, (
Contributor

probably better to move and duplicate this code into the config.version == 1 and the config.version == 2 + packing_format == Float8PackingFormat.PLAIN branches for now (the current modification will skip this assertion for v1)

Contributor Author

looks like v1 config just got deleted, so this should be simpler now

model = torch.compile(model)
sparse_result = model(input)

torch.testing.assert_close(
Contributor

nit: I think we can use SQNR, and also compare the PLAIN and SPARSE_CUTLASS formats
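A sketch of the suggested test shape, where model_plain and model_sparse are hypothetical copies of the model quantized with each packing format, compute_error is the SQNR helper used later in this thread, and 20 dB is a threshold picked purely for illustration:

from torchao.quantization.utils import compute_error

baseline_result = model_bf16(input)   # unquantized reference
plain_result = model_plain(input)     # PLAIN packing format
sparse_result = model_sparse(input)   # SPARSE_CUTLASS packing format

# both quantized paths should stay close to the bf16 baseline
self.assertGreater(compute_error(baseline_result, plain_result), 20)
self.assertGreater(compute_error(baseline_result, sparse_result), 20)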

cloned = model.weight.clone().dequantize()

for o, c in zip(original, cloned):
    torch.testing.assert_close(o, c, atol=0.0, rtol=0.0)
Contributor

nit: assertEqual

Contributor

@jerryzh168 jerryzh168 left a comment

thanks, LGTM

sparse_result = model(input)
sparse_sqnr = compute_error(baseline_result, sparse_result)

self.assertEqual(dense_sqnr, sparse_sqnr)
Contributor

would these be the same?

I meant we should just compare dense_result and sparse_result, and the SQNR should be high.

Contributor Author

@jcaip jcaip Dec 20, 2025

dense_result and sparse_result should be numerically identical because we mask the weights ahead of time; the differences are just because of compile.

I think that's a better check than SQNR between the dense unmasked and sparse results.
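A sketch of that stricter comparison (make_2_4_mask and sparse_linear are hypothetical stand-ins): since the weight is pruned to the 2:4 pattern before quantization, the dense and sparse paths see identical weights and the outputs can be compared directly.

w_masked = w * make_2_4_mask(w)   # prune ahead of time (hypothetical helper)
dense_out = x @ w_masked.t()      # dense reference over the masked weight
sparse_out = sparse_linear(x)     # sparse kernel over the same masked weight
# identical math up to kernel/compile reordering, so tight tolerances hold
torch.testing.assert_close(dense_out, sparse_out, rtol=1e-2, atol=1e-2)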

Contributor

@jerryzh168 jerryzh168 Dec 20, 2025

I see, then should we compare them directly?

I'm unsure what "making sure the SQNRs are equivalent" means; is it less strict than equality? Something similar to assert allclose?

Contributor Author

Yeah, I will change this to the old test. Checking that the SQNR is equivalent is less strict than that.

Contributor

nit: I think the file name should be aligned with the tensor name

from .float8_tensor import QuantizeTensorToFloat8Kwargs


class Sparse2x4CUTLASSFloat8Tensor(TorchAOBaseTensor):
Contributor

nit: same here, file name should be aligned with the tensor name Sparse2x4CUTLASSFloat8Tensor

)


class TestSparse2x4Float8Tensor(common_utils.TestCase):
Contributor

also the test name

@jcaip jcaip merged commit 486fe0d into main Dec 22, 2025
20 of 23 checks passed