
Add _scaled_mm_v2 API#164141

Closed
slayton58 wants to merge 10 commits into gh/slayton58/16/base from gh/slayton58/16/head

Conversation

@slayton58
Contributor

@slayton58 slayton58 commented Sep 29, 2025

Stack from ghstack (oldest at bottom):

Summary:

  • Add new scaled-MM API to future-proof and clean up existing code.
  • Scaling is explicitly described rather than inferred
  • Swizzling of scales must now be defined (vs. inferred)
  • Adds API support for multi-level scaling
  • Refactor dispatch logic to make it easier to add new implementations

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Sep 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164141

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 49eb776 with merge base 3288fbf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@slayton58
Contributor Author

@pytorchbot label "release notes: quantization"

@pytorch-bot pytorch-bot bot added the release notes: quantization label Sep 29, 2025
@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Rebased gh/slayton58/17/orig onto refs/remotes/origin/viable/strict because #164142 was rebased, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/164141)

bool use_fast_accum,
Tensor& out) {
// Restrictions:
// A, B are FP8, scales are fp32, shape M/N for A/B
Collaborator

Some of these should be TORCH_CHECK_VALUE or raise NotImplementedError

Contributor Author

What's the guidance for TORCH_CHECK vs. TORCH_CHECK_VALUE vs. raising NotImplementedError? From a quick look, only TORCH_CHECK is used in this particular file to this point

Collaborator

@slayton58 TORCH_CHECK raises RuntimeError, TORCH_CHECK_VALUE raises ValueError; I don't know if we have a pre-existing macro for raising NotImplementedError. RuntimeErrors are under-specified: it could be a transient error, a user error, etc. ValueError denotes a user error, i.e. the args to the specified function are invalid. NotImplementedError means something could be supported in upstream PyTorch, but we haven't implemented it yet.

Most of this file should probably be changed to ValueError at some point, but that can be a subsequent PR
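For readers outside the codebase, the macros map onto Python exception types roughly as described above. Here is a minimal standalone sketch of that mapping — these stand-ins are not the real TORCH_CHECK* macros (which live in c10/util/Exception.h), just an illustration of the error-type distinction:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch: TORCH_CHECK surfaces in Python as RuntimeError,
// TORCH_CHECK_VALUE as ValueError. These macros only mimic that mapping.
#define CHECK_RUNTIME(cond, msg) \
  do { if (!(cond)) throw std::runtime_error(msg); } while (0)
#define CHECK_VALUE(cond, msg) \
  do { if (!(cond)) throw std::invalid_argument(msg); } while (0)

// Example validation in the style discussed above: a bad user argument
// should surface as a ValueError, not a generic RuntimeError.
inline void check_scale_dtype_is_fp32(bool is_fp32) {
  CHECK_VALUE(is_fp32, "scale tensors must be fp32");
}
```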

[ghstack-poisoned]
@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Oct 2, 2025

@slayton58 I like the explicit instead of inferred scaling type.

Question: why do the scaling recipe strings only repeat for fp8 types but not for MX types? e.g.,

  • ScaledGemmImplementation::TENSORWISE_TENSORWISE => "tensorwise_tensorwise"
  • ScaledGemmImplementation::MXFP8_MXFP8 => "mxfp8"

Seems like we should stick with one pattern for consistency. Is having it repeat twice (presumably once for each operand) meant to future-proof against recipes which may have different scaling types for each operand?

- func: _scaled_mm_v2(Tensor self, Tensor mat2, Tensor[] scale_a, int[] recipe_a, int[] swizzle_a, Tensor[] scale_b, int[] recipe_b, int[] swizzle_b, Tensor? bias, ScalarType? out_dtype, int[] contraction_dim=[], bool use_fast_accum=False) -> Tensor
variants: function
dispatch:
CUDA: _scaled_mm_cuda_v2
Contributor

We should think about how this will look for other backends, e.g. do we ever think that CPU will support some subset of recipes? I don't think that changes anything, just wanted to note it.

Contributor Author

CPU can/will certainly support some subset (or superset) of recipes - I'm honestly not sure what is / will be supported though. A CPU backend wouldn't be hard to add (or a shim to dispatch to the existing _scaled_mm_cpu backend).

* strictly-typed enum
*/
template <class EnumType>
std::vector<EnumType> convert_int_to_enum(ArrayRef<long>& v) {
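For context, the quoted helper (truncated above) might look like the following standalone sketch, with std::vector<long> standing in for ArrayRef<long> and a hypothetical enum with made-up values:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical strictly-typed enum; the real code converts to the
// scaling/swizzle enums passed through native_functions as int[].
enum class SwizzleType : int64_t { NO_SWIZZLE = 0, SWIZZLE_A = 1, SWIZZLE_B = 2 };

// Sketch of convert_int_to_enum: cast each integer to the enum,
// rejecting values outside the known range rather than trusting callers.
template <class EnumType>
std::vector<EnumType> convert_int_to_enum(const std::vector<long>& v,
                                          long max_valid) {
  std::vector<EnumType> out;
  out.reserve(v.size());
  for (long i : v) {
    if (i < 0 || i > max_valid) {
      throw std::invalid_argument("integer out of range for enum");
    }
    out.push_back(static_cast<EnumType>(i));
  }
  return out;
}
```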
Contributor

Might be worth pybinding the type, I think I did this w/ SDPBackend to make sure things stay consistent

Contributor Author

I am 100% willing to do this - I don't love the current approach - an example would be amazing. I think I've conflated "no enums through native_functions" with "no enums shared at all between Python/C++ in PyTorch"..

Contributor

Spoke offline but for provenance:

py::enum_<sdp::SDPBackend>(

bool found_impl = false;
ScaledGemmImplementation gemm_impl = ScaledGemmImplementation::NONE;

for (const auto& fn_entry : scale_kernel_dispatch) {
Contributor

Should it be a scan down, finding the first match? Or should it just be a direct map from recipe pair to impl?

Contributor Author

The big problem I have with a map -> impl is that different implementations can (and imo really should) have different signatures - nvfp4 x nvfp4 needs global scales and swizzles passed in addition, for instance. I think this might be work-around-able with std::bind to present a unified API, but that's also messy..

Contributor

Yeah, that makes sense, but to confirm: every accept_fn matches against the enum, right? So there is no ambiguity?
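For provenance, the first-match scan being discussed can be sketched like this — recipe matching is simplified to a string key here, while the real table pairs each entry with an acceptance function inspecting dtypes, scales, and scaling types:

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <string>
#include <tuple>

enum class ScaledGemmImplementation { NONE, TENSORWISE_TENSORWISE, MXFP8_MXFP8 };

// Acceptance functions return true if this implementation can handle
// the requested recipe (simplified to a string in this sketch).
using accept_fn = std::function<bool(const std::string&)>;

// First-match-wins linear scan over the dispatch table.
inline ScaledGemmImplementation dispatch(const std::string& recipe) {
  static const std::array<
      std::tuple<std::string, accept_fn, ScaledGemmImplementation>, 2>
      table = {{
          {"tensorwise_tensorwise",
           [](const std::string& r) { return r == "tensorwise_tensorwise"; },
           ScaledGemmImplementation::TENSORWISE_TENSORWISE},
          {"mxfp8",
           [](const std::string& r) { return r == "mxfp8"; },
           ScaledGemmImplementation::MXFP8_MXFP8},
      }};
  for (const auto& [name, accepts, impl] : table) {
    if (accepts(recipe)) return impl;  // first matching entry wins
  }
  return ScaledGemmImplementation::NONE;
}
```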

@slayton58
Contributor Author

@slayton58 I like the explicit instead of inferred scaling type.

Question: why do the scaling recipe strings only repeat for fp8 types but not for MX types? e.g.,

  • ScaledGemmImplementation::TENSORWISE_TENSORWISE => "tensorwise_tensorwise"
  • ScaledGemmImplementation::MXFP8_MXFP8 => "mxfp8"

Seems like we should stick with one pattern for consistency. Is having it repeat twice (presumably once for each operand) meant to future-proof against recipes which may have different scaling types for each operand?

@danielvegamyhre
I see what you mean here - I was writing in the sense that "mxfp8" is (to me at least) a clearly-defined combination of input types - calling "mxfp8" gives all the information one needs for both arguments, vs. having to explicitly repeat for both inputs. However, the intention was to provide an interface that was extensible enough for potential future combinations (mxfp8 activations x mxfp4 weights is one that I've seen put forward for instance), and in that sense, repeating for all inputs, so

ScaledGemmImplementation::MXFP8_MXFP8 => "mxfp8_mxfp8"

isn't a bad idea for consistency.

[ghstack-poisoned]
* Both inputs must be fp8
* A, B must only have 1 scale each; A: Blockwise_1x128 (float), B: Blockwise_128x128 (float)
*/
bool check_deepseek_recipe(ScalingType expected_recipe_a,
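A standalone sketch of what such an acceptance check does — enum values and the 128x-style name are illustrative, and the real check also validates dtypes and tensor shapes:

```cpp
#include <cassert>
#include <vector>

// Illustrative scaling-type enum for this sketch.
enum class ScalingType { TensorWise, RowWise, BlockWise1x128, BlockWise128x128 };

// Sketch of the recipe check under discussion: exactly one scale per
// operand, A scaled 1x128 and B scaled 128x128 (fp32 dtype checks elided).
inline bool check_1x128_128x128_recipe(const std::vector<ScalingType>& recipe_a,
                                       const std::vector<ScalingType>& recipe_b) {
  return recipe_a.size() == 1 && recipe_b.size() == 1 &&
         recipe_a[0] == ScalingType::BlockWise1x128 &&
         recipe_b[0] == ScalingType::BlockWise128x128;
}
```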
Contributor

nit: let's change the name

Contributor Author

From deepseek? What would you prefer?

Contributor

maybe just name it after the 128x recipes, similar to the other ones

using acceptance_fn = std::function<bool(c10::ScalarType, std::vector<ScalingType>&, ArrayRef<Tensor>&, c10::ScalarType, std::vector<ScalingType>&, ArrayRef<Tensor>&)>;
using namespace std::placeholders;

std::array<std::tuple<std::string, acceptance_fn, ScaledGemmImplementation>, 8> scale_kernel_dispatch = {{
Contributor

I think I wrote a utility called

return {{std::forward<T>(t)...}};

which helps w/ seg faults as we expand this list - don't ask me how I know 😂
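The utility being referred to is presumably a size-deducing make_array-style helper; a sketch of the idea (name hypothetical) — the array length is deduced from the argument count, so the table can grow without a hard-coded size constant drifting out of sync with the entry list:

```cpp
#include <array>
#include <cassert>
#include <utility>

// Hypothetical make_array: deduce the std::array length from the number
// of arguments, so adding an entry can never leave trailing
// value-initialized slots (e.g. empty std::function entries).
template <class T, class... Args>
constexpr std::array<T, sizeof...(Args)> make_array(Args&&... args) {
  return {{std::forward<Args>(args)...}};
}
```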

{ "mxfp8", check_mxfp8_recipe, ScaledGemmImplementation::MXFP8_MXFP8}}};

Tensor&
_cutlass_scaled_gemm(
Contributor

Nit: the name is weird since it goes to a lot more than just cutlass

#ifdef USE_ROCM
auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
#define TUNABLE_DISPATCH(BLASOP_A, BLASOP_B) \
Contributor

I would love the AMD path to be a little more modular, e.g. one #ifdef: USE_ROCM -> go to the ROCm path, otherwise go to the CUDA path. It makes it a lot easier to grok.
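The suggested structure - a single top-level #ifdef selecting a backend path - might look like this (function names hypothetical; the real entry points would wrap the tunable-op and cuBLAS/hipBLASLt dispatch respectively):

```cpp
#include <cassert>
#include <string>

// Hypothetical backend entry points, each readable as a straight line.
inline std::string scaled_gemm_rocm() { return "rocm"; }
inline std::string scaled_gemm_cuda() { return "cuda"; }

// One top-level switch instead of interleaved #ifdef blocks.
inline std::string scaled_gemm() {
#ifdef USE_ROCM
  return scaled_gemm_rocm();
#else
  return scaled_gemm_cuda();
#endif
}
```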

auto scaling_choice_b = ScalingType::RowWise;
//
// NVIDIA's cuBLAS only started supporting row-wise scaling in version 12.9,
// and only for compute capability 9.0+. In other cases we use CUTLASS.
Contributor

Is rowwise also not supported on sm100 like the 128x recipes?

Contributor Author

9.0+ = >= 9.0, so yes, it should be on Blackwell too (I think my CC nomenclature here is different to yours :D )

Contributor

Ohh, I just meant: didn't the 128x recipes ONLY work on sm90 and not Blackwell? Not sure if cuBLAS did the same thing for the rowwise recipes

Contributor

@drisspg drisspg left a comment

Comment soup but LGTM

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #164142

[ghstack-poisoned]
