
Remove amax return from _scaled_mm #128683

Closed
drisspg wants to merge 1 commit into pytorch:main from drisspg:remove-low-precision-option

Conversation

@drisspg
Contributor

@drisspg drisspg commented Jun 14, 2024

Summary

The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues:

  • Tensor arguments as kwarg only
  • multiple outputs from triton templates

If the need for the amax return arises, we can consider either adding it back or, more likely, creating a separate op.

In principle, PyTorch is moving away from "mega ops" that bundle lots of functionality; we instead rely on the compiler to generate appropriately fused kernels.

Changes:

  • This removes the amax return from _scaled_mm. We have found that the common use case is to return in "high precision" (a type with more precision than fp8), and amax is only relevant when returning in low precision.
  • We currently still allow fp8 returns with a scaled result. Perhaps we should ban this as well...

New signature:

def meta_scaled_mm(
    self: torch.Tensor,
    mat2: torch.Tensor,
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    scale_result: Optional[torch.Tensor] = None,
    out_dtype: Optional[torch.dtype] = None,
    use_fast_accum: bool = False,
) -> torch.Tensor:
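The semantics of the new single-return signature can be sketched in plain Python (a minimal reference model, not the real CUDA implementation: plain floats stand in for fp8 values, scalar scales stand in for the tensor-wise scale tensors, and `scaled_mm_ref`/`amax` are illustrative names only). The separate `amax` helper shows what callers now compute themselves, since the op no longer returns it:

```python
# Minimal pure-Python sketch of the new _scaled_mm semantics.
# Reference model only: the real op runs fused fp8 kernels
# (cuBLAS/CUTLASS/Triton); plain floats stand in for fp8 here.

def scaled_mm_ref(self_mat, mat2, scale_a, scale_b, bias=None):
    """out = (self * scale_a) @ (mat2 * scale_b) + bias — one return, no amax."""
    m, k = len(self_mat), len(self_mat[0])
    n = len(mat2[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0  # accumulate in "high precision"
            for p in range(k):
                acc += (self_mat[i][p] * scale_a) * (mat2[p][j] * scale_b)
            out[i][j] = acc + (bias[j] if bias is not None else 0.0)
    return out

def amax(mat):
    """With the amax return removed, callers compute it separately."""
    return max(abs(x) for row in mat for x in row)

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
out = scaled_mm_ref(a, b, scale_a=0.5, scale_b=2.0)
print(out)        # [[1.0, -2.0], [3.0, 4.0]]
print(amax(out))  # 4.0
```

Computing amax as a separate elementwise reduction is exactly the pattern the compiler can then fuse into surrounding kernels, per the "mega ops" rationale above.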

@pytorch-bot

pytorch-bot bot commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128683

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (14 Unrelated Failures)

As of commit 476e817 with merge base f8d60e0 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg force-pushed the remove-low-precision-option branch from 4abcd73 to 1b7669e on June 14, 2024 04:58
@drisspg drisspg changed the title Remove amax return types Remove amax return from _scaled_mm Jun 14, 2024
@drisspg drisspg force-pushed the remove-low-precision-option branch 4 times, most recently from 1c46fb1 to b1f566c on June 14, 2024 23:46
@drisspg drisspg marked this pull request as ready for review June 14, 2024 23:46
@drisspg drisspg requested a review from eqy as a code owner June 14, 2024 23:46
@drisspg drisspg requested review from vkuzo and yangsiyu007 June 15, 2024 00:04
@drisspg drisspg added the topic: not user facing topic category label Jun 15, 2024
@drisspg drisspg force-pushed the remove-low-precision-option branch 2 times, most recently from 5a24d53 to dc1313b on June 15, 2024 01:55
Contributor

@vkuzo vkuzo left a comment


awesome! lg if tests pass

@vkuzo
Contributor

vkuzo commented Jun 15, 2024

for my own curiosity, what was the reason for making scales required?

@drisspg
Contributor Author

drisspg commented Jun 15, 2024

cc @yangsiyu007 on the Inductor constraint that scales can't be optional.

That being said, in retrospect I think this makes more sense. It makes sense that "scaled_mm" requires the scales, lol, since it is pretty rare (modulo testing) that proper use of this function doesn't require scales.

@yangsiyu007
Contributor

yangsiyu007 commented Jun 15, 2024

[Edited] Checked that lowering now works, output: P1419523227
You can see that for tensor-wise scaling, which ran first, AUTOTUNE happened between the ATen _scaled_mm and the Triton templated kernels' configs; for row-wise scaling next, only Triton configs were tuned.

@yangsiyu007
Contributor

for my own curiosity, what was the reason for making scales required?

Inductor doesn't currently support optional tensor inputs; the symptom is that it checks the layout of each input tensor and errors upon seeing None (I tried a workaround of giving it an empty TensorBox, but that leads to incorrect codegen because a pass drops the unused nodes). There is a workaround for handling exactly one optional tensor input, which is why we are okay with the optional bias (we keep two Triton templates, with and without bias). I'd like to work on supporting optional tensor inputs, but since it makes sense for the scales to be non-optional, we thought we'd proceed for now.
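The two-template workaround described above can be illustrated with a toy dispatcher (hypothetical names; the real Inductor machinery registers two Triton templates and selects one at lowering time, so that no kernel ever receives a None tensor):

```python
# Toy illustration of the "two templates" workaround for one optional
# tensor input: rather than passing None into a kernel, lowering picks
# a variant whose signature has no bias argument at all.

def mm_kernel_with_bias(a, b, bias):
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) + bias[j]
             for j in range(len(b[0]))] for i in range(len(a))]

def mm_kernel_no_bias(a, b):
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lower_mm(a, b, bias=None):
    # Select the template at lowering time so every kernel input is a
    # real tensor. This doubles the template count per optional input,
    # which is why it doesn't scale past one — and why the scales were
    # made required instead of optional.
    if bias is None:
        return mm_kernel_no_bias(a, b)
    return mm_kernel_with_bias(a, b, bias)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
print(lower_mm(a, b))                   # [[1.0, 2.0], [3.0, 4.0]]
print(lower_mm(a, b, bias=[1.0, 1.0]))  # [[2.0, 3.0], [4.0, 5.0]]
```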

@drisspg drisspg force-pushed the remove-low-precision-option branch from dc1313b to 476e817 on June 17, 2024 03:01
@drisspg drisspg added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 17, 2024
@drisspg
Contributor Author

drisspg commented Jun 17, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@drisspg
Contributor Author

drisspg commented Jun 17, 2024

@pytorchbot merge -i

drisspg added a commit to drisspg/pytorch that referenced this pull request Jun 19, 2024
Summary:
Pull Request resolved: pytorch#129037

This forward fixes this diff:
D58699985

Since we have a few things in flight it would be much better to forward fix this test

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'

Differential Revision: D58767577
pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2024
Summary:
This forward fixes this diff:
D58699985

Since we have a few things in flight it would be much better to forward fix this test

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'

Differential Revision: D58767577

Pull Request resolved: #129037
Approved by: https://github.com/vkuzo
pytorchmergebot pushed a commit that referenced this pull request Jul 12, 2024
`_scaled_mm` no longer returns `amax` (see #128683)

Pull Request resolved: #130582
Approved by: https://github.com/drisspg
pytorchmergebot pushed a commit that referenced this pull request Jul 22, 2024
… cases (#130868)

Continuing #128683 and #130582.

The API of _scaled_mm has changed; for example, there is now only one return value. So change the AOTI API as well.

Also, tested the fp8 tests offline. test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping it for now.

The reason this wasn't known earlier is probably because the CI doesn't use H100.

Pull Request resolved: #130868
Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
`_scaled_mm` no longer returns `amax` (see pytorch#128683)

Pull Request resolved: pytorch#130582
Approved by: https://github.com/drisspg
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
… cases (pytorch#130868)

Continuing pytorch#128683 and pytorch#130582.

The API of _scaled_mm has changed; for example, there is now only one return value. So change the AOTI API as well.

Also, tested the fp8 tests offline. test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping it for now.

The reason this wasn't known earlier is probably because the CI doesn't use H100.

Pull Request resolved: pytorch#130868
Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2024
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in #128683.

The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in #125204) and Triton kernel configurations.

The Triton kernel template is based on htyu/FBGEMM@3ad9031 (D56337896) by @choutim, without using SPLIT_K, and that of mm `torch/_inductor/kernel/mm.py`

## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
    - output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
        - P1477224245 - 2 kernels
    - output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
        - P1477227340 - 2 kernels

- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”; this is because max-autotune selected the ATen kernel in the compiled case, and I think the remaining discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?

## Todo:
- Make the Triton template use the improved persistent kernel version (pytorch/FBGEMM#2735 by @htyu)

Pull Request resolved: #130422
Approved by: https://github.com/ipiszy
pytorchmergebot pushed a commit that referenced this pull request Sep 9, 2024
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: #135421
Approved by: https://github.com/drisspg, https://github.com/eqy
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
amax was removed from _scaled_mm by pytorch#128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: pytorch#135421
Approved by: https://github.com/drisspg, https://github.com/eqy
amd-sriram pushed a commit to ROCm/pytorch that referenced this pull request Nov 19, 2024
amax was removed from _scaled_mm by pytorch#128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: pytorch#135421
Approved by: https://github.com/drisspg, https://github.com/eqy
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Nov 21, 2024
amax was removed from _scaled_mm by pytorch#128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: pytorch#135421
Approved by: https://github.com/drisspg, https://github.com/eqy
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Nov 21, 2024
amax was removed from _scaled_mm by pytorch#128683. Remove it from the internal
at::cuda::blas::scaled_gemm, as well. This allows hipBLASLt to find
additional solutions rather than forcing amax to be used and then
discarding the result. Pull Request resolved:
pytorch#135421 Approved by:
https://github.com/drisspg, https://github.com/eqy
amd-sriram pushed a commit to ROCm/pytorch that referenced this pull request Nov 22, 2024
amax was removed from _scaled_mm by pytorch#128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: pytorch#135421
Approved by: https://github.com/drisspg, https://github.com/eqy
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Nov 22, 2024
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Dec 2, 2024
…comparison in the unit test, removing skip rocm decorator with cherry pick of 3ea3914
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Dec 5, 2024
…vs_emulated_*float*_cuda and Updating unit test case based on removing amax from _scaled_mm (#1762)

Fixes ROCm/frameworks-internal#8493 and
ROCm/frameworks-internal#10198

cherry pick commit -
39a6179

`amax was removed from _scaled_mm by pytorch#128683. Remove it from the
internal at::cuda::blas::scaled_gemm, as well. This allows hipBLASLt to
find additional solutions rather than forcing amax to be used and then
discarding the result.`

Also removing amax comparison in the unit test.
pytorchmergebot pushed a commit that referenced this pull request Jan 20, 2025
Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with scales.
Scales were optional arguments before PR #128683; then test_float8_scale started comparing two identical results and lost its meaning.
Reason for making scales required: #128683 (comment)

This PR uses scale=1.0 to compare result with scaled matmul
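The fix relies on the identity that a scaled matmul with unit scales equals the unscaled matmul. A quick sketch of the comparison the updated test performs (plain-Python stand-ins for the fp8 ops, not the actual test code):

```python
# Sketch of the repaired test logic: with scale=1.0 the scaled and
# unscaled paths must agree, so comparing them is meaningful again.

def mm(a, b):
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def scaled_mm(a, b, scale_a, scale_b):
    # Tensor-wise scaling commutes with the matmul, so it can be
    # modeled as a post-hoc elementwise multiply.
    return [[x * scale_a * scale_b for x in row] for row in mm(a, b)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
# With scale=1.0 the two code paths coincide...
assert scaled_mm(a, b, 1.0, 1.0) == mm(a, b)
# ...and with non-trivial scales they differ — the distinction the old
# test (which compared two identical scaled calls) failed to exercise.
assert scaled_mm(a, b, 2.0, 1.0) != mm(a, b)
print("ok")
```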

Pull Request resolved: #143912
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 17, 2025
…vs_emulated_*float*_cuda and Updating unit test case based on removing amax from _scaled_mm (#1762)

Fixes ROCm/frameworks-internal#8493 and
ROCm/frameworks-internal#10198

cherry pick commit -
39a6179

`amax was removed from _scaled_mm by pytorch#128683. Remove it from the
internal at::cuda::blas::scaled_gemm, as well. This allows hipBLASLt to
find additional solutions rather than forcing amax to be used and then
discarding the result.`

Also removing amax comparison in the unit test.

Labels

ciflow/inductor, ciflow/trunk, Merged, topic: not user facing

4 participants