use cooperative schedule in scaled_mm for fast_accum=false by ngimel · Pull Request #144809 · pytorch/pytorch

ngimel · 2025-01-14T23:12:28Z

This improves perf for large matrices by more than 2x; a more detailed benchmark is coming.
On master

On this branch

A plot similar to pytorch/ao#1325 (comment)

Benchmarking code:

import torch
from triton.testing import do_bench
import itertools

def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

def fn_aten(a, b, scale, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

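# Sweep M, N and K over powers of two from 2**9 (512) to 2**14 (16384).
# The exponent for K is named p so it is not shadowed by k below.
for i, j, p in itertools.product(range(9, 15), range(9, 15), range(9, 15)):
    m = 2**i
    n = 2**j
    k = 2**p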

    a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32)
    scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32)
    scale_0 = torch.randn((), device="cuda", dtype=torch.float32)

    ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50)
    ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50)

    ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50)
    ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50)

    print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow /ms_tensor_fast}, ratio_rw={ms_rowwise_slow / ms_rowwise_fast}")
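For readers unfamiliar with the two scaling modes being benchmarked, here is a dependency-free sketch of the arithmetic that, as I understand it, `torch._scaled_mm` applies: the matmul result is multiplied by a per-row scale for `a` and a per-column scale for the output (one value per row of `b` before the transpose). The function name `scaled_mm_reference` is hypothetical, and the sketch models only the math; it ignores FP8 quantization, accumulation precision (the thing `use_fast_accum` changes) and bfloat16 output rounding.

```python
def scaled_mm_reference(a, b, scale_a, scale_b):
    """Plain-Python model of a row-wise scaled matmul.

    a:       M x K nested list
    b:       N x K nested list (the benchmark passes b.t() to the kernel)
    scale_a: M values, one per row of a
    scale_b: N values, one per row of b
    Returns the M x N nested list out[i][j] = scale_a[i] * scale_b[j] * dot(a[i], b[j]).
    """
    m, n, k = len(a), len(b), len(a[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[j][p] for p in range(k))
            out[i][j] = acc * scale_a[i] * scale_b[j]
    return out


def tensorwise_reference(a, b, scale):
    """Tensor-wise scaling is the special case where every row shares one scale."""
    return scaled_mm_reference(a, b, [scale] * len(a), [scale] * len(b))
```

In the benchmark, `fn_aten_scales` corresponds to the row-wise case and `fn_aten` to the tensor-wise case with the same scalar used for both operands.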

Higher N/K values still have about a 40% penalty; perhaps some additional heuristic tweaks would be useful.
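To relate the `ratio_tw` / `ratio_rw` columns printed by the benchmark to a figure like "about 40% penalty", a small helper such as the following could be used. It is only a sketch: `slowdown_penalty` and `shapes_over_threshold` are hypothetical names, and the result-row format (dicts with `m`, `n`, `k`, `fast`, `slow` keys) is an assumption, not something the PR defines.

```python
def slowdown_penalty(ms_slow, ms_fast):
    """Return (ratio, penalty_percent) of the slow path relative to the fast path."""
    ratio = ms_slow / ms_fast
    return ratio, (ratio - 1.0) * 100.0


def shapes_over_threshold(rows, threshold_percent):
    """Keep the result rows whose slow-vs-fast penalty exceeds threshold_percent.

    rows is an assumed format: dicts with keys m, n, k, fast, slow (times in ms).
    """
    flagged = []
    for row in rows:
        _, penalty = slowdown_penalty(row["slow"], row["fast"])
        if penalty > threshold_percent:
            flagged.append((row["m"], row["n"], row["k"], penalty))
    return flagged
```

A ratio of 1.4 therefore corresponds to the 40% penalty mentioned above; filtering the sweep with a threshold makes it easier to see which shapes might benefit from further heuristic tuning.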

pytorch-bot · 2025-01-14T23:12:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144809

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6e61be7 with merge base 64bcf39:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge) (gh) (#144480)
backends/xnnpack/test/ops/test_conv1d.py::TestConv1d::test_qs8_conv1d_batchnorm_seq

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ngimel · 2025-01-15T20:50:09Z

@pytorchbot merge

pytorchmergebot · 2025-01-15T20:51:50Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

use cooperative schedule in scaled_mm for fast_accum=false

6e61be7

ngimel requested review from eqy and syed-ahmed as code owners January 14, 2025 23:12

pytorch-bot Bot added the release notes: cuda release notes category label Jan 14, 2025

ngimel requested review from drisspg and lw January 14, 2025 23:13

drisspg approved these changes Jan 14, 2025


pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 15, 2025

pytorchmergebot added the merging label Jan 15, 2025

pytorchmergebot added the Merged label Jan 15, 2025

pytorchmergebot closed this in 4e1834f Jan 15, 2025

pytorchmergebot removed the merging label Jan 15, 2025

github-actions Bot deleted the ngimel/scaled_mm_coop branch February 15, 2025 02:04

danielvegamyhre mentioned this pull request May 8, 2025

[float8] Investigate if workaround for slow cutlass rowwise GEMM when fast_accum=False is still needed after perf improvments and potentially optimize GEMM further pytorch/ao#2184

Open

ngimel wants to merge 1 commit into main from ngimel/scaled_mm_coop
