Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning by alexsamardzic · Pull Request #119986 · pytorch/pytorch

alexsamardzic · 2024-02-15T14:08:05Z

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot · 2024-02-15T14:08:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119986

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit 6ca9970 with merge base 5b90074:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / macos-12-py3-arm64 / test (default, 3, 3, macos-m1-stable) (gh)
test_tensorboard.py::TestTensorBoardSummary::test_hparams_bool

BROKEN TRUNK - The following jobs failed but were also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (cpu_inductor_timm, 1, 2, linux.12xlarge) (gh)
Process completed with exit code 1.
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (cpu_inductor_timm, 2, 2, linux.12xlarge) (gh)
Process completed with exit code 1.
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (cpu_inductor_torchbench, 2, 2, linux.12xlarge) (gh)
stable_diffusion_unet
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (dynamic_cpu_inductor_timm, 1, 2, linux.12xlarge) (gh)
Process completed with exit code 1.
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (dynamic_cpu_inductor_timm, 2, 2, linux.12xlarge) (gh)
Process completed with exit code 1.
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.12xlarge) (gh)
stable_diffusion_unet
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.12xlarge) (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

alexsamardzic · 2024-02-15T14:21:54Z

This PR enables generating CUTLASS kernels as candidates for auto-tuning of the mixed mm() op, for cases where one of the inputs is either int8 or uint8 and the other input is either float16 or bfloat16.
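The dtype condition described above can be restated as a small predicate. This is only a sketch of that sentence, not Inductor's actual eligibility check; the function name `is_mixed_mm_pair` and the use of plain dtype-name strings are assumptions made for illustration.

```python
# Hypothetical helper: restates the dtype condition from the PR description.
# It does not reproduce the checks Inductor itself performs.
INT8_DTYPES = {"int8", "uint8"}
HALF_DTYPES = {"float16", "bfloat16"}


def is_mixed_mm_pair(dtype_a: str, dtype_b: str) -> bool:
    """Return True if one dtype is (u)int8 and the other is (b)float16."""
    return (dtype_a in INT8_DTYPES and dtype_b in HALF_DTYPES) or (
        dtype_a in HALF_DTYPES and dtype_b in INT8_DTYPES
    )
```

For instance, `is_mixed_mm_pair("float16", "int8")` holds, while a pair of two float16 inputs does not qualify.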

Example code

import torch

from torch._inductor import config

# Path to the CUTLASS checkout bundled with the PyTorch sources.
_CUTLASS_DIR = ".../pytorch/third_party/cutlass"
# Restrict GEMM autotuning to CUTLASS-generated kernels only.
max_autotune_gemm_backends = "CUTLASS"
dynamic = False

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

op = torch.mm
def my_op(a, b):
    # Transposing b yields a column-major second operand;
    # the cast to a's dtype is what makes this a mixed MM.
    bt = b.T
    return op(a, bt.to(a.dtype))


# a: float16, row-major; b: int8, transposed inside my_op.
a = torch.rand((512, 2048), dtype=torch.float16).cuda()
b = torch.randint(0, 10, (1024, 2048), dtype=torch.int8).cuda()
# The wider of the two dtypes is used for the eager reference result.
dtype = a.dtype if a.element_size() >= b.element_size() else b.dtype

with config.patch(
    {
        "max_autotune": True,
        "autotune_in_subproc": False,
        "max_autotune_gemm_backends": max_autotune_gemm_backends,
        "cuda.cutlass_dir": _CUTLASS_DIR,
        "cuda.cutlass_max_profiling_configs": 8,
        "use_mixed_mm": True,
    }
):
    # Compiled (autotuned) result vs. eager reference computed in `dtype`.
    Y_compiled = torch.compile(my_op, dynamic=dynamic)(a, b)
    Y = my_op(a.to(dtype), b.to(dtype))
    print(Y_compiled[0:5, 0:5])
    print(Y[0:5, 0:5])
    torch.testing.assert_close(Y_compiled, Y)

Note that, just as mentioned for the previous PR in this stack, CUTLASS can only handle the case where the first operand is in row-major layout and the second operand is in column-major layout.
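The layout restriction can be checked from a 2-D tensor's shape and strides. The following is a minimal sketch in plain Python, not code from this PR: the helper names `is_row_major` and `is_column_major` are hypothetical, and the strides are written out by hand as those of contiguous tensors with the shapes from the example. With PyTorch, the same values would come from `tuple(t.shape)` and `t.stride()`.

```python
from typing import Sequence


def is_row_major(shape: Sequence[int], stride: Sequence[int]) -> bool:
    """2-D row-major: elements within a row are adjacent in memory."""
    rows, cols = shape
    return stride[1] == 1 and stride[0] >= cols


def is_column_major(shape: Sequence[int], stride: Sequence[int]) -> bool:
    """2-D column-major: elements within a column are adjacent in memory."""
    rows, cols = shape
    return stride[0] == 1 and stride[1] >= rows


# Shapes from the example; strides assume freshly created contiguous tensors.
a_shape, a_stride = (512, 2048), (2048, 1)     # a: contiguous, row-major
b_shape, b_stride = (1024, 2048), (2048, 1)    # b: contiguous, row-major
bt_shape, bt_stride = (2048, 1024), (1, 2048)  # b.T: same memory, strides swapped
```

Under these assumptions `a` is row-major and `b.T` is column-major, which is the combination the note says CUTLASS can handle.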

@ipiszy @cpuhrsch


kadeng

This also looks good in principle, but I would like to see some tests added to test_max_autotune.py. Are CUTLASS kernels actually picked during autotuning if you don't force them, i.e. are there cases where they are the fastest?
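One way to look into this question is to list several backends in `max_autotune_gemm_backends`, so that the autotuner chooses among them instead of being restricted to CUTLASS. A minimal sketch, reusing the option names from the example in the PR description; the helper `make_autotune_options` is hypothetical, the default backend list `"ATEN,TRITON,CUTLASS"` is an assumption about a reasonable comparison set, and the CUTLASS path is left to the caller.

```python
def make_autotune_options(cutlass_dir, backends=("ATEN", "TRITON", "CUTLASS")):
    """Build an Inductor config dict that lets several GEMM backends compete.

    Hypothetical helper: the keys are the ones used in the PR description,
    the backend list is an illustrative assumption.
    """
    return {
        "max_autotune": True,
        "autotune_in_subproc": False,
        "max_autotune_gemm_backends": ",".join(backends),
        "cuda.cutlass_dir": cutlass_dir,
        "cuda.cutlass_max_profiling_configs": 8,
        "use_mixed_mm": True,
    }


# Intended use (requires PyTorch with CUDA and a CUTLASS checkout):
#   from torch._inductor import config
#   with config.patch(make_autotune_options(cutlass_dir)):
#       compiled = torch.compile(my_op, dynamic=False)
```

Whether a CUTLASS kernel is then selected depends on the shapes, layouts, and hardware involved.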


alexsamardzic · 2024-02-28T16:33:59Z

Added a pure MM operation test to test_max_autotune.py, similar to what was mentioned in this comment for _int_mm() operator tuning. Also, here are benchmarking results of CUTLASS vs. Triton generated kernels for "Llama shapes" (the benchmarking script is given in the same comment; it only has to be run with the mixed command-line argument instead of int8):

Note that the results here actually come from two rounds of benchmarking: Triton only supports the row-major/row-major combination of layouts here, while CUTLASS only supports the row-major/column-major combination. So, while CUTLASS is faster most of the time, the comparison is not exactly apples-to-apples; on the other hand, CUTLASS provides an auto-tuning option here that would not be available without it.


cpuhrsch · 2024-03-09T04:03:01Z

@kadeng - Could you take another look please? Thank you.


kadeng

Looks good to me, thanks for your contribution!

Before we can merge: a conflicting PR, #121489, is in the process of being merged; it moves the CUTLASS backend tests into a separate file called test_cutlass_backend.py. I think we should wait until that one lands and then move the tests from this PR into test_cutlass_backend.py as well.

alexsamardzic · 2024-03-12T18:45:15Z

Sure, no problem.


alexsamardzic · 2024-03-13T09:21:22Z

Rebased on latest main, which now includes #121489; the newly added tests have been moved into test_cutlass_backend.py.


alexsamardzic · 2024-03-14T13:26:34Z

@pytorchbot merge

pytorchmergebot · 2024-03-14T13:29:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Add CUTLASS kernel as choice for (u)int8/(b)float16 mm() Inductor autotuning

fe3bdce

[ghstack-poisoned]

alexsamardzic mentioned this pull request Feb 15, 2024

Add CUTLASS kernel as choice for _int_mm() Inductor autotuning #119685

Closed

github-actions Bot added module: inductor ciflow/inductor labels Feb 15, 2024

alexsamardzic added open source topic: not user facing topic category labels Feb 15, 2024

alexsamardzic requested a review from ipiszy February 15, 2024 14:22

alexsamardzic added a commit that referenced this pull request Feb 15, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mm() Inductor autotuning

e1c8bbb

ghstack-source-id: fb8cdc0 Pull Request resolved: #119986

alexsamardzic changed the title ~~Add CUTLASS kernel as choice for (u)int8/(b)float16 mm() Inductor autotuning~~ Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning Feb 15, 2024

alexsamardzic added a commit that referenced this pull request Feb 15, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

ee5ba72

ghstack-source-id: 12836d4 Pull Request resolved: #119986

alexsamardzic mentioned this pull request Feb 16, 2024

Add couple configs into generator.py for mixed input MM NVIDIA/cutlass#1350

Merged

cpuhrsch requested a review from kadeng February 22, 2024 20:24

kadeng suggested changes Feb 23, 2024

View reviewed changes

alexsamardzic added a commit that referenced this pull request Feb 26, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

801cd59

ghstack-source-id: 5b5d0a3 Pull Request resolved: #119986

alexsamardzic mentioned this pull request Feb 27, 2024

Add CUTLASS kernel as choice for _int_mm() Inductor autotuning #120729

Closed

alexsamardzic added a commit that referenced this pull request Feb 27, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

c620c35

ghstack-source-id: 6801611 Pull Request resolved: #119986

alexsamardzic added a commit that referenced this pull request Feb 29, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

502547a

ghstack-source-id: 660fb41 Pull Request resolved: #119986

cpuhrsch requested a review from kadeng March 9, 2024 04:02

kadeng reviewed Mar 11, 2024

View reviewed changes

Comment thread torch/_inductor/kernel/mm.py

alexsamardzic added a commit that referenced this pull request Mar 11, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

27b2011

ghstack-source-id: d6c7429 Pull Request resolved: #119986

alexsamardzic added a commit that referenced this pull request Mar 12, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

75ebfc4

ghstack-source-id: ef47edb Pull Request resolved: #119986

kadeng approved these changes Mar 12, 2024

View reviewed changes

alexsamardzic added a commit that referenced this pull request Mar 13, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

d878524

ghstack-source-id: f7b71b9 Pull Request resolved: #119986

alexsamardzic added a commit that referenced this pull request Mar 14, 2024

Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning

06ae3ed

ghstack-source-id: bd65a5d Pull Request resolved: #119986

pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 14, 2024

pytorchmergebot added the merging label Mar 14, 2024

pytorchmergebot added Merged and removed merging labels Mar 14, 2024

pytorchmergebot closed this in 83f8e51 Mar 14, 2024

github-actions Bot deleted the gh/alexsamardzic/25/head branch April 14, 2024 02:20

alexsamardzic wants to merge 13 commits into gh/alexsamardzic/25/base from gh/alexsamardzic/25/head
