Closed
Contributor

This pull request was exported from Phabricator. Differential Revision: D58117182
htyu added a commit to htyu/FBGEMM that referenced this pull request on Jun 14, 2024

Summary: Pull Request resolved: pytorch#2735. Enabling persistent kernels for row-wise fp8_fast_accum=True/False. Differential Revision: D58117182
htyu added a commit to htyu/FBGEMM that referenced this pull request on Jun 14, 2024

Summary: Pull Request resolved: pytorch#2735. Enabling persistent kernels for row-wise fp8_fast_accum=True/False, based on the Triton upstream implementation (triton-lang/triton#4099). Differential Revision: D58117182
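The scheduling idea behind a persistent kernel can be illustrated outside of Triton. The sketch below is a hypothetical, pure-Python model of the pattern (names like `persistent_schedule`, `num_tiles`, and `num_programs` are illustrative, not from the FBGEMM source): instead of launching one GPU program per output tile, a fixed grid of programs (roughly one per SM) each loops over multiple tiles.

```python
# Hedged sketch of persistent-kernel scheduling: a fixed number of programs
# stride through the tile space, the same pattern a persistent Triton kernel
# expresses with `tl.program_id` plus an in-kernel loop over tile ids.

def persistent_schedule(num_tiles: int, num_programs: int) -> dict:
    """Return, for each program id, the list of output tiles it processes."""
    assignments = {pid: [] for pid in range(num_programs)}
    for pid in range(num_programs):
        # Each program handles tiles pid, pid + grid, pid + 2*grid, ...
        for tile in range(pid, num_tiles, num_programs):
            assignments[pid].append(tile)
    return assignments

# Example: 10 output tiles covered by 4 persistent programs.
print(persistent_schedule(10, 4))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```

This keeps every program resident for the whole GEMM, which avoids launch/teardown overhead per tile and can improve tail utilization when the tile count is not a multiple of the grid size.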
Contributor

This pull request has been merged in 8a938d6.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Jul 30, 2024
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in #128683. The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations;
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in #125204) and Triton kernel configurations.

The Triton kernel template is based on htyu/FBGEMM@3ad9031 (D56337896) by @choutim, without using SPLIT_K, and on the mm template in `torch/_inductor/kernel/mm.py`.

## Testing
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between a preceding pointwise/reduction op and the amax/cast:
  - output code for m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row': P1477224245 (2 kernels)
  - output code for m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row': P1477227340 (2 kernels)
- UT: `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking
Eager/compiled tensor-wise/row-wise scaling for various shapes: https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the "compiled" cases are slightly slower than "eager" because max-autotune selected the ATen kernel in the compiled case; I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with a pointwise/reduction preceding op for various shapes: https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers
- Should the type of the accumulator `ACC_TYPE` always be float32? If not, where is this type set (output layout?)?

## Todo
- Make the Triton template use the improved persistent kernel version (pytorch/FBGEMM#2735 by @htyu).

Pull Request resolved: #130422
Approved by: https://github.com/ipiszy
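For readers unfamiliar with the scaling modes discussed above, the arithmetic of row-wise scaling can be sketched with a small NumPy emulation. This is a hypothetical illustration of the math only, not the actual ATen/Triton kernels: `A` carries one dequantization scale per row, `B` one scale per column, and the low-precision product is rescaled by their outer product (the name `rowwise_scaled_mm` is invented for this sketch).

```python
# Pure-NumPy emulation of row-wise scaled matmul semantics (illustrative only).
import numpy as np

def rowwise_scaled_mm(a_q, b_q, a_scale, b_scale):
    """a_q: (M, K) quantized values, a_scale: (M, 1) per-row scales;
    b_q: (K, N) quantized values, b_scale: (1, N) per-column scales."""
    # Accumulate in float32 (a common choice for fp8 GEMM accumulators),
    # then apply the per-row and per-column scales via broadcasting.
    acc = a_q.astype(np.float32) @ b_q.astype(np.float32)
    return acc * a_scale * b_scale

# Round-trip check: scaling A's rows and B's columns by their max-abs value,
# multiplying the scaled operands, and rescaling recovers A @ B.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
a_scale = np.abs(a).max(axis=1, keepdims=True)   # shape (4, 1)
b_scale = np.abs(b).max(axis=0, keepdims=True)   # shape (1, 3)
out = rowwise_scaled_mm(a / a_scale, b / b_scale, a_scale, b_scale)
assert np.allclose(out, a @ b, atol=1e-5)
```

Tensor-wise scaling is the degenerate case where both scales are scalars; the per-row/per-column form is what enables fusing the preceding amax/cast into the surrounding graph, as described in the testing notes above.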
Summary: Enabling persistent kernels for row-wise fp8_fast_accum=True/False
Differential Revision: D58117182