Add an option to disable reduced precision reductions for FP16 GEMM #67946
eqy wants to merge 18 commits into
Some GEMM shapes benchmarked on V100:
Could we please document all these nuances? Perhaps by adding a new dedicated doc that speaks specifically to precision vs. performance control, or by adding it to the performance doc? It could then also cover how to enable tf32, from one of the recent PRs. Thank you!

I found the most relevant doc for this change: https://github.com/pytorch/pytorch/blob/master/docs/source/notes/numerical_accuracy.rst . So maybe it belongs there, with an xref added from cuda.rst?

I agree with @stas00; it makes sense to move the main portion of the docs to numerical_accuracy and expand it to mention that most of the math for GEMMs is done in fp32 precision but that, if reduced precision reduction is allowed, some intermediate results can be truncated to low precision, and then cross-link it from cuda. Does this apply to bf16 as well, by the way? It's harder to establish because bf16 only truncates the mantissa, so there won't be glaring overflows there.
Since the original change was only for |
fp16 GEMMs are potentially done with reduced precision reductions (e.g., in fp16 rather than fp32). This reduction in precision can allow for higher performance on certain workloads (particularly those with a large `k` dimension) and GPU architectures at the cost of numerical precision and potential for overflow.

Some example benchmark data on V100

.. code::
Reduced Precision Reduction in FP16 GEMMs
-----------------------------------------

fp16 GEMMs are potentially done with reduced precision reductions (e.g., in fp16 rather than fp32). This reduction in precision can allow for higher performance on certain workloads (particularly those with a large `k` dimension) and GPU architectures at the cost of numerical precision and potential for overflow.
Most of the GEMM accumulation is still done in fp32 precision; only a few truncations are performed. Can you please make the wording more accurate so that it does not imply that all the accumulation is done in fp16?
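To make the distinction in this comment concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions, not how cuBLAS actually schedules its reductions: the helper names `to_fp16` and `chunked_dot` are hypothetical, Python floats stand in for fp32, and the chunking scheme (full-precision accumulation inside each chunk, with only the partial sums combined in fp16) is an assumption chosen to mimic "a few truncations".

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (overflow gives +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the fp16 range; model that as overflow.
        return float("inf") if x > 0 else float("-inf")


def chunked_dot(a, b, chunk, reduce_in_fp16):
    """Dot product computed as a sum of per-chunk partial sums (toy model)."""
    total = 0.0
    for start in range(0, len(a), chunk):
        stop = start + chunk
        # Within a chunk, accumulate in full precision (stand-in for fp32).
        partial = sum(x * y for x, y in zip(a[start:stop], b[start:stop]))
        if reduce_in_fp16:
            # Assumed "reduced precision reduction": partial sums and the
            # running total are rounded to fp16 when they are combined.
            total = to_fp16(total + to_fp16(partial))
        else:
            total += partial
    # The result of an fp16 GEMM is stored in fp16 either way.
    return to_fp16(total)


# Overflow: the exact result is 0, but an intermediate total exceeds the fp16 range.
a = [16.0] * 256 + [-16.0] * 256
b = [16.0] * 512
overflow_reduced = chunked_dot(a, b, chunk=128, reduce_in_fp16=True)
overflow_full = chunked_dot(a, b, chunk=128, reduce_in_fp16=False)
print(overflow_reduced, overflow_full)  # inf 0.0

# Precision loss with a large k: the fp16 running total stalls at 2048.
ones = [1.0] * 4096
stalled_reduced = chunked_dot(ones, ones, chunk=1, reduce_in_fp16=True)
stalled_full = chunked_dot(ones, ones, chunk=1, reduce_in_fp16=False)
print(stalled_reduced, stalled_full)  # 2048.0 4096.0
```

In both cases the final answer is representable in fp16; the error comes entirely from the intermediate fp16 roundings. The second case uses `chunk=1`, which is a deliberately extreme setting that rounds after every element.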
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…16 GEMM (#89172)

Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.

CC @ptrblck @ngimel

Pull Request resolved: #89172
Approved by: https://github.com/ngimel
…ytorch#67946)

Summary: pytorch#67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = ` rather than making it the default behavior.

CC ngimel ptrblck stas00

Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`

Pull Request resolved: pytorch#67946
Reviewed By: zou3519
Differential Revision: D32289896
Pulled By: ngimel
fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe

#67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we found that this has a substantial performance impact on common GEMM shapes (e.g., those found in popular instantiations of multi-headed attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions, `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = `, rather than making it the default behavior.
CC @ngimel @ptrblck @stas00

Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`.
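As a rough usage sketch, the toggle can be flipped for a single region of code and then restored. The context manager `fp16_reduced_precision_reduction` below is a hypothetical helper written for illustration and is not part of PyTorch; only the `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction` attribute comes from this PR. The tensor shapes are arbitrary, and the matmul is skipped when no CUDA device is present.

```python
import contextlib

import torch


@contextlib.contextmanager
def fp16_reduced_precision_reduction(enabled: bool):
    """Hypothetical helper: temporarily set the flag, then restore the old value."""
    matmul = torch.backends.cuda.matmul
    previous = matmul.allow_fp16_reduced_precision_reduction
    matmul.allow_fp16_reduced_precision_reduction = enabled
    try:
        yield
    finally:
        matmul.allow_fp16_reduced_precision_reduction = previous


# Request full precision reductions for one accuracy-sensitive matmul.
with fp16_reduced_precision_reduction(False):
    if torch.cuda.is_available():
        a = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
        b = torch.randn(4096, 64, device="cuda", dtype=torch.float16)
        c = a @ b  # large k dimension, the case the docs call out
```

Scoping the change this way keeps the faster default for the rest of the program, which matters given the performance regressions described above.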