[cuDNN v8 API] cuDNN benchmark, convolution bwd / transposed convolution fwd, bfloat16, conv-bias-activation fusion #60755
eqy wants to merge 86 commits into pytorch:master from
Conversation
💊 CI failures summary (Dr. CI): As of commit 15bcf38, there are no failures yet.
force-pushed from 09ad8be to 13fa945
`bfloat16`, conv-bias-activation fusion
force-pushed from 00ee436 to 252e381
force-pushed from a09ffdf to b8e262b
⚛️ CI Flow Status: You can add a comment to the PR and tag @pytorchbot with the following commands:

```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.
force-pushed from 33f149d to 9ba9478
Is this ready to land now? I'd like to land this to get the third-party update.
According to @ngimel, we're waiting for a branch cut first. Otherwise it should be ready.
…ion fwd, `bfloat16`, conv-bias-activation fusion (#60755)

Summary: #58414, #58859, #58858, #58860, #58861

We're currently testing performance with both "find" and "get" with this PR. CC zasdfgbnm ptrblck ngimel puririshi98

In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag. `USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default. `CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API. A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.

Note that in v7, `bfloat16` convolutions will dispatch to a native PyTorch implementation, but a fully v8-enabled build will dispatch to cuDNN implementations.

Pull Request resolved: #60755
Reviewed By: mruberry
Differential Revision: D34393940
Pulled By: ngimel
fbshipit-source-id: 5c317d3aad63336ea416a51a43cf8b7d27aaca21
Hey @eqy.
…8.cpp

Summary: The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after #60755 landed. This PR conforms the two files with regard to the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

[ghstack-poisoned]
…rch.backends.cudnn.benchmark = True to be limited: cuDNN v8 benchmark limit (#77002)

(Reopening due to a botched merge.) The cuDNN v8 API (main support merged in #60755) potentially exposes many more kernels with `benchmark=True`. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic, and doing so may degrade the user experience by making the first model iteration very slow. To alleviate this issue, this PR introduces `torch.backends.cudnn.benchmark_limit`. `benchmark_limit` specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). `benchmark_limit = 0` yields the current behavior of trying every kernel returned by the heuristic.

CC @ptrblck @ngimel @xwang233
Pull Request resolved: #77002
Approved by: https://github.com/ngimel
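The selection behavior described above can be sketched in plain Python. This is a hypothetical helper for illustration only, not PyTorch's actual benchmarking internals: given a heuristic-ordered list of candidate kernels, try at most `benchmark_limit` working ones, with 0 meaning no limit.

```python
def pick_kernels(candidates, benchmark_limit=10):
    """Illustrative sketch (not PyTorch internals) of the benchmark_limit
    semantics: return the working kernels, in heuristic order, that would
    be benchmarked for a workload.

    candidates: list of (name, works) pairs, ordered by the heuristic,
    where `works` marks whether the kernel can execute this workload.
    benchmark_limit = 0 reproduces the old behavior of trying every
    working kernel the heuristic returns.
    """
    limit = benchmark_limit if benchmark_limit > 0 else len(candidates)
    picked = []
    for name, works in candidates:
        if not works:
            continue  # skip kernels that fail for this workload
        picked.append(name)
        if len(picked) == limit:
            break  # stop once the limit of working kernels is reached
    return picked

candidates = [("k0", True), ("k1", False), ("k2", True), ("k3", True)]
print(pick_kernels(candidates, benchmark_limit=2))  # ['k0', 'k2']
print(pick_kernels(candidates, benchmark_limit=0))  # ['k0', 'k2', 'k3']
```

With the default of 10, workloads whose heuristic returns dozens of candidates only pay the benchmarking cost for the first 10 that work, which is what keeps the first iteration from being very slow.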
… conv op conform with Conv_v8.cpp (#76788)

Summary: Pull Request resolved: #76788. The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after #60755 landed. This PR conforms the two files with regard to the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: D36121582
Reviewed By: jerryzh168
Pulled By: dzdang
fbshipit-source-id: 4d23817a7603cc36af47911d8543b92bdcc26617
#58414, #58859, #58858, #58860, #58861
We're currently testing performance with both "find" and "get" with this PR.
CC @zasdfgbnm @ptrblck @ngimel @puririshi98
In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag. `USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default. `CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API.

A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.

Note that in v7, `bfloat16` convolutions will dispatch to a native PyTorch implementation, but a fully v8-enabled build will dispatch to cuDNN implementations.
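The gating logic described above can be sketched as follows. Only the flag names come from this PR; the helper itself is a hypothetical illustration, not PyTorch's real dispatch code, and `built_with_v8_support` stands in for having compiled with `USE_EXPERIMENTAL_CUDNN_V8_API=1`.

```python
import os

def conv_backend(built_with_v8_support: bool) -> str:
    """Illustrative sketch of the two-level flag gating: the v8 API is
    used only when the build flag AND the runtime flag are both set.
    Returns the name of the API that would handle the convolution.
    """
    runtime_enabled = os.environ.get("CUDNN_V8_API_ENABLED") == "1"
    backend = "cudnn_v8" if (built_with_v8_support and runtime_enabled) else "cudnn_v7"
    if os.environ.get("CUDNN_V8_API_DEBUG") == "1":
        # mimics the debug flag: report which API path was chosen
        print(f"dispatching convolution via {backend}")
    return backend

os.environ["CUDNN_V8_API_ENABLED"] = "1"
print(conv_backend(built_with_v8_support=True))   # cudnn_v8
print(conv_backend(built_with_v8_support=False))  # cudnn_v7: v8 not compiled in
```

The point of the two levels is that a v8-capable build is opt-in at runtime, so shipping `USE_EXPERIMENTAL_CUDNN_V8_API=1` binaries changes nothing for users until they set `CUDNN_V8_API_ENABLED=1`.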