
[cuDNN v8 API] cuDNN benchmark, convolution bwd / transposed convolution fwd, bfloat16, conv-bias-activation fusion#60755

Closed
eqy wants to merge 86 commits into pytorch:master from eqy:cudnn3

Conversation

@eqy
Collaborator

@eqy eqy commented Jun 25, 2021

#58414, #58859, #58858, #58860, #58861

In this PR we're currently testing performance with both "find" and "get".

CC @zasdfgbnm @ptrblck @ngimel @puririshi98

In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag.
`USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default.
`CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API.
A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.

Note that in v7, `bfloat16` convolutions dispatch to a native PyTorch implementation, whereas a fully v8-enabled build dispatches them to cuDNN.
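As a rough, purely illustrative sketch of how the two flags interact (this is not the actual PyTorch dispatch code; the function and backend names below are hypothetical):

```python
import os

def choose_conv_backend(built_with_v8: bool, dtype: str) -> str:
    # Hypothetical sketch: the real dispatch lives in ATen's C++ code.
    # The v8 path is taken only when BOTH the build flag and the runtime flag are set.
    runtime_v8 = os.environ.get("CUDNN_V8_API_ENABLED") == "1"
    if built_with_v8 and runtime_v8:
        return "cudnn_v8"   # v8 API also covers bfloat16 and conv-bias-activation fusion
    if dtype == "bfloat16":
        return "native"     # v7 path falls back to a native PyTorch implementation
    return "cudnn_v7"

os.environ["CUDNN_V8_API_ENABLED"] = "1"
print(choose_conv_backend(built_with_v8=True, dtype="bfloat16"))   # cudnn_v8
print(choose_conv_backend(built_with_v8=False, dtype="bfloat16"))  # native
```

In the real build, `CUDNN_V8_API_DEBUG=1` is what reports which of these paths was actually taken.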

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 25, 2021

💊 CI failures summary and remediations

As of commit 15bcf38 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@ngimel ngimel self-requested a review June 25, 2021 21:25
@ngimel ngimel added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 25, 2021
@eqy eqy force-pushed the cudnn3 branch 2 times, most recently from 09ad8be to 13fa945 Compare June 30, 2021 20:44
@eqy eqy changed the title [cuDNN v8 API] cuDNN benchmark, convolution bwd / transposed convolution fwd, bfloat16 [cuDNN v8 API] cuDNN benchmark, convolution bwd / transposed convolution fwd, bfloat16, conv-bias-activation fusion Jul 29, 2021
@eqy eqy force-pushed the cudnn3 branch 2 times, most recently from 00ee436 to 252e381 Compare July 29, 2021 23:11
@eqy eqy force-pushed the cudnn3 branch 3 times, most recently from a09ffdf to b8e262b Compare August 25, 2021 23:56
@pytorch-probot

pytorch-probot bot commented Oct 5, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/eqy/pytorch/blob/12edeb093dd5a8f7a0ee083d3100b105121a38c5/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows / Labels (bold = enabled) / Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:

```
# ciflow rerun; "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to
# adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@jerryzh168
Contributor

jerryzh168 commented Jan 28, 2022

Is this ready to land now? We would like to land this to get the third-party update.

@eqy
Collaborator Author

eqy commented Jan 28, 2022

> Is this ready to land now? We would like to land this to get the third-party update.

According to @ngimel, we're waiting for a branch cut first. Otherwise it should be ready.

facebook-github-bot pushed a commit that referenced this pull request Mar 2, 2022
…ion fwd, `bfloat16`, conv-bias-activation fusion (#60755)

Summary:
#58414, #58859, #58858, #58860, #58861

In this PR we're currently testing performance with both "find" and "get".

CC zasdfgbnm ptrblck ngimel puririshi98

In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag.
`USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default.
`CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API.
A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.

Note that in v7, `bfloat16` convolutions dispatch to a native PyTorch implementation, whereas a fully v8-enabled build dispatches them to cuDNN.

Pull Request resolved: #60755

Reviewed By: mruberry

Differential Revision: D34393940

Pulled By: ngimel

fbshipit-source-id: 5c317d3aad63336ea416a51a43cf8b7d27aaca21
@github-actions
Contributor

github-actions bot commented Mar 2, 2022

Hey @eqy.
You've committed this PR, but it does not have both a 'release notes: ...' and a 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc.) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc.).
For changes that are 'topic: not user facing' there is no need for a release notes label.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
…ion fwd, `bfloat16`, conv-bias-activation fusion (#60755)

Summary:
pytorch/pytorch#58414, pytorch/pytorch#58859, pytorch/pytorch#58858, pytorch/pytorch#58860, pytorch/pytorch#58861

In this PR we're currently testing performance with both "find" and "get".

CC zasdfgbnm ptrblck ngimel puririshi98

In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag.
`USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default.
`CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API.
A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.

Note that in v7, `bfloat16` convolutions dispatch to a native PyTorch implementation, whereas a fully v8-enabled build dispatches them to cuDNN.

Pull Request resolved: pytorch/pytorch#60755

Reviewed By: mruberry

Differential Revision: D34393940

Pulled By: ngimel

fbshipit-source-id: 5c317d3aad63336ea416a51a43cf8b7d27aaca21
(cherry picked from commit 3bfc549ce57cee691f83dc894ac7adb4b7882459)
dzdang added a commit that referenced this pull request May 4, 2022
…8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

[ghstack-poisoned]
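The workspace/plan pairing this commit talks about can be illustrated with a small cache sketch (hypothetical Python, not the actual Conv_v8.cpp code): the idea is that a chosen execution plan and the workspace size it requires are stored and reused together, keyed by the convolution parameters.

```python
class PlanCache:
    """Illustrative sketch: cache each execution plan together with the
    workspace size it needs, so repeated convolutions with the same
    parameters reuse a consistent (plan, workspace) pair."""

    def __init__(self):
        self._cache = {}

    def get_or_build(self, key, build_plan):
        # build_plan is called at most once per key; later lookups hit the cache.
        if key not in self._cache:
            self._cache[key] = build_plan(key)
        return self._cache[key]

cache = PlanCache()
build = lambda key: (f"plan_for_{key}", 1 << 20)  # (plan handle, workspace bytes)
plan, workspace_bytes = cache.get_or_build(("conv2d", 64, 3, 3), build)
```

Keeping the plan and its workspace in one entry avoids the kind of divergence the commit message describes, where one file sizes the workspace differently from the plan it runs.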
dzdang added a commit that referenced this pull request May 4, 2022
…8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

ghstack-source-id: 01200fa
Pull Request resolved: #76788
dzdang added a commit that referenced this pull request May 11, 2022
…8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

ghstack-source-id: 67f07b4
Pull Request resolved: #76788
dzdang added a commit that referenced this pull request May 11, 2022
…and run conform with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
dzdang added a commit that referenced this pull request May 11, 2022
…with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
dzdang added a commit that referenced this pull request May 16, 2022
…and run for quantized cudnn conv op conform with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
dzdang added a commit that referenced this pull request May 16, 2022
… conv op conform with Conv_v8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

ghstack-source-id: 6ba2b1d
Pull Request resolved: #76788
dzdang added a commit that referenced this pull request May 16, 2022
…tized cudnn conv op conform with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 24, 2022
…rch.backends.cudnn.benchmark = True to be limited (Cudnnv8 benchmark limit) (#77002)

(reopening due to a botched merge)
The cuDNN V8 API (main support merged in #60755) potentially exposes many more kernels with `benchmark=True`. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic, and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces `torch.backends.cudnn.benchmark_limit`. `benchmark_limit` specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). `benchmark_limit = 0` yields the current behavior of trying every kernel returned by the heuristic.

CC @ptrblck @ngimel @xwang233
Pull Request resolved: #77002
Approved by: https://github.com/ngimel
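The `benchmark_limit` semantics described in that commit can be sketched as follows (illustrative only; the actual logic lives in the C++ cuDNN frontend integration, and the function name here is hypothetical):

```python
def plans_to_try(heuristic_plans, benchmark_limit=10):
    # benchmark_limit caps how many candidate kernels are benchmarked;
    # 0 restores the old behavior of trying everything the heuristic returned.
    if benchmark_limit == 0:
        return list(heuristic_plans)
    return list(heuristic_plans)[:benchmark_limit]

candidates = [f"plan_{i}" for i in range(25)]
print(len(plans_to_try(candidates)))       # default cap of 10
print(len(plans_to_try(candidates, 0)))    # all 25, as before
```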
dzdang added a commit that referenced this pull request May 24, 2022
…and run for quantized cudnn conv op conform with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
dzdang added a commit that referenced this pull request May 24, 2022
…tized cudnn conv op conform with Conv_v8.cpp"

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: [D36121582](https://our.internmc.facebook.com/intern/diff/D36121582)

[ghstack-poisoned]
dzdang added a commit that referenced this pull request May 24, 2022
… conv op conform with Conv_v8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

ghstack-source-id: 141c0db
Pull Request resolved: #76788
pytorchmergebot pushed a commit that referenced this pull request May 25, 2022
… conv op conform with Conv_v8.cpp

Summary:
The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Pull Request resolved: #76788

Approved by: https://github.com/jerryzh168
facebook-github-bot pushed a commit that referenced this pull request May 25, 2022
… conv op conform with Conv_v8.cpp (#76788)

Summary:
Pull Request resolved: #76788

The workspace and plan in Conv_v8.cpp diverged from Conv.cpp after
#60755 landed. This PR brings
the two files into agreement on the workspace and plan.

Test Plan:
```
python test/test_quantization.py -k test_qconv2d_cudnn
```

Differential Revision: D36121582

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 4d23817a7603cc36af47911d8543b92bdcc26617
swang392 pushed a commit that referenced this pull request May 25, 2022
…rch.backends.cudnn.benchmark = True to be limited (Cudnnv8 benchmark limit) (#77002)

(reopening due to a botched merge)
The cuDNN V8 API (main support merged in #60755) potentially exposes many more kernels with `benchmark=True`. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic, and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces `torch.backends.cudnn.benchmark_limit`. `benchmark_limit` specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). `benchmark_limit = 0` yields the current behavior of trying every kernel returned by the heuristic.

CC @ptrblck @ngimel @xwang233
Pull Request resolved: #77002
Approved by: https://github.com/ngimel

Labels

cla signed, open source, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants