[cuDNN V8 API] (reopen) Allow the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited #77002
Conversation
✅ No Failures (0 Pending) as of commit 753ae22 (more details on the Dr. CI page). 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI.
Force-pushed from ec25620 to 5cb2a63
ngimel left a comment:
This looks good, left small comments.
How many cuDNN-specific params do we anticipate? Has the time come to have a dedicated cuDNN config in the global context, instead of querying benchmark/deterministic/tf32/benchmark_limit one by one? (No action required on this PR, but something to think about for the future.)
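A grouped config along the lines ngimel suggests could look something like this minimal sketch. The `CudnnConfig` name, the `get_cudnn_config` accessor, and the defaults are hypothetical illustrations, not PyTorch API:

```python
# Hypothetical sketch: grouping the cuDNN knobs in one config object
# instead of querying benchmark/deterministic/tf32/benchmark_limit
# one by one from the global context.
from dataclasses import dataclass


@dataclass
class CudnnConfig:
    benchmark: bool = False
    deterministic: bool = False
    allow_tf32: bool = True
    benchmark_limit: int = 10  # 0 means "try every kernel the heuristic returns"


def get_cudnn_config() -> CudnnConfig:
    # A single context-level accessor replaces four separate getters.
    return CudnnConfig()


cfg = get_cudnn_config()
print(cfg.benchmark_limit)  # -> 10
```

One possible upside of this shape is that new cuDNN knobs become a field on one struct rather than another getter/setter pair threaded through the context.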
torch/csrc/Module.cpp (Outdated)
Should you warn unconditionally here? This can only be the result of the user calling set_benchmark_limit, so whatever value is passed, ROCm should warn.
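The behavior being asked for — warn no matter what value was passed, because reaching the setter at all means the user explicitly called it — can be sketched in plain Python. `HAS_CUDNN_V8` and the module-level `_benchmark_limit` are hypothetical stand-ins for the real build check and C++ state:

```python
import warnings

HAS_CUDNN_V8 = False  # stand-in for "this build has no cuDNN V8 (e.g. ROCm)"
_benchmark_limit = 10  # stand-in for the real backend-side setting


def set_benchmark_limit(limit: int) -> None:
    global _benchmark_limit
    if not HAS_CUDNN_V8:
        # Warn unconditionally: even limit=0 (the "try everything" value)
        # was an explicit user request, so it still merits a warning.
        warnings.warn("cuDNN benchmark_limit has no effect on this build")
        return
    _benchmark_limit = limit
```

With this shape, `set_benchmark_limit(0)` warns just like any other value, which matches the review comment.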
torch/csrc/Module.cpp (Outdated)
Again, unconditional warning?
Resolve conflicts please?
Force-pushed from 86ee006 to 5cc58ae
Have you checked the CI failures @eqy? I can't right now, because GitHub is showing unicorns.
They looked spurious (e.g., JIT tests), but I will rebase to check (I also couldn't identify the cause of the docs failure).
Force-pushed from 5cc58ae to 185061e
Mea culpa, it looks like some merge garbage was left in the docs by accident; hopefully fixed now.
@pytorchbot merge this |
Merge failed due to: Matched rule superuser, but PR has not been reviewed yet
@pytorchbot merge this |
Hey @eqy. |
…under torch.backends.cudnn.benchmark = True to be limited (#77002)" This reverts commit c274f2a. Reverted #77002 on behalf of https://github.com/malfet because it breaks internal CI; also, no CUDA headers should be included from `torch/csrc/Module.cpp`; they should instead be implemented/registered in `torch/csrc/cuda/Module.cpp`.
This can potentially catch issues like #77002 before they hit Meta-internal CI, because Bazel is more strict about header visibility than (our current setup of) CMake. [ghstack-poisoned]
…rch.backends.cudnn.benchmark = True to be limited (#77002)

(reopening due to botched merge)

The cuDNN V8 API (main support merged in #60755) potentially exposes many more kernels with benchmark=True. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces torch.backends.cudnn.benchmark_limit. benchmark_limit specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). benchmark_limit = 0 yields the current behavior of trying every kernel returned by the heuristic.

CC @ptrblck @ngimel @xwang233

Pull Request resolved: #77002 Approved by: https://github.com/ngimel
…h Bazel in CI" This can potentially catch issues like #77002 before they hit Meta-internal CI, because Bazel is more strict about header visibility than (our current setup of) CMake. [ghstack-poisoned]
This can potentially catch issues like #77002 before they hit Meta-internal CI, because Bazel is more strict about header visibility than (our current setup of) CMake. Pull Request resolved: #78221 Approved by: https://github.com/malfet
Summary: This can potentially catch issues like #77002 before they hit Meta-internal CI, because Bazel is more strict about header visibility than (our current setup of) CMake. Pull Request resolved: #78221 Approved by: https://github.com/malfet Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8449ac770c29a52257a3f2842c1dfd06d2c2a497 Reviewed By: mehtanirav Differential Revision: D36707271 Pulled By: dreiss fbshipit-source-id: d9c3d17e9227b7660a0819e9894767035d10863f
This should prevent failures like #77002 from sneaking in, as CUDAConfig.h would no longer be available for CPU builds. The note from 2018 about MIOpen builds does not seem relevant, though CUDAConfig.h is still needed by ROCm (tested in https://github.com/pytorch/pytorch/runs/6613660811?check_suite_focus=true build) Pull Request resolved: #78218 Approved by: https://github.com/seemethere, https://github.com/atalman
Summary: This should prevent failures like #77002 from sneaking in, as CUDAConfig.h would no longer be available for CPU builds. The note from 2018 about MIOpen builds does not seem relevant, though CUDAConfig.h is still needed by ROCm (tested in https://github.com/pytorch/pytorch/runs/6613660811?check_suite_focus=true build) Pull Request resolved: #78218 Approved by: https://github.com/seemethere, https://github.com/atalman Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/dfd78bf4ab6ce42a85458553724556995fa0b4a4 Reviewed By: seemethere Differential Revision: D36783414 Pulled By: malfet fbshipit-source-id: 7277a395772c4dbdc6fc2f55a6e9b594f724b955
…torch.backends.cudnn.benchmark = True to be limited (#78299) Summary: Reopen of #77002 to address comments by malfet CC ngimel ptrblck Pull Request resolved: #78299 Approved by: https://github.com/ngimel Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ae6dd20ba7725ea7c10d759f82891d86eb724c11 Reviewed By: mehtanirav, cpuhrsch Differential Revision: D37690491 fbshipit-source-id: 6c603a706663dcc7086755a5d4b6e382233dad7b
(reopening due to botched merge)
The cuDNN V8 API (main support merged in #60755) potentially exposes many more kernels with benchmark=True. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces torch.backends.cudnn.benchmark_limit. benchmark_limit specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). benchmark_limit = 0 yields the current behavior of trying every kernel returned by the heuristic.
CC @ptrblck @ngimel @xwang233
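The benchmark_limit semantics described above can be modeled in plain Python. This is an illustrative sketch, not the actual cuDNN V8 frontend code; `kernels_to_profile` and `works` are hypothetical names standing in for the heuristic's candidate list and the "does this kernel actually build and run" check:

```python
# Model of benchmark_limit: profile at most `benchmark_limit` *working*
# kernels from the heuristic's candidates; 0 means try them all.
def kernels_to_profile(candidates, works, benchmark_limit=10):
    selected = []
    for kernel in candidates:
        if benchmark_limit and len(selected) >= benchmark_limit:
            break
        if works(kernel):  # non-working kernels don't count toward the limit
            selected.append(kernel)
    return selected


candidates = list(range(25))
works = lambda k: k % 2 == 0  # pretend odd-numbered kernels fail to build

print(len(kernels_to_profile(candidates, works)))      # -> 10 (default cap)
print(len(kernels_to_profile(candidates, works, 0)))   # -> 13 (all working ones)
```

Note that the cap counts only working kernels, which matches the PR's wording ("maximum number of working cuDNN kernels to try"), and that a limit of 0 is falsy, so the loop never breaks early.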