🐛 Bug
I get nondeterministic results when running a model containing an `nn.LSTM` with `dropout > 0` on the GPU, even when all seeds are set, `torch.backends.cudnn.deterministic = True`, and `torch.backends.cudnn.benchmark = False`. Note that this issue is a near duplicate of #18110.
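For reference, the seeding and cuDNN settings described above were applied along these lines (the helper name `seed_everything` is my own shorthand, not taken from the gist):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed every RNG the script touches (hypothetical helper, for illustration)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```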
To Reproduce
I have a working example at https://gist.github.com/nimz/7da5db4031c523e61659c4afd443844d, which runs a single forward and backward pass of a simple model. It can be run with no arguments. Running it multiple times shows that the forward outputs are always identical, but some of the parameter gradients differ from run to run. This appears to affect only the `lstm.weight_ih_lX` parameters.
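The gist is the authoritative repro; a minimal sketch of the same pattern (my own simplification, with made-up sizes, not the gist's exact model) looks roughly like this:

```python
import torch
import torch.nn as nn


def one_pass(seed: int, device: str = "cpu") -> dict:
    """Run one seeded forward/backward pass and return a name -> gradient dict."""
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # dropout only applies between LSTM layers, so num_layers must be > 1
    lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, dropout=0.5).to(device)
    x = torch.randn(5, 3, 8, device=device)  # (seq_len, batch, input_size)
    out, _ = lstm(x)
    out.sum().backward()
    return {name: p.grad.clone() for name, p in lstm.named_parameters()}


# With device="cuda" the lstm.weight_ih_l* gradients differ between seeded
# runs (the bug reported here); on the CPU the two runs match exactly.
g1 = one_pass(0)
g2 = one_pass(0)
```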
Expected behavior
I would expect back-to-back runs on the same machine to produce identical results, but they do not. (This holds whether or not I set CUDA_VISIBLE_DEVICES=0, if that is helpful.)
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
Nvidia driver version: 430.26
cuDNN version: Could not collect [torch.backends.cudnn.version() outputs 7603]
Versions of relevant libraries:
[pip] numpy==1.18.2
[pip] torch==1.4.0
[pip] torchvision==0.5.0
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.5.0 pypi_0 pypi
Additional context
Issue #18110 suggests that this nondeterminism should be fixed as of cuDNN 7.6.1, but `torch.backends.cudnn.version()` reports 7603 (i.e. 7.6.3), and the issue still arises. I did run the user-posted example from #18110, and that script does give me deterministic results. It is also possible that `torch.backends.cudnn.version()` is incorrect: I may actually be using cuDNN 7.6.2, since my cudnn.h file contains
```c
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 2
```
However, I do not know whether this is the version PyTorch is actually using. (Nevertheless, version 7.6.2 would still not explain the nondeterminism.)
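For what it's worth, cuDNN 7.x encodes its version as `MAJOR*1000 + MINOR*100 + PATCHLEVEL` (that is how cudnn.h builds `CUDNN_VERSION`), so a report of 7603 decodes to 7.6.3 rather than 7.6.2:

```python
def decode_cudnn_version(v: int) -> tuple:
    """Decode the integer returned by torch.backends.cudnn.version()
    under the cuDNN 7.x encoding MAJOR*1000 + MINOR*100 + PATCHLEVEL."""
    return v // 1000, (v % 1000) // 100, v % 100


# 7603 -> (7, 6, 3), i.e. cuDNN 7.6.3
major, minor, patch = decode_cudnn_version(7603)
```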
cc @ngimel @csarofeen @ptrblck