🐛 Bug
I get nondeterministic results when running a model containing an `nn.LSTM` with `dropout > 0` on the GPU, even when all seeds are set, `torch.backends.cudnn.deterministic = True`, and `torch.backends.cudnn.benchmark = False`. Note that this issue is a near duplicate of #18110.
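For reference, the seeding and cuDNN settings described above were applied along these lines (the helper name `seed_everything` is my own shorthand, not taken from the gist):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed every RNG the script touches (hypothetical helper, for illustration)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```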
To Reproduce
I have a working example at https://gist.github.com/nimz/7da5db4031c523e61659c4afd443844d, which runs a single forward and backward pass of a simple model. It can be run with no arguments. Running it multiple times shows that the forward outputs are always identical, but some of the parameter gradients differ from run to run. This appears to affect only the `lstm.weight_ih_lX` parameters.
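The gist is the authoritative repro; a minimal sketch of the same pattern (my own simplification, with made-up sizes, not the gist's exact model) looks roughly like this:

```python
import torch
import torch.nn as nn


def one_pass(seed: int, device: str = "cpu") -> dict:
    """Run one seeded forward/backward pass and return a name -> gradient dict."""
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # dropout only applies between LSTM layers, so num_layers must be > 1
    lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, dropout=0.5).to(device)
    x = torch.randn(5, 3, 8, device=device)  # (seq_len, batch, input_size)
    out, _ = lstm(x)
    out.sum().backward()
    return {name: p.grad.clone() for name, p in lstm.named_parameters()}


# With device="cuda" the lstm.weight_ih_l* gradients differ between seeded
# runs (the bug reported here); on the CPU the two runs match exactly.
g1 = one_pass(0)
g2 = one_pass(0)
```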
Expected behavior
I would expect back-to-back runs on the same machine to produce identical results, but they do not. (This holds whether or not I set CUDA_VISIBLE_DEVICES=0, if that is helpful.)
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
Nvidia driver version: 430.26
cuDNN version: Could not collect [torch.backends.cudnn.version() outputs 7603]
Versions of relevant libraries:
[pip] numpy==1.18.2
[pip] torch==1.4.0
[pip] torchvision==0.5.0
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.5.0 pypi_0 pypi
Additional context
Issue #18110 suggests that this nondeterminism should be fixed as of cuDNN 7.6.1, but `torch.backends.cudnn.version()` reports 7603 (i.e. 7.6.3), and the issue still arises. I did run the user-posted example from #18110, and that script does give me deterministic results. It is also possible that `torch.backends.cudnn.version()` is incorrect: I may actually be using cuDNN 7.6.2, since my cudnn.h file contains
```c
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 2
```
However, I do not know whether this is the version PyTorch is actually using. (Nevertheless, version 7.6.2 would still not explain the nondeterminism.)
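For what it's worth, cuDNN 7.x encodes its version as `MAJOR*1000 + MINOR*100 + PATCHLEVEL` (that is how cudnn.h builds `CUDNN_VERSION`), so a report of 7603 decodes to 7.6.3 rather than 7.6.2:

```python
def decode_cudnn_version(v: int) -> tuple:
    """Decode the integer returned by torch.backends.cudnn.version()
    under the cuDNN 7.x encoding MAJOR*1000 + MINOR*100 + PATCHLEVEL."""
    return v // 1000, (v % 1000) // 100, v % 100


# 7603 -> (7, 6, 3), i.e. cuDNN 7.6.3
major, minor, patch = decode_cudnn_version(7603)
```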
cc @ngimel @csarofeen @ptrblck