nn.LSTM gives nondeterministic results with dropout and multiple layers, OR cuDNN version mismatch #35661

@nimz

🐛 Bug

I get nondeterministic results when I run a model containing an nn.LSTM with dropout > 0 on the GPU, even when all seeds are set and torch.backends.cudnn.deterministic = True, torch.backends.cudnn.benchmark = False. Note that this issue is a near duplicate of #18110.
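For reference, the determinism settings described above can be applied with a helper like the following. This is a sketch, not the gist's exact code; the `seed_everything` name and the guarded imports are my own choices, while the `torch` calls are the standard reproducibility knobs named in the report.

```python
import os
import random

def seed_everything(seed=0):
    """Apply the seeding and cuDNN determinism settings described above.

    Sketch only: torch/numpy are imported lazily so the function also runs
    in environments without them.
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # The two flags from the report:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
    return seed
```

Even with all of these set, the gradient nondeterminism described below still occurs.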

To Reproduce

I have a working example at https://gist.github.com/nimz/7da5db4031c523e61659c4afd443844d that runs a single forward and backward pass of a simple model. It can be run with no arguments. Running it multiple times shows that the forward outputs are always identical, but some of the parameter gradients differ from run to run. This seems to happen only to the lstm.weight_ih_lX parameters.
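To pinpoint which parameters are affected, one can dump the flattened gradients from two runs (e.g. `{n: p.grad.flatten().tolist() for n, p in model.named_parameters()}`) and compare them with a small helper. The helper below is not part of the gist; its name and the example dictionaries are hypothetical.

```python
def diff_keys(grads_a, grads_b, atol=0.0):
    """Return the names of parameters whose flattened gradients differ
    between two runs, compared elementwise within tolerance `atol`."""
    return sorted(
        name
        for name in grads_a
        if any(abs(x - y) > atol for x, y in zip(grads_a[name], grads_b[name]))
    )
```

With gradients captured from two back-to-back runs of the gist, this reports only the `lstm.weight_ih_lX` entries.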

Expected behavior

I would expect back-to-back runs on the same machine to be bitwise identical, but they are not. (This holds whether or not I set CUDA_VISIBLE_DEVICES=0, in case that is helpful.)

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 430.26
cuDNN version: Could not collect [torch.backends.cudnn.version() outputs 7603]

Versions of relevant libraries:
[pip] numpy==1.18.2
[pip] torch==1.4.0
[pip] torchvision==0.5.0
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.5.0 pypi_0 pypi

Additional context

Issue #18110 suggests that this nondeterminism should be fixed in cuDNN 7.6.1, but torch.backends.cudnn.version() reports 7603 (i.e. 7.6.3), and the issue still arises. I did run the user-posted example in #18110, and that script does give me deterministic results. It is also possible that torch.backends.cudnn.version() is incorrect: I may actually be using cuDNN 7.6.2, since my cudnn.h file contains

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 2

However, I do not know whether this is the version PyTorch is actually using. (Nevertheless, version 7.6.2 would still not explain the nondeterminism.)
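For what it's worth, the integer that torch.backends.cudnn.version() returns can be decoded into major/minor/patch components. The decoding below assumes the pre-cuDNN-9 encoding MAJOR*1000 + MINOR*100 + PATCHLEVEL, which matches the 7603 reported here.

```python
def decode_cudnn_version(v):
    """Split a cuDNN version integer (MAJOR*1000 + MINOR*100 + PATCHLEVEL,
    the pre-9.x encoding) into its components."""
    return v // 1000, (v % 1000) // 100, v % 100
```

Under that encoding, 7603 decodes to 7.6.3 and a 7.6.2 install would report 7602, so the reported value and the cudnn.h contents genuinely disagree.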

cc @ngimel @csarofeen @ptrblck
