-
Notifications
You must be signed in to change notification settings - Fork 27.2k
Closed
Labels
Description
I'm running Ubuntu 16.04.3 on an AWS P3.2xlarge, with the NVIDIA libcuda1-384 package, version 384.111-0ubuntu0.16.04.1, and pytorch version 0.3.0.post4 and cuda90, installed via conda.
In some circumstances, the CUDA multinomial sampler with replacement=True can sample an event with zero probability.
The script below reproduces this. I haven't tried to reduce the distribution to a minimal case. The output is below the script.
I've attached the RNG state just prior to the failure. Reproduction from this state is demonstrated in the script.
Bests regards,
Alex
import torch as T
import os
freqs = T.cuda.FloatTensor([
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.03178183361887932, 0.027680952101945877, 0.033176131546497345,
0.046052902936935425, 0.07742464542388916, 0.11543981730937958,
0.14148041605949402, 0.15784293413162231, 0.13180233538150787,
0.08271478116512299, 0.049702685326337814, 0.027557924389839172,
0.018125897273421288, 0.011851548217236996, 0.010252203792333603,
0.007422595750540495, 0.005372154992073774, 0.0045109698548913,
0.0036087757907807827, 0.0035267581697553396, 0.0018864056328311563,
0.0024605290964245796, 0.0022964938543736935, 0.0018453967059031129,
0.0010662291897460818, 0.0009842115687206388, 0.00045109697384759784,
0.0007791675161570311, 0.00020504408166743815, 0.00020504408166743815,
0.00020504408166743815, 0.00012302644609007984, 0.0,
0.00012302644609007984, 4.100881778867915e-05, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0])
for i in range(1_000_000_000):
state = T.cuda.get_rng_state()
sample = T.multinomial(freqs, 1000, True)
if freqs[sample].min() == 0:
break
print(72*'-')
print(f'failure after {i} iterations')
sample_idx = (freqs[sample] == 0).nonzero()[0][0]
sampled = sample[sample_idx]
print(f'{sample_idx}th element of last sample was {sampled}, '
f'which has probability {freqs[sampled]}')
print(72*'-')
SAVE_PATH = 'failed_state.pt'
T.save(state, open(SAVE_PATH, 'wb'))
print(f'\nRNG state saved to `{SAVE_PATH}`')
T.cuda.set_rng_state(T.load(open(SAVE_PATH, 'rb')))
assert freqs[T.multinomial(freqs, 1000, True)].min() == 0
print(72*'-')
print(f'\npytorch version: {T.__version__}\n')
print(72*'-')
print('Operating system:')
os.system('lsb_release -a')
print(72*'-')
print('\nGPU information:')
os.system('nvidia-smi')
print(72*'-')
print('\nCUDA library information:')
os.system('dpkg -l | grep -i cuda')ubuntu@ip-172-30-1-60:~/workdir$ python -i fail.py
------------------------------------------------------------------------
failure after 4309 iterations
631th element of last sample was 0, which has probability 0.0
------------------------------------------------------------------------
RNG state saved to `failed_state.pt`
------------------------------------------------------------------------
pytorch version: 0.3.0.post4
------------------------------------------------------------------------
Operating system:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
------------------------------------------------------------------------
GPU information:
Thu Jan 25 21:17:03 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 38C P0 46W / 300W | 4558MiB / 16152MiB | 31% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 988 C python 738MiB |
| 0 15193 C python 3810MiB |
+-----------------------------------------------------------------------------+
------------------------------------------------------------------------
CUDA library information:
ii libcuda1-384 384.111-0ubuntu0.16.04.1 amd64 NVIDIA CUDA runtime library
Reactions are currently unavailable