I've been discussing this (originally as a dummy question) on the forum: https://discuss.pytorch.org/t/some-variables-are-not-affected-by-cuda/2079
I was originally working in Jupyter; the notebook is here: https://github.com/pltrdy/pytorchwork/blob/master/rnn_lm.ipynb
For convenience I also generated the corresponding .py script:
https://github.com/pltrdy/pytorchwork/blob/master/rnn_lm.py
I have a simple model that works as expected on CPU, but raises a segmentation fault when I use .cuda().
From some simple debugging I found that the segfault is triggered at the line loss.backward().
Then, as suggested by @albanD, I ran gdb --args python rnn_lm.py, which gave the following output:
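For reference, the failing pattern is a standard RNN language-model training step. Below is a minimal, hypothetical sketch of that step (written against the current PyTorch API, with made-up sizes; this is not the actual notebook code) which runs fine on CPU. Per the report, the same sequence crashes at loss.backward() once the model and tensors are moved to the GPU with .cuda():

```python
import torch
import torch.nn as nn

# All sizes below are hypothetical, not the notebook's actual configuration.
vocab_size, embed_dim, hidden_dim, batch, seq_len = 100, 16, 32, 4, 5

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
decoder = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss()

# On GPU, this is where everything would be moved with .cuda(), e.g.
#   embed.cuda(); rnn.cuda(); decoder.cuda()
# and likewise for the input, target, and hidden tensors below.

inputs = torch.randint(0, vocab_size, (batch, seq_len))
targets = torch.randint(0, vocab_size, (batch, seq_len))
hidden = torch.zeros(1, batch, hidden_dim)  # (num_layers, batch, hidden)

output, hidden = rnn(embed(inputs), hidden)
loss = criterion(decoder(output).view(-1, vocab_size), targets.view(-1))
loss.backward()  # works on CPU; the report's segfault fires here under CUDA
loss_val = loss.item()
```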
$ gdb --args python rnn_lm.py
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run
Starting program: /home/moses/anaconda3/bin/python rnn_lm.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loading data...
Loaded 929589 training words
73760 validation words
82430 test words
Vocabulary: 10000
Creating model...
Using CUDA
[New Thread 0x7fffb198e700 (LWP 17297)]
[New Thread 0x7fffb118d700 (LWP 17298)]
[New Thread 0x7fffb098c700 (LWP 17299)]
Done.
Epoch 1
726.24140625
36
0.0.0
[New Thread 0x7fffaff8a780 (LWP 17309)]
[New Thread 0x7fffafb89800 (LWP 17310)]
[New Thread 0x7fffaf788880 (LWP 17311)]
[New Thread 0x7fffaf387900 (LWP 17312)]
[New Thread 0x7fffaef86980 (LWP 17313)]
[New Thread 0x7fffaeb85a00 (LWP 17314)]
[New Thread 0x7fffae784a80 (LWP 17315)]
[ ... irrelevant output initially in the script ... ]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffad893700 (LWP 17317)]
torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
this=this@entry=0x7fffad892c40, pos=pos@entry=0,
var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
17 torch/csrc/autograd/grad_buffer.cpp: No such file or directory.
(gdb) bt
#0 torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
this=this@entry=0x7fffad892c40, pos=pos@entry=0,
var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
#1 0x00007fffed45c9a1 in torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffedcd7ce0 <engine>,
task=...) from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#2 0x00007fffed45dd1a in torch::autograd::Engine::thread_main (this=this@entry=0x7fffedcd7ce0 <engine>, queue=...)
from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#3 0x00007fffed46e87a in PythonEngine::thread_main (this=0x7fffedcd7ce0 <engine>, queue=...)
from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#4 0x00007ffff652d870 in ?? () from /home/<user>/anaconda3/bin/../lib/libstdc++.so.6
#5 0x00007ffff7474184 in start_thread (arg=0x7fffad893700) at pthread_create.c:312
#6 0x00007ffff688cbed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
Config
- OS: Ubuntu 14.04.5 LTS
- Python interpreter: Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
- GPU: single Nvidia GTX 1080
- nvidia-smi: NVIDIA-SMI 375.39 ; Driver Version: 375.39
- nvcc --version: Cuda compilation tools, release 8.0, V8.0.61 (built Tue_Jan_10_13:22:03_CST_2017)
Thanks for any help or suggestions about it.
pltrdy