
Segmentation Fault during .backward() #1318

@pltrdy

I've been discussing this on the forum (it started out as a basic question): https://discuss.pytorch.org/t/some-variables-are-not-affected-by-cuda/2079

I was originally working in Jupyter; the notebook is here: https://github.com/pltrdy/pytorchwork/blob/master/rnn_lm.ipynb
For convenience, I also generated the corresponding .py script:
https://github.com/pltrdy/pytorchwork/blob/master/rnn_lm.py

I have a simple model that works as expected on the CPU but raises a segmentation fault when I use .cuda().

With some simple debugging I found that the segfault is triggered at the line loss.backward().
Then, as suggested by @albanD, I ran gdb --args python rnn_lm.py, which gave the following output:

$ gdb --args python rnn_lm.py
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run
Starting program: /home/moses/anaconda3/bin/python rnn_lm.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loading data...
Loaded	929589 training words
	73760 validation words
	82430 test words
Vocabulary: 10000
Creating model...
Using CUDA
[New Thread 0x7fffb198e700 (LWP 17297)]
[New Thread 0x7fffb118d700 (LWP 17298)]
[New Thread 0x7fffb098c700 (LWP 17299)]
Done.
Epoch 1
726.24140625
36
0.0.0
[New Thread 0x7fffaff8a780 (LWP 17309)]
[New Thread 0x7fffafb89800 (LWP 17310)]
[New Thread 0x7fffaf788880 (LWP 17311)]
[New Thread 0x7fffaf387900 (LWP 17312)]
[New Thread 0x7fffaef86980 (LWP 17313)]
[New Thread 0x7fffaeb85a00 (LWP 17314)]
[New Thread 0x7fffae784a80 (LWP 17315)]

[ ... irrelevant output initially in the script ... ]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffad893700 (LWP 17317)]
torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
    this=this@entry=0x7fffad892c40, pos=pos@entry=0, 
    var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
17	torch/csrc/autograd/grad_buffer.cpp: No such file or directory.
(gdb) bt
#0  torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
    this=this@entry=0x7fffad892c40, pos=pos@entry=0, 
    var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
#1  0x00007fffed45c9a1 in torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffedcd7ce0 <engine>, 
    task=...) from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#2  0x00007fffed45dd1a in torch::autograd::Engine::thread_main (this=this@entry=0x7fffedcd7ce0 <engine>, queue=...)
   from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#3  0x00007fffed46e87a in PythonEngine::thread_main (this=0x7fffedcd7ce0 <engine>, queue=...)
   from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#4  0x00007ffff652d870 in ?? () from /home/<user>/anaconda3/bin/../lib/libstdc++.so.6
#5  0x00007ffff7474184 in start_thread (arg=0x7fffad893700) at pthread_create.c:312
#6  0x00007ffff688cbed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
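
A lighter-weight complement to the gdb session above is Python's built-in faulthandler module, which dumps the Python-level traceback when the interpreter receives SIGSEGV. The following is a minimal sketch, assuming a POSIX system. The helper name run_with_faulthandler is hypothetical, and the self-signalling child process is only a stand-in for rnn_lm.py, not the actual script.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run Python in a child process with faulthandler enabled.

    Returns (crashed, stderr_text), where crashed is True if the child
    was killed by SIGSEGV (POSIX reports this as a negative return code).
    """
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    crashed = proc.returncode == -signal.SIGSEGV
    return crashed, proc.stderr


# Stand-in for ["rnn_lm.py"]: a child that sends SIGSEGV to itself.
demo_args = ["-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
crashed, report = run_with_faulthandler(demo_args)

if __name__ == "__main__":
    print("killed by SIGSEGV:", crashed)
    print(report)
```

For the real script the call would be run_with_faulthandler(["rnn_lm.py"]). Since the crash here happens inside a native autograd thread, faulthandler would most likely only confirm which Python line was executing (loss.backward()); the C++ frames still require the gdb backtrace.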

Config

  • Ubuntu 14.04.5 LTS
  • Python Interpreter: Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
  • GPU: single Nvidia 1080, nvidia-smi says:
NVIDIA-SMI 375.39 ; Driver Version: 375.39
  • $nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
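
The configuration details above were gathered by hand. As a rough sketch, the same information could be collected with a short standard-library script; the function names tool_output and collect_config are hypothetical, and the script assumes only that nvidia-smi and nvcc, when installed, are on PATH (missing tools are reported as None).

```python
import platform
import shutil
import subprocess
import sys


def tool_output(cmd):
    """Return a tool's combined output, or None if it is missing or fails to run."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


def collect_config():
    """Gather OS, Python, driver, and CUDA compiler details for a bug report."""
    return {
        "os": platform.platform(),
        "python": sys.version.replace("\n", " "),
        "nvidia-smi": tool_output(["nvidia-smi"]),
        "nvcc": tool_output(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for name, value in collect_config().items():
        print("{}: {}".format(name, value))
```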

Thanks for any help or suggestions.
