
"train.py" crush when using flag --use_mixture_loss #4

@BarRozenman

Description


I run train.py as follows:

CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss

and I get:

>> CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \

./trainer_1stage.py not exist!
copy ./networks/depth_decoder.py -> ./log/ResNet/exp1/depth_decoder.py
copy ./train_ResNet.sh -> ./log/ResNet/exp1/train_ResNet.sh
train ResNet
use 49 xy planes, 14 xz planes and 0 yz planes.
use DenseAspp Block
use mixture Lap loss
use plane residual
Training model named:
   exp1
Models and tensorboard events files are saved to:
   ./log/ResNet
Training is using:
   cuda
Using split:
   eigen_full_left
There are 22600 training items and 1776 validation items

Training
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/home/bar/projects/PlaneDepth/train.py", line 21, in <module>
    trainer.train()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 248, in train
    self.run_epoch()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 300, in run_epoch
    losses["loss/total_loss"].backward()
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27588) of binary: /home/bar/miniconda3/envs/planedepth/bin/python
Traceback (most recent call last):
  File "/home/bar/miniconda3/envs/planedepth/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-24_18:06:29
  host      : clikaws105
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
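One note on reading this trace (not part of the original log, just a debugging sketch): CUDA kernel launches are asynchronous, so an error raised from loss.backward() can actually originate in an earlier op. Re-running with the standard CUDA_LAUNCH_BLOCKING environment variable set makes launches synchronous, so the traceback points at the op that really failed:

```shell
# Make CUDA kernel launches synchronous so the Python traceback points at the
# op that actually failed, instead of surfacing later inside backward().
export CUDA_LAUNCH_BLOCKING=1
# Then re-run the same command, e.g.:
#   CUDA_VISIBLE_DEVICES=0 torchrun train.py --png --model_name exp1 \
#       --use_denseaspp --plane_residual --flip_right --use_mixture_loss
echo "$CUDA_LAUNCH_BLOCKING"
```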

Any other configuration of train.py I use without --use_mixture_loss runs smoothly.
For example, the command below runs fine:

CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right

Can anyone please help me fix this?
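For what it's worth, a hypothetical guard (my own sketch, not code from this repo; safe_backward is an invented name) could help distinguish a non-finite mixture-loss value from a genuine cuDNN failure, since a NaN/inf loss can also surface as an opaque backend error during the backward pass:

```python
import torch

def safe_backward(loss: torch.Tensor) -> None:
    # Fail fast with a readable message if the loss is NaN/inf, before the
    # backward pass can turn it into an opaque backend error.
    if not torch.isfinite(loss):
        raise ValueError(f"non-finite loss before backward: {loss.item()}")
    loss.backward()

# Usage sketch on CPU, so it runs without a GPU:
x = torch.tensor(2.0, requires_grad=True)
safe_backward((x - 1.0) ** 2)
print(x.grad)  # tensor(2.)
```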
