
"train.py" crush when using flag --use_mixture_loss #4

@BarRozenman

Description


I run train.py as follows:

CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss

and I get:

>> CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \

./trainer_1stage.py not exist!
copy ./networks/depth_decoder.py -> ./log/ResNet/exp1/depth_decoder.py
copy ./train_ResNet.sh -> ./log/ResNet/exp1/train_ResNet.sh
train ResNet
use 49 xy planes, 14 xz planes and 0 yz planes.
use DenseAspp Block
use mixture Lap loss
use plane residual
Training model named:
   exp1
Models and tensorboard events files are saved to:
   ./log/ResNet
Training is using:
   cuda
Using split:
   eigen_full_left
There are 22600 training items and 1776 validation items

Training
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/home/bar/projects/PlaneDepth/train.py", line 21, in <module>
    trainer.train()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 248, in train
    self.run_epoch()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 300, in run_epoch
    losses["loss/total_loss"].backward()
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27588) of binary: /home/bar/miniconda3/envs/planedepth/bin/python
Traceback (most recent call last):
  File "/home/bar/miniconda3/envs/planedepth/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-24_18:06:29
  host      : clikaws105
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
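One note on reading this trace (not part of the original log, just a debugging sketch): CUDA kernel launches are asynchronous, so an error raised from loss.backward() can actually originate in an earlier op. Re-running with the standard CUDA_LAUNCH_BLOCKING environment variable set makes launches synchronous, so the traceback points at the op that really failed:

```shell
# Make CUDA kernel launches synchronous so the Python traceback points at the
# op that actually failed, instead of surfacing later inside backward().
export CUDA_LAUNCH_BLOCKING=1
# Then re-run the same command, e.g.:
#   CUDA_VISIBLE_DEVICES=0 torchrun train.py --png --model_name exp1 \
#       --use_denseaspp --plane_residual --flip_right --use_mixture_loss
echo "$CUDA_LAUNCH_BLOCKING"
```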

Any other configuration of train.py I use without --use_mixture_loss runs smoothly.
For example, the command below runs fine:

CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right

Can anyone please help me fix this?
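For what it's worth, a hypothetical guard (my own sketch, not code from this repo; safe_backward is an invented name) could help distinguish a non-finite mixture-loss value from a genuine cuDNN failure, since a NaN/inf loss can also surface as an opaque backend error during the backward pass:

```python
import torch

def safe_backward(loss: torch.Tensor) -> None:
    # Fail fast with a readable message if the loss is NaN/inf, before the
    # backward pass can turn it into an opaque backend error.
    if not torch.isfinite(loss):
        raise ValueError(f"non-finite loss before backward: {loss.item()}")
    loss.backward()

# Usage sketch on CPU, so it runs without a GPU:
x = torch.tensor(2.0, requires_grad=True)
safe_backward((x - 1.0) ** 2)
print(x.grad)  # tensor(2.)
```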
