Closed
Description
I run train.py as follows:
CUDA_VISIBLE_DEVICES=0 torchrun train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss
and I get the following output:
>> CUDA_VISIBLE_DEVICES=0 torchrun train.py \ [main]
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \
./trainer_1stage.py not exist!
copy ./networks/depth_decoder.py -> ./log/ResNet/exp1/depth_decoder.py
copy ./train_ResNet.sh -> ./log/ResNet/exp1/train_ResNet.sh
train ResNet
use 49 xy planes, 14 xz planes and 0 yz planes.
use DenseAspp Block
use mixture Lap loss
use plane residual
Training model named:
exp1
Models and tensorboard events files are saved to:
./log/ResNet
Training is using:
cuda
Using split:
eigen_full_left
There are 22600 training items and 1776 validation items
Training
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/home/bar/projects/PlaneDepth/train.py", line 21, in <module>
trainer.train()
File "/home/bar/projects/PlaneDepth/trainer.py", line 248, in train
self.run_epoch()
File "/home/bar/projects/PlaneDepth/trainer.py", line 300, in run_epoch
losses["loss/total_loss"].backward()
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27588) of binary: /home/bar/miniconda3/envs/planedepth/bin/python
Traceback (most recent call last):
File "/home/bar/miniconda3/envs/planedepth/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-24_18:06:29
host : clikaws105
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 27588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Every other configuration of train.py that I have tried without --use_mixture_loss runs without errors.
For example, the command below runs fine:
CUDA_VISIBLE_DEVICES=0 torchrun train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right
Can anyone please help me fix this?
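One way to narrow this down, offered as a sketch rather than a confirmed fix: CUDNN_STATUS_NOT_INITIALIZED is often reported when the GPU runs out of memory at the moment cuDNN tries to create its handle, or when the installed PyTorch, CUDA and cuDNN builds do not match. Since the failure only appears with --use_mixture_loss, extra memory use in that code path is one possible cause, but that is an assumption and not something the log confirms. The helper below (`cudnn_report` is a hypothetical name, not part of PlaneDepth) collects the relevant versions and runs a tiny convolution forward and backward pass to check whether cuDNN initializes on its own, outside the training script.

```python
import importlib.util


def cudnn_report():
    """Collect basic facts about the local PyTorch / CUDA / cuDNN stack.

    Hypothetical diagnostic helper; it is not part of PlaneDepth.
    Returns a dict so the result can be printed or pasted into the issue.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["cudnn_available"] = torch.backends.cudnn.is_available()
    report["cudnn_version"] = torch.backends.cudnn.version()

    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_name"] = props.name
        report["gpu_total_memory_mib"] = props.total_memory // (1024 ** 2)
        try:
            # Tiny conv forward + backward: exercises cuDNN in isolation.
            conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
            x = torch.randn(1, 3, 32, 32, device="cuda")
            conv(x).sum().backward()
            torch.cuda.synchronize()
            report["cudnn_smoke_test"] = "ok"
        except RuntimeError as exc:
            report["cudnn_smoke_test"] = f"failed: {exc}"

    return report


if __name__ == "__main__":
    for key, value in cudnn_report().items():
        print(f"{key}: {value}")
```

If the smoke test reports "ok", the cuDNN installation itself is probably fine, and it may be worth watching GPU memory (for example with nvidia-smi) while training with --use_mixture_loss. If the smoke test fails, a version mismatch in the environment is the more likely direction to investigate.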