[master] RuntimeError: stack expects a non-empty TensorList #364
v0.1.5 works fine, but on master I get the following:
```
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 304, in main
    train_result = trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 958, in train
    self.optimizer.clip_grad_norm(self.args.max_grad_norm)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/fairscale/optim/oss.py", line 284, in clip_grad_norm
    input=torch.stack([torch.norm(input=p.grad.detach(), p=norm_type, dtype=torch.float32).to(self._device) for p in local_params]),  # type: ignore
RuntimeError: stack expects a non-empty TensorList
```
The problem was introduced by the last commit, 7fdd7ec; reverting it fixes the issue.
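For illustration, here is a minimal pure-Python sketch (not fairscale's actual code) of the failure mode: each rank computes the norm over the parameter shards it owns and the per-rank results are then combined, but when a rank owns no parameters, `torch.stack([])` on the empty list raises. The function names and the zero-contribution guard below are hypothetical.

```python
import math

def local_norm_buggy(local_grads, p=2.0):
    # Mirrors stacking per-parameter norms: fails outright
    # when this rank holds no parameters (empty shard).
    per_param = [abs(g) for g in local_grads]
    if not per_param:
        raise RuntimeError("stack expects a non-empty TensorList")
    return sum(x ** p for x in per_param) ** (1.0 / p)

def local_norm_fixed(local_grads, p=2.0):
    # One possible guard: an empty shard contributes a zero norm.
    if not local_grads:
        return 0.0
    per_param = [abs(g) for g in local_grads]
    return sum(x ** p for x in per_param) ** (1.0 / p)

def global_norm(per_rank_grads, p=2.0):
    # Combine the per-rank partial norms (what an all_reduce would do).
    local_norms = [local_norm_fixed(g, p) for g in per_rank_grads]
    return sum(n ** p for n in local_norms) ** (1.0 / p)

# Rank 1 owns no parameters, e.g. frozen embeddings plus sharding
# can leave a rank's shard empty.
shards = [[3.0, 4.0], []]
print(global_norm(shards))  # 5.0
```

With `--freeze_embeds` many parameters have no gradients, so after sharding a rank can end up with an empty `local_params` list, which is consistent with the traceback above.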
This is the usual HF Trainer test command with fairscale via `--sharded_ddp`:

```bash
cd transformers
cd examples/seq2seq
export BS=4; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 \
python3 -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
  --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 \
  --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 \
  --learning_rate 3e-5 --logging_first_step --logging_steps 1000 \
  --max_source_length 128 --max_target_length 128 --num_train_epochs 50 \
  --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler \
  --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 \
  --warmup_steps 500 --n_train 500 --sharded_ddp --fp16
```
Environment:

```
PyTorch version: 1.8.0.dev20210202+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.18.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1070 Ti
GPU 1: GeForce RTX 3090
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] pytorch-lightning==1.1.1rc0
[pip3] pytorch-memlab==0.2.2
[pip3] torch==1.8.0.dev20210202+cu110
[pip3] torchtext==0.6.0
[pip3] torchvision==0.9.0a0+f80b83e
[pip3] torchviz==0.0.1
[conda] blas                1.0       mkl
[conda] magma-cuda111       2.5.2     1 pytorch
[conda] mkl                 2020.2    256
[conda] mkl-include         2020.2    256
[conda] mkl-service         2.3.0     py38he904b0f_0
[conda] mkl_fft             1.2.0     py38h23d657b_0
[conda] mkl_random          1.1.1     py38h0573a6f_0
[conda] numpy               1.18.5    pypi_0 pypi
[conda] pytorch-lightning   1.1.1rc0  dev_0 <develop>
[conda] pytorch-memlab      0.2.2     pypi_0 pypi
[conda] torch               1.8.0a0+17f8c32 pypi_0 pypi
[conda] torchtext           0.6.0     pypi_0 pypi
[conda] torchvision         0.9.0a0+f80b83e dev_0 <develop>
[conda] torchviz            0.0.1     pypi_0 pypi
```