[master] RuntimeError: stack expects a non-empty TensorList #364

@stas00

Description

v0.1.5 works fine, but on master I'm getting this:

Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 304, in main
    train_result = trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 958, in train
    self.optimizer.clip_grad_norm(self.args.max_grad_norm)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/fairscale/optim/oss.py", line 284, in clip_grad_norm
    input=torch.stack([torch.norm(input=p.grad.detach(), p=norm_type, dtype=torch.float32).to(self._device) for p in local_params]),  # type: ignore
RuntimeError: stack expects a non-empty TensorList

The problem was introduced by the latest commit, 7fdd7ec — reverting it fixes the issue.
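
For context, the failure mode is easy to reproduce in isolation: `torch.stack` raises exactly this error when handed an empty list, which is what happens when a rank ends up with no local parameters after sharding. A minimal sketch (hypothetical, not fairscale's actual code path) of the error and a defensive fallback:

```python
import torch

# No local gradients on this rank (e.g. all params sharded to other ranks).
grads = []

# torch.stack requires a non-empty sequence of tensors, so this raises
# "RuntimeError: stack expects a non-empty TensorList".
try:
    torch.stack(grads)
except RuntimeError as e:
    print(e)

# A defensive pattern: fall back to a zero norm when the list is empty,
# so ranks without local params still participate in the reduction.
norms = [torch.norm(g, p=2, dtype=torch.float32) for g in grads]
local_norm = torch.norm(torch.stack(norms), p=2) if norms else torch.tensor(0.0)
print(local_norm)  # tensor(0.)
```

This is only an illustration of why the list can be empty under sharded DDP; the actual fix belongs in `fairscale/optim/oss.py`.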

This is the same HF Trainer test command with fairscale via sharded_ddp that I've been using all along:

cd transformers
cd examples/seq2seq
export BS=4; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0   python3 -m torch.distributed.launch --nproc_per_node=2  ./finetune_trainer.py --model_name_or_path  sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 50 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --sharded_ddp --fp16

Environment:

PyTorch version: 1.8.0.dev20210202+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.18.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce GTX 1070 Ti
GPU 1: GeForce RTX 3090

Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] pytorch-lightning==1.1.1rc0
[pip3] pytorch-memlab==0.2.2
[pip3] torch==1.8.0.dev20210202+cu110
[pip3] torchtext==0.6.0
[pip3] torchvision==0.9.0a0+f80b83e
[pip3] torchviz==0.0.1
[conda] blas                      1.0                         mkl
[conda] magma-cuda111             2.5.2                         1    pytorch
[conda] mkl                       2020.2                      256
[conda] mkl-include               2020.2                      256
[conda] mkl-service               2.3.0            py38he904b0f_0
[conda] mkl_fft                   1.2.0            py38h23d657b_0
[conda] mkl_random                1.1.1            py38h0573a6f_0
[conda] numpy                     1.18.5                   pypi_0    pypi
[conda] pytorch-lightning         1.1.1rc0                  dev_0    <develop>
[conda] pytorch-memlab            0.2.2                    pypi_0    pypi
[conda] torch                     1.8.0a0+17f8c32          pypi_0    pypi
[conda] torchtext                 0.6.0                    pypi_0    pypi
[conda] torchvision               0.9.0a0+f80b83e           dev_0    <develop>
[conda] torchviz                  0.0.1                    pypi_0    pypi
