zero_optimization.cpu_offload: true leads to a silent crash

I'm experimenting with various `zero_optimization` config options, and I noticed that when I flip `zero_optimization.cpu_offload` to `true`, the application exits without crashing or doing any training.

```json
{
    "train_batch_size": 20,
    "steps_per_print": 2000,

    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 0,
        "allgather_partitions": true,
        "allgather_bucket_size": 500000000,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 500000000,
        "contiguous_gradients": false,
        "cpu_offload": false
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "wall_clock_breakdown": false
}
```

With `cpu_offload` flipped to `true`, this config leads to a silent exit without doing any training:

<details>
<summary>Full log</summary>
<pre>

export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 19:42:37,871] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 19:42:37,897] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 19:42:38,631] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 19:42:38,631] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 19:42:38,631] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 19:42:38,631] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 19:42:38,631] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,139 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,141 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}

[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,415 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,417 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}

[INFO|tokenization_utils_base.py:1718] 2020-12-18 19:42:41,418 >> Model name 'sshleifer/distill-mbart-en-ro-12-4' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/distill-mbart-en-ro-12-4' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/62ed1799c9b9a3c199222637281d38762ae87e00165a2613e31c93b3673f08b8.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/9423d956f3dd4d8fd97112a8d3f87081f6256ce54ccfecd27938c48e294b8aa8.72fa8565f9c8b5dc27e7ac070020aec80359d9da2e5628b3f313f41bf44d322c
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/f5629ec54e86b66e2e9879777df84ce24ede4c93495e6ce9f9161011260c5344.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 19:42:43,989 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
[INFO|modeling_utils.py:1024] 2020-12-18 19:42:44,314 >> loading weights file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d2a7ade93d629fb16e06233407ab8aa0e70af5532c66c3b38ce2ff905743bf78.fa8ebf3af9c5dec8982ce624e74de87e85c9a944e776b79b8e8bd65126ed2073
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1045] 2020-12-18 19:43:06,939 >> load time=0.8602
[2020-12-18 19:43:07,280] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,280] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[INFO|modeling_utils.py:1145] 2020-12-18 19:43:07,318 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.

[WARNING|modeling_utils.py:1147] 2020-12-18 19:43:07,318 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 19:43:07,512] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,512] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 19:43:11,225] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 19:43:11,229] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,258] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,262] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 19:43:13,262] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
[2020-12-18 19:43:13,262] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-18 19:43:13,262] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : False
group 0 param 0 = 1048576
group 0 param 0 = 1048576

</pre>
</details>
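
For anyone trying to reproduce this, here is a rough sketch of how the two config variants could be generated and sanity-checked before launching. It is not part of DeepSpeed: `BASE_CONFIG`, `with_cpu_offload` and `offload_warnings` are hypothetical names made up for this example, and the stage check rests on my assumption (from my reading of the DeepSpeed config docs, not verified against 0.3.8) that `cpu_offload` is only meant to be combined with ZeRO stage 2.

```python
import copy
import json

# Trimmed copy of the config from this report (only the keys relevant here).
BASE_CONFIG = {
    "train_batch_size": 20,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 0, "cpu_offload": False},
}


def with_cpu_offload(config, enabled):
    """Return a deep copy of `config` with zero_optimization.cpu_offload set."""
    new_config = copy.deepcopy(config)
    new_config.setdefault("zero_optimization", {})["cpu_offload"] = enabled
    return new_config


def offload_warnings(config):
    """Return a list of human-readable warnings about the offload settings.

    ASSUMPTION: cpu_offload is only expected to work with ZeRO stage 2.
    This is a guess about DeepSpeed's requirements, not a check it performs.
    """
    zero = config.get("zero_optimization", {})
    warnings = []
    if zero.get("cpu_offload", False) and zero.get("stage", 0) != 2:
        warnings.append(
            "cpu_offload is enabled but stage is %r; "
            "cpu_offload may require stage 2" % zero.get("stage", 0)
        )
    return warnings


if __name__ == "__main__":
    # Print both variants used in this report, with any warnings.
    for enabled in (False, True):
        cfg = with_cpu_offload(BASE_CONFIG, enabled)
        print(json.dumps(cfg["zero_optimization"]), offload_warnings(cfg))
```

If the stage assumption holds, the `true` variant of the config above would be flagged before the run starts, which might be related to the silent exit, though I have not confirmed that this is the cause.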

If I flip `zero_optimization.cpu_offload` to `false`, everything works:

<details>
<summary>Full log</summary>
<pre>
export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 20:29:55,608] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 20:29:55,634] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 20:29:56,371] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 20:29:56,372] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 20:29:56,372] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 20:29:56,372] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 20:29:56,372] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 20:29:58,890 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:58,892 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}

[INFO|configuration_utils.py:431] 2020-12-18 20:29:59,191 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:59,192 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}

[INFO|tokenization_utils_base.py:1718] 2020-12-18 20:29:59,192 >> Model name 'sshleifer/tiny-mbart' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/tiny-mbart' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/13a2c62c1dabc5357bc38b0694f5829f3db0708d51f1a0f07734f62cc0a825a0.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/33fa7894ab257a74cede3060dca6d2fc609918785e80160f6c057723ece47292.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/e9c580e6446c42ed20fb148206f2a9bd75a825278ffa029df063682077d45bb6.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 20:30:01,779 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1024] 2020-12-18 20:30:02,107 >> loading weights file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d6eec704737db03a21a794f08b07fcbb71d855562a992cfb1be6193b37a7ff68.61ce63751e40ea882dd1a22b6c9303b954b81ec69d631ab0541750fd856720be
[INFO|modeling_utils.py:1045] 2020-12-18 20:30:02,150 >> load time=0.0017
[INFO|modeling_utils.py:1145] 2020-12-18 20:30:02,152 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.

[WARNING|modeling_utils.py:1147] 2020-12-18 20:30:02,152 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 20:30:02,195] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,195] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:02,339] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,339] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:05,642] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,645] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,674] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 20:30:05,674] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,680] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 4294967296, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:1'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:1')}}, 'param_groups': [{'lr': 3e-05, 'bias_correction': True, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'step': 1, 'params': [0]}]}, 'fp32_groups_flat': [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
         0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
         0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
         3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
        -1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
        -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
         3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
         1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
         4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
         2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
         0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
         0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
         2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
         9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
        -1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
         1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
         0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
        -2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
        -2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
        -6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
        -2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
        -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
         0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
         0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
        -2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
        -1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
         2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
        -1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
        -5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
        -1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
        -5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
         0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
         0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
        -1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
        -3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
         8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
        -3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
        -3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
         9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
         0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00], device='cuda:1',
       requires_grad=True)], 'clip_grad': 0.0}
FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
<deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fee4132d5e0>
[2020-12-18 20:30:05,681] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 4294967296, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0')}}, 'param_groups': [{'lr': 3e-05, 'bias_correction': True, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'step': 1, 'params': [0]}]}, 'fp32_groups_flat': [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
         0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
         0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
         3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
        -1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
        -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
         3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
         1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
         4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
         2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
         0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
         0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
         2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
         9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
        -1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
         1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
         0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
        -2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
        -2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
        -6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
        -2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
        -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
         0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
         0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
        -2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
        -1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
         2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
        -1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
        -5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
        -1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
        -5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
         0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
         0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
        -1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
        -3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
         8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
        -3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
        -3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
         9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
         0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00], device='cuda:0',
       requires_grad=True)], 'clip_grad': 0.0}
[2020-12-18 20:30:05,681] [INFO] [engine.py:457:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640>
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]
[2020-12-18 20:30:05,681] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f303160db50>
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_name ............... WarmupLR
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
2020-12-18 20:30:05 | INFO | __main__ | *** Train ***
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   steps_per_print .............. 2000
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_batch_size ............. 20
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  10
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to `Seq2SeqTrainer`, `--lr_scheduler` arg is ignored.
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_config .................. {
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": false,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": false,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": false,
    "stage": 0
}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-18 20:30:05,682] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true,
        "hysteresis":2,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":3e-05,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_batch_size":20,
    "wall_clock_breakdown":false,
    "zero_optimization":{
        "allgather_bucket_size":500000000,
        "allgather_partitions":true,
        "contiguous_gradients":true,
        "cpu_offload":false,
        "overlap_comm":false,
        "reduce_bucket_size":500000000,
        "reduce_scatter":false,
        "stage":0
    }
}
FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
<deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640>
2020-12-18 20:30:05 | INFO | __main__ | *** Train ***
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to `Seq2SeqTrainer`, `--lr_scheduler` arg is ignored.
[INFO|trainer.py:723] 2020-12-18 20:30:05,688 >> ***** Running training *****
[INFO|trainer.py:724] 2020-12-18 20:30:05,688 >>   Num examples = 500
[INFO|trainer.py:725] 2020-12-18 20:30:05,688 >>   Num Epochs = 1
[INFO|trainer.py:726] 2020-12-18 20:30:05,688 >>   Instantaneous batch size per device = 20
[INFO|trainer.py:727] 2020-12-18 20:30:05,688 >>   Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:728] 2020-12-18 20:30:05,688 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:729] 2020-12-18 20:30:05,688 >>   Total optimization steps = 13
{'loss': inf, 'learning_rate': 0.0, 'epoch': 0.07692307692307693}
 92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 12/13 [00:02<00:00,  5.65it/s][INFO|trainer.py:883] 2020-12-18 20:30:08,588 >>

Training completed. Do not forget to share your model on huggingface.co/models =)


{'epoch': 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  5.95it/s]
[INFO|trainer.py:1247] 2020-12-18 20:30:08,589 >> Saving model checkpoint to output_dir
[INFO|trainer.py:1251] 2020-12-18 20:30:08,589 >> Trainer.model is not a `PreTrainedModel`, only saving its state dict.
2020-12-18 20:30:08 | INFO | __main__ | ***** train metrics *****
2020-12-18 20:30:08 | INFO | __main__ |   train_samples_per_second = 172.096
2020-12-18 20:30:08 | INFO | __main__ |   train_runtime = 2.9054
2020-12-18 20:30:08 | INFO | __main__ |   train_n_ojbs = 500
</pre>
</details>

I know I haven't provided reproduction steps yet: I haven't quite finished the integration with HF `transformers`, but it should be ready soon. I was hoping you could tell from the logs what went wrong. If the logs aren't enough, I will update this issue with reproduction details once I have a `transformers` branch you can experiment with.
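
One thing that can be checked from the log alone is whether the `zero_optimization` values DeepSpeed printed match the ones in the config file. Below is a minimal sketch of that comparison, assuming the `ds_config.json` posted at the top is the file this particular run loaded (that is not confirmed, and the log may come from a different variant of the config). The helper name `diff_config` is hypothetical and not part of DeepSpeed; the two dictionaries are copied by hand from the posted config and from the `json = {...}` dump in the log.

```python
import json


def diff_config(supplied: dict, effective: dict, prefix: str = "") -> list:
    """Return (key_path, supplied_value, effective_value) for each differing leaf.

    Hypothetical helper for illustration only; it is not a DeepSpeed API.
    """
    diffs = []
    for key in sorted(set(supplied) | set(effective)):
        path = f"{prefix}{key}"
        a = supplied.get(key, "<missing>")
        b = effective.get(key, "<missing>")
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested sections such as "fp16" or "optimizer".
            diffs.extend(diff_config(a, b, prefix=path + "."))
        elif a != b:
            diffs.append((path, a, b))
    return diffs


# "zero_optimization" section as written in the posted ds_config.json.
supplied = json.loads("""
{
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": false,
    "cpu_offload": false
}
""")

# "zero_optimization" section as printed by DeepSpeed in the log above.
effective = json.loads("""
{
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": false,
    "overlap_comm": false,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": false,
    "stage": 0
}
""")

for path, a, b in diff_config(supplied, effective):
    print(f"{path}: supplied={a!r} effective={b!r}")
```

With the values above, three keys differ between the posted file and the log: `contiguous_gradients`, `overlap_comm`, and `reduce_scatter`. The sketch only reports the mismatch; it does not say whether the cause is a different config file, a parsing problem, or something else.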

