training the 20 and 8 billion model failed on SUMMIT #115

@agemagician

Description

Hello,

I am trying to train the 8 billion and the 20 billion parameter models on SUMMIT, and both runs fail.
SUMMIT has 6 NVIDIA V100 16GB GPUs per node.
Both the 8 billion and the 20 billion models fail with CUDA out-of-memory (OOM) errors.

The training command for the 20 billion model is:

export MP_SIZE=6

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 100 \
       --hidden-size 3720 \
       --num-attention-heads 30 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

The training command for the 8 billion model is:

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert_nccl.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 72 \
       --hidden-size 3072 \
       --num-attention-heads 24 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1
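
For reference, here is a rough back-of-envelope estimate of the two model sizes implied by these hyperparameters (my own approximation, using ~12*L*h^2 parameters for the transformer blocks; the vocab is only 30 tokens, so embeddings are negligible):

def approx_transformer_params(num_layers, hidden_size):
    # roughly 12*h^2 parameters per layer (4*h^2 for attention, 8*h^2 for the MLP),
    # ignoring biases and layernorms
    return 12 * num_layers * hidden_size ** 2

for name, layers, hidden in [("20B config", 100, 3720), ("8B config", 72, 3072)]:
    total = approx_transformer_params(layers, hidden)
    per_rank = total / 6  # --model-parallel-size 6
    print(f"{name}: ~{total / 1e9:.1f}B total, ~{per_rank / 1e9:.2f}B per model-parallel rank")

This comes out to roughly 16.6B and 8.2B parameters, which matches the per-rank counts printed in the logs below (2,799,983,247 and 1,381,032,967 respectively, times 6 ranks).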

The config file is:

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  } 
}
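
One thing I am not sure about (this is my assumption about how ZeRO works, not something I verified in the code): the optimizer-state partitioning should happen across data-parallel replicas, and with 6-way model parallelism on a single 6-GPU node the data-parallel size is 1, so there would be nothing to partition:

# Parallel dimensions for a single-node run with the commands above
# (assumption: data_parallel_size = world_size // model_parallel_size)
world_size = 6            # one rank per V100 on a SUMMIT node
model_parallel_size = 6   # --model-parallel-size 6
data_parallel_size = world_size // model_parallel_size
print(data_parallel_size)  # 1 -> optimizer states are not sharded across any replicas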

I am testing it on a single node, and even after reducing the train batch size to 1 it still runs out of memory.


The logs are:
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 04:40:19.647170: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 04:40:22.566024 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 04:40:22.567073 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 04:40:22.567220 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 04:40:22.567455: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 04:40:22.570236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 04:40:22.572765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 04:40:22.575278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 04:40:22.577850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 04:40:22.580415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 04:40:22.582986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 04:40:22.583008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 04:40:22.583068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 04:40:22.583108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 04:40:22.583146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 04:40:22.585072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 04:40:22.585118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 04:40:22.585156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 04:40:22.615387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 04:40:22.623295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 04:40:22.623314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 04:40:22.646660 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 04:40:25.123421 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 04:40:25.123578 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 04:40:25.123658 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 04:40:25.149839: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 04:40:25.153336 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 04:40:25.153439 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 04:40:25.154995 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 04:40:25.166115 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:125722:125722 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125722:125722 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125722 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:125724:125724 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125726:125726 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125727:125727 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125723:125723 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125971 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125725:125725 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125725:125725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125725:125725 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125725:125992 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:125723:125993 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:125724:125994 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:125726:125995 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:125727:125996 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:125722:125971 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:125722:125971 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:125726:125995 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:125725:125992 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:125724:125994 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:125722:125971 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:125722:125971 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:125722:125722 [0] NCCL INFO Launch mode Parallel
building BERT model ...
h36n18:125726:125995 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:125723:125993 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
 > number of parameters on model parallel rank 0: 2799983247
h36n18:125722:126579 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125722:126579 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:125722:126579 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 5: 2799983247
 > number of parameters on model parallel rank 3: 2799983247
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 579, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "pretrain_bert_nccl.py", line 170, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args)
  File "pretrain_bert_nccl.py", line 141, in get_optimizer
    'delayed_shift': args.hysteresis})
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 198, in __init__
    master_param = param.detach().clone().float()
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.50 GiB already allocated; 16.94 MiB free; 373.95 MiB cached; 0 bytes inactive)
 > number of parameters on model parallel rank 2: 2799983247
 > number of parameters on model parallel rank 1: 2799983247
 > number of parameters on model parallel rank 4: 2799983247

  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 05:07:35.425203: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 05:07:38.074505 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 05:07:38.074888 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 05:07:38.075031 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 05:07:38.075261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 05:07:38.078041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 05:07:38.080565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 05:07:38.083095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 05:07:38.085669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 05:07:38.088239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 05:07:38.090805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 05:07:38.090827: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 05:07:38.090887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 05:07:38.090926: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 05:07:38.090965: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 05:07:38.092861: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 05:07:38.092907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 05:07:38.092946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 05:07:38.123406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 05:07:38.130912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 05:07:38.130926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 05:07:38.154345 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 05:07:39.526942 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 05:07:39.527102 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 05:07:39.527187 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 05:07:39.553327: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 05:07:39.556849 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 05:07:39.556953 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 05:07:39.559207 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 05:07:39.570396 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:127714:127714 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127714:127714 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127714 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:127718 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127719:127719 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127717:127717 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127963 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127716:127716 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127715:127715 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127984 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127715:127985 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127719:127986 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127717:127987 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127718:127988 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127714:127963 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:127963 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127715:127985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127717:127987 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127716:127984 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:127714:127963 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:127963 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127715:127985 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:127988 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
building BERT model ...
 > number of parameters on model parallel rank 0: 1381032967
 > number of parameters on model parallel rank 1: 1381032967
 > number of parameters on model parallel rank 5: 1381032967
 > number of parameters on model parallel rank 3: 1381032967
 > number of parameters on model parallel rank 2: 1381032967
h36n18:127714:128267 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127714:128267 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127714:128267 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 4: 1381032967
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127715:128279 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127715:128279 [1] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127715:128279 [1] NCCL INFO comm 0x2001c8006620 rank 0 nranks 1 cudaDev 1 nvmlDev 1 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127719:128281 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128281 [5] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127719:128281 [5] NCCL INFO comm 0x2001ec006620 rank 0 nranks 1 cudaDev 5 nvmlDev 5 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127716:128283 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127716:128283 [2] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127716:128283 [2] NCCL INFO comm 0x200340006620 rank 0 nranks 1 cudaDev 2 nvmlDev 2 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127717:128286 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127717:128286 [3] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127717:128286 [3] NCCL INFO comm 0x200320006620 rank 0 nranks 1 cudaDev 3 nvmlDev 3 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:128288 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127718:128288 [4] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127718:128288 [4] NCCL INFO comm 0x2001f4006620 rank 0 nranks 1 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127718:128337 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128338 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127715:128339 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127717:128341 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127716:128340 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127714:128336 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:128336 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127719:128338 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO comm 0x200408006620 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127715:128339 [1] NCCL INFO comm 0x200424006620 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:128337 [4] NCCL INFO comm 0x200410006620 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127717:128341 [3] NCCL INFO comm 0x20033c006620 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:128336 [0] NCCL INFO comm 0x200718006620 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127716:128340 [2] NCCL INFO comm 0x20035c006620 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
learning rate decaying linear
Partition Activations False and Correctness Check False
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.75 GiB total capacity; 14.04 GiB already allocated; 580.94 MiB free; 200.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 2; 15.75 GiB total capacity; 14.13 GiB already allocated; 586.94 MiB free; 188.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 15.75 GiB total capacity; 14.13 GiB already allocated; 582.88 MiB free; 192.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 5; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 4; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 3; 15.75 GiB total capacity; 14.16 GiB already allocated; 558.94 MiB free; 192.72 MiB cached; 0 bytes inactive)

From my understanding of Table 8 in the paper, you were able to train both the 8 and 20 billion parameter models on 4 x 16GB GPUs using 4-way model parallelism.
In my case I am using 6-way model parallelism with a batch size of 1 and it doesn't work.
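
If I do a rough per-GPU memory estimate for the 8 billion run (assuming a plain FP16 optimizer that keeps FP16 weights and gradients plus FP32 master weights and FP32 Adam moments on every rank, which is what the fp16.py traceback suggests to me):

params_per_rank = 1_381_032_967   # reported above for the 8B run with MP=6

bytes_per_param = (
    2   # fp16 weights
    + 2 # fp16 gradients
    + 4 # fp32 master weights
    + 4 # fp32 Adam exp_avg
    + 4 # fp32 Adam exp_avg_sq
)

print(params_per_rank * bytes_per_param / 1024**3)  # ~20.6 GiB, before activations and buffers

If that assumption is right, the weights, gradients and optimizer states alone would already exceed the 15.75 GiB of a V100, which would explain the OOM even at batch size 1.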

Did I misunderstand something?
Do you have any idea how to make it work?
