I use the command below to train a model, and I find that host memory increases with every batch, which eventually leads to an out-of-memory error after some epochs.
The launch script is as follows:
sbatch --partition airesearch_middle --job-name mem_fairseq-py --gres gpu:4 --cpus-per-task 10 \
    --nodes 1 --ntasks-per-node 1 \
    --wrap "srun --output ${savedir}/train.log.node%t --error ${savedir}/train.stderr.node%t.%j \
    python train.py $DATA \
    --distributed-world-size 4 \
    --save-dir=${savedir} \
    --update-freq 16 \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 1000"
The script I use to measure memory is as follows:
#!/usr/bin/env python
import os

import psutil

def print_mem(itnum, bnum):
    # Return the resident set size (RSS) of the current process, in MB.
    py = psutil.Process(os.getpid())
    memory_use = py.memory_info()[0] / 2.0 ** 20  # bytes -> MB
    return 'iteration: {} batchnum {} memory use: {}MB'.format(itnum, bnum, memory_use)
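For context, here is roughly where the helper is called: once every 16 batches inside the training loop. This is a minimal standalone sketch; train_step and the loop below are placeholders for illustration, not fairseq's actual code.

def train_step(batch_idx):
    pass  # stand-in for the real forward/backward/optimizer step

for batch_idx in range(1, 1008):
    train_step(batch_idx)
    if batch_idx % 16 == 15:  # matches batchnum 15, 31, 47, ... in the log below
        print(print_mem(1, batch_idx))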
The log is as follows:
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:16693', distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=False, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=1000, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, optimizer='adam', raw_text=False, relu_dropout=0.0, restore_file='checkpoint_last.pt', save_dir='./checkpoints/transformer_vaswani_wmt_en_de_big/mem_single_node_multi_4gpus/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
Load dataset splits
| data-bin/wmt14_en_de_joined_dict train 4528446 examples
| data-bin/wmt14_en_de_joined_dict valid 3000 examples
Build model and criterion
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 209911808
| training on 4 GPUs
| max tokens per GPU = 1000 and max sentences per GPU = None
iteration: 1 batchnum 15 memory use: 3760.8203125MB
iteration: 1 batchnum 31 memory use: 3761.0703125MB
iteration: 1 batchnum 47 memory use: 3761.12109375MB
iteration: 1 batchnum 63 memory use: 3761.234375MB
iteration: 1 batchnum 79 memory use: 3761.3203125MB
iteration: 1 batchnum 95 memory use: 3761.40625MB
iteration: 1 batchnum 111 memory use: 3761.453125MB
iteration: 1 batchnum 127 memory use: 3761.5546875MB
iteration: 1 batchnum 143 memory use: 3761.62890625MB
iteration: 1 batchnum 159 memory use: 3761.69140625MB
iteration: 1 batchnum 175 memory use: 3761.7578125MB
iteration: 1 batchnum 191 memory use: 3761.87109375MB
iteration: 1 batchnum 207 memory use: 3761.92578125MB
iteration: 1 batchnum 223 memory use: 3762.00390625MB
iteration: 1 batchnum 239 memory use: 3762.08203125MB
iteration: 1 batchnum 255 memory use: 3762.16015625MB
iteration: 1 batchnum 271 memory use: 3762.2421875MB
iteration: 1 batchnum 287 memory use: 3762.3203125MB
iteration: 1 batchnum 303 memory use: 3762.390625MB
iteration: 1 batchnum 319 memory use: 3762.47265625MB
iteration: 1 batchnum 335 memory use: 3762.5546875MB
iteration: 1 batchnum 351 memory use: 3762.63671875MB
iteration: 1 batchnum 367 memory use: 3762.7109375MB
iteration: 1 batchnum 383 memory use: 3762.75390625MB
iteration: 1 batchnum 399 memory use: 3762.828125MB
iteration: 1 batchnum 415 memory use: 3762.87890625MB
iteration: 1 batchnum 431 memory use: 3762.98828125MB
iteration: 1 batchnum 447 memory use: 3763.06640625MB
iteration: 1 batchnum 463 memory use: 3763.12890625MB
iteration: 1 batchnum 479 memory use: 3763.17578125MB
iteration: 1 batchnum 495 memory use: 3763.265625MB
iteration: 1 batchnum 511 memory use: 3763.33203125MB
iteration: 1 batchnum 527 memory use: 3763.4296875MB
iteration: 1 batchnum 543 memory use: 3763.47265625MB
iteration: 1 batchnum 559 memory use: 3763.5625MB
iteration: 1 batchnum 575 memory use: 3763.640625MB
iteration: 1 batchnum 591 memory use: 3763.72265625MB
iteration: 1 batchnum 607 memory use: 3763.81640625MB
iteration: 1 batchnum 623 memory use: 3763.8828125MB
iteration: 1 batchnum 639 memory use: 3763.94921875MB
iteration: 1 batchnum 655 memory use: 3764.02734375MB
iteration: 1 batchnum 671 memory use: 3764.125MB
iteration: 1 batchnum 687 memory use: 3764.16015625MB
iteration: 1 batchnum 703 memory use: 3764.3046875MB
iteration: 1 batchnum 719 memory use: 3764.33984375MB
iteration: 1 batchnum 735 memory use: 3764.4296875MB
iteration: 1 batchnum 751 memory use: 3764.51171875MB
iteration: 1 batchnum 767 memory use: 3764.578125MB
iteration: 1 batchnum 783 memory use: 3764.66015625MB
iteration: 1 batchnum 799 memory use: 3764.72265625MB
iteration: 1 batchnum 815 memory use: 3764.796875MB
iteration: 1 batchnum 831 memory use: 3764.86328125MB
iteration: 1 batchnum 847 memory use: 3764.9375MB
iteration: 1 batchnum 863 memory use: 3765.02734375MB
iteration: 1 batchnum 879 memory use: 3765.125MB
iteration: 1 batchnum 895 memory use: 3765.17578125MB
iteration: 1 batchnum 911 memory use: 3765.2890625MB
iteration: 1 batchnum 927 memory use: 3765.3359375MB
iteration: 1 batchnum 943 memory use: 3765.375MB
iteration: 1 batchnum 959 memory use: 3765.44140625MB
iteration: 1 batchnum 975 memory use: 3765.5078125MB
iteration: 1 batchnum 991 memory use: 3765.58984375MB
| epoch 001: 1000 / 46210 loss=14.347, nll_loss=14.185, ppl=18624.06, wps=16255, ups=0.3, wpb=47786, bsz=1585, num_updates=62, lr=7.84845e-06, gnorm=3.188, clip=100%, oom=0, wall=182
iteration: 1 batchnum 1007 memo
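To narrow down where the growth comes from, my next step is to diff heap snapshots between logging points with the standard-library tracemalloc module. This is an untested sketch, not fairseq code, and it only sees Python-level allocations (tensor storage allocated from C would not show up):

import tracemalloc

tracemalloc.start()
_previous_snapshot = None

def report_growth(top=5):
    # Print the call sites whose allocated size grew the most since the
    # previous call; invoke this at the same points as print_mem.
    global _previous_snapshot
    current = tracemalloc.take_snapshot()
    if _previous_snapshot is not None:
        for stat in current.compare_to(_previous_snapshot, 'lineno')[:top]:
            print(stat)
    _previous_snapshot = current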