This repository was archived by the owner on Mar 20, 2026. It is now read-only.

Memory increases on every batch, leading to out-of-memory in some epoch #232

@mjc14

Description

I used this code to train a model and found that memory increases on every batch, which eventually leads to an out-of-memory error in some epoch.

The training script is as follows:
sbatch --partition airesearch_middle --job-name mem_fairseq-py --gres gpu:4 --cpus-per-task 10
--nodes 1 --ntasks-per-node 1
--wrap "srun --output ${savedir}/train.log.node%t --error ${savedir}/train.stderr.node%t.%j
python train.py $DATA
--distributed-world-size 4
--save-dir=${savedir}
--update-freq 16
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000
--lr 0.0005 --min-lr 1e-09
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1
--max-tokens 1000 "

The script used to measure memory is as follows:

#!/usr/bin/env python
import os

import psutil


def print_mem(itnum, bnum):
    """Return a string reporting this process's resident memory (RSS) in MB."""
    process = psutil.Process(os.getpid())
    # memory_info().rss is the resident set size in bytes; convert to MB
    memory_mb = process.memory_info().rss / 2.0 ** 20
    return 'iteration: {} batchnum {} memory use: {}MB'.format(itnum, bnum, memory_mb)
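As a supplementary diagnostic (not part of the original script), Python's standard-library `tracemalloc` can show *where* allocations grow between batches, which helps distinguish a genuine Python-side leak (e.g. a list of logged stats that keeps growing) from allocator fragmentation. A minimal sketch, with the per-batch accumulation simulated by a hypothetical `leak` list:

```python
import tracemalloc

tracemalloc.start()


def report_growth(before, after, limit=5):
    """Return the top `limit` source lines whose allocations grew most
    between two tracemalloc snapshots."""
    stats = after.compare_to(before, 'lineno')
    return [str(s) for s in stats[:limit]]


# Usage sketch: take a snapshot every N batches and diff it against the
# previous one; a real leak shows the same source line growing monotonically.
before = tracemalloc.take_snapshot()
leak = []                          # simulated per-batch accumulation
for _ in range(1000):
    leak.append(bytearray(1024))
after = tracemalloc.take_snapshot()

for line in report_growth(before, after):
    print(line)
```

In a training loop one would capture `before` once and diff against it periodically, alongside the RSS numbers from `print_mem`.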

The log is as follows:
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:16693', distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=False, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=1000, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, optimizer='adam', raw_text=False, relu_dropout=0.0, restore_file='checkpoint_last.pt', save_dir='./checkpoints/transformer_vaswani_wmt_en_de_big/mem_single_node_multi_4gpus/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
Load dataset splits
| data-bin/wmt14_en_de_joined_dict train 4528446 examples
| data-bin/wmt14_en_de_joined_dict valid 3000 examples
Build model and criterion
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 209911808
| training on 4 GPUs
| max tokens per GPU = 1000 and max sentences per GPU = None
iteration: 1 batchnum 15 memory use: 3760.8203125MB
iteration: 1 batchnum 31 memory use: 3761.0703125MB
iteration: 1 batchnum 47 memory use: 3761.12109375MB
iteration: 1 batchnum 63 memory use: 3761.234375MB
iteration: 1 batchnum 79 memory use: 3761.3203125MB
iteration: 1 batchnum 95 memory use: 3761.40625MB
iteration: 1 batchnum 111 memory use: 3761.453125MB
iteration: 1 batchnum 127 memory use: 3761.5546875MB
iteration: 1 batchnum 143 memory use: 3761.62890625MB
iteration: 1 batchnum 159 memory use: 3761.69140625MB
iteration: 1 batchnum 175 memory use: 3761.7578125MB
iteration: 1 batchnum 191 memory use: 3761.87109375MB
iteration: 1 batchnum 207 memory use: 3761.92578125MB
iteration: 1 batchnum 223 memory use: 3762.00390625MB
iteration: 1 batchnum 239 memory use: 3762.08203125MB
iteration: 1 batchnum 255 memory use: 3762.16015625MB
iteration: 1 batchnum 271 memory use: 3762.2421875MB
iteration: 1 batchnum 287 memory use: 3762.3203125MB
iteration: 1 batchnum 303 memory use: 3762.390625MB
iteration: 1 batchnum 319 memory use: 3762.47265625MB
iteration: 1 batchnum 335 memory use: 3762.5546875MB
iteration: 1 batchnum 351 memory use: 3762.63671875MB
iteration: 1 batchnum 367 memory use: 3762.7109375MB
iteration: 1 batchnum 383 memory use: 3762.75390625MB
iteration: 1 batchnum 399 memory use: 3762.828125MB
iteration: 1 batchnum 415 memory use: 3762.87890625MB
iteration: 1 batchnum 431 memory use: 3762.98828125MB
iteration: 1 batchnum 447 memory use: 3763.06640625MB
iteration: 1 batchnum 463 memory use: 3763.12890625MB
iteration: 1 batchnum 479 memory use: 3763.17578125MB
iteration: 1 batchnum 495 memory use: 3763.265625MB
iteration: 1 batchnum 511 memory use: 3763.33203125MB
iteration: 1 batchnum 527 memory use: 3763.4296875MB
iteration: 1 batchnum 543 memory use: 3763.47265625MB
iteration: 1 batchnum 559 memory use: 3763.5625MB
iteration: 1 batchnum 575 memory use: 3763.640625MB
iteration: 1 batchnum 591 memory use: 3763.72265625MB
iteration: 1 batchnum 607 memory use: 3763.81640625MB
iteration: 1 batchnum 623 memory use: 3763.8828125MB
iteration: 1 batchnum 639 memory use: 3763.94921875MB
iteration: 1 batchnum 655 memory use: 3764.02734375MB
iteration: 1 batchnum 671 memory use: 3764.125MB
iteration: 1 batchnum 687 memory use: 3764.16015625MB
iteration: 1 batchnum 703 memory use: 3764.3046875MB
iteration: 1 batchnum 719 memory use: 3764.33984375MB
iteration: 1 batchnum 735 memory use: 3764.4296875MB
iteration: 1 batchnum 751 memory use: 3764.51171875MB
iteration: 1 batchnum 767 memory use: 3764.578125MB
iteration: 1 batchnum 783 memory use: 3764.66015625MB
iteration: 1 batchnum 799 memory use: 3764.72265625MB
iteration: 1 batchnum 815 memory use: 3764.796875MB
iteration: 1 batchnum 831 memory use: 3764.86328125MB
iteration: 1 batchnum 847 memory use: 3764.9375MB
iteration: 1 batchnum 863 memory use: 3765.02734375MB
iteration: 1 batchnum 879 memory use: 3765.125MB
iteration: 1 batchnum 895 memory use: 3765.17578125MB
iteration: 1 batchnum 911 memory use: 3765.2890625MB
iteration: 1 batchnum 927 memory use: 3765.3359375MB
iteration: 1 batchnum 943 memory use: 3765.375MB
iteration: 1 batchnum 959 memory use: 3765.44140625MB
iteration: 1 batchnum 975 memory use: 3765.5078125MB
iteration: 1 batchnum 991 memory use: 3765.58984375MB
| epoch 001: 1000 / 46210 loss=14.347, nll_loss=14.185, ppl=18624.06, wps=16255, ups=0.3, wpb=47786, bsz=1585, num_updates=62, lr=7.84845e-06, gnorm=3.188, clip=100%, oom=0, wall=182
iteration: 1 batchnum 1007 memo
