Potential 0.4.1 Memory Leak for a Fairseq Model #9942

@hmc-cs-mdrissi

Description

Yesterday, on 0.4.0, I was training a fairseq model (fconv_self_att_wp) and could train it fine with a batch size of around 4 (technically I control the number of tokens fed into the model). After upgrading to 0.4.1, training the same model with the same arguments runs out of memory. Even after decreasing the batch size, it still runs out of memory after a while, with memory usage increasing every couple of batches.

There's an issue in fairseq that mentions this as well: facebookresearch/fairseq#232. That issue reports the memory leak with a different fairseq model, so I'd guess multiple fairseq models now have this problem. The command I've been using to train the fairseq model is:

python3.6 train.py data-bin/wikitext_outline_to_target -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 4000 --lr-scheduler reduce_lr_on_plateau --source-lang wikitext_outline --target-lang wikitext_target --max-epoch 25 --no-epoch-checkpoints --save-dir model4_checkpoints/

You'll need to replace the --source-lang/--target-lang and data-bin arguments with whatever dataset you end up using. The README at https://github.com/pytorch/fairseq/tree/master/examples/stories describes the commands in more detail to cover that part and trains the same architecture (with some variation in exact arguments, but I don't think they'll matter).

Edit: more specifically, the error was a CUDA out-of-memory error, and running nvidia-smi I could see the memory usage increasing over time. I had also upgraded to CUDA 9.2 / cuDNN 7.1.4, so the issue might be there. The OS was Ubuntu 16.04.
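To make the "memory increasing every couple of batches" observation reproducible, one way is to log torch.cuda.memory_allocated() (available since PyTorch 0.4) after each optimizer step and flag a sustained upward trend. This is a minimal sketch, not part of the fairseq training script; the leak_suspected helper and the window size are my own illustrative choices, kept in pure Python so it works on any sequence of byte counts:

```python
# Sketch: detect a steadily rising memory trend across batches.
# In a real training loop you would append torch.cuda.memory_allocated()
# (PyTorch >= 0.4) to `readings` after each optimizer step; the analysis
# below is pure Python and only looks at the recorded byte counts.

def leak_suspected(readings, window=5):
    """Return True if the last `window` readings are strictly increasing."""
    if len(readings) < window:
        return False  # not enough data points to judge a trend
    tail = readings[-window:]
    return all(b > a for a, b in zip(tail, tail[1:]))

# Example: in the loop, something like
#   readings.append(torch.cuda.memory_allocated())
#   if leak_suspected(readings):
#       print("warning: allocated GPU memory rising for 5 straight batches")
```

A steady climb here (matching the nvidia-smi observation) would distinguish a genuine leak from normal allocator caching, since memory_allocated reports tensors actually held, not memory merely cached by the allocator.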
