Experimental support for fairscale ShardedDDP #9139
Conversation
Re: your notes on GPU memory consumption improvements: in my experience, checking GPU allocation often doesn't show the real difference, since PyTorch tends to use more memory than it strictly needs when there is spare memory, and can get by with less when memory is tight. To get better improvement stats, it's best to push the batch size up until it OOMs instead; that gives a more precise difference than just comparing allocated memory. All I'm saying is that the improvements are probably even better than they seem.
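(For illustration only, not from the PR: a rough sketch of that measurement approach. The `run_step` helper and the size limits are hypothetical placeholders.)

```python
import torch

def find_max_batch_size(run_step, start=8, limit=4096):
    """Probe the largest batch size that fits before CUDA OOM.

    `run_step(bs)` is assumed to run one forward/backward pass with
    batch size `bs` (hypothetical helper, not part of this PR).
    """
    bs, max_ok = start, None
    while bs <= limit:
        try:
            run_step(bs)
            max_ok = bs   # this batch size still fits
            bs *= 2       # try a bigger one
        except RuntimeError as e:
            # PyTorch raises RuntimeError with "out of memory" on CUDA OOM
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise
    return max_ok
```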
finetune_trainer crashes with this option: could probably extend
Oh, it's just because it overrides the
OK, next we have this: coincidentally I have just had the same issue with the deepspeed integration when I enable its internal fp16 handling. I didn't get to the root of it yet, but removing [...]. Note: I'm switching to deepspeed fp16 handling there...
Is it FP16 with AMP or with Apex? I don't believe fairscale is compatible with Apex.
Native AMP. See the command line I'm testing with at:
```
    other choices will force the requested backend.
sharded_ddp (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Use Sharded DDP training from `FairScale <https://github.com/facebookresearch/fairscale>`__ (in distributed
    training only). This is an experimental feature.
```
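(Illustrative usage, not part of the diff: with the new flag, enabling sharded DDP from Python should look roughly like the sketch below when the script is started through a distributed launcher; the output directory and batch size are placeholders.)

```python
from transformers import TrainingArguments

# Sketch only: `output_dir` and the batch size are placeholder values.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    sharded_ddp=True,  # the new experimental flag added by this PR
)
```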
|
|
|
|
```python
if is_fairscale_available():
    from fairscale.optim import OSS
```
OSS is a bit cryptic to me, but I think it's still better to use the "real" name rather than doing `import OSS as OptimizerStateSharding`, so this is good for me!
Yeah, I'm using the same convention they do, so as not to surprise any user.
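(For readers unfamiliar with the FairScale API, here is a minimal sketch of the pattern from the FairScale main example that this PR follows; the model, learning rate, and device handling are placeholders, and the actual Trainer integration may differ.)

```python
import torch
from fairscale.optim import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

# Assumes torch.distributed has already been initialized by the launcher
# and that the model has been moved to the right device.
model = torch.nn.Linear(10, 10).cuda()

# OSS wraps a regular torch optimizer and shards its state across ranks.
optimizer = OSS(params=model.parameters(), optim=torch.optim.AdamW, lr=3e-5)

# ShardedDataParallel takes the place of DistributedDataParallel and reduces
# each gradient to the rank that owns the matching optimizer shard.
model = ShardedDataParallel(model, optimizer)
```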
Hey there, a bit late, but I'm one of the fairscale/ShardedDDP authors. The issue with the Apex (and vanilla Torch) grad scaler is that it does not know about the gradient sharding, so not all the ranks will have the same behaviour. Torch AMP is supported though; you just have to pass in the ShardedGradScaler as defined here: https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/grad_scaler.py#L24
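(A minimal sketch of what that looks like in a native AMP training loop, assuming `model` and `optimizer` are set up as in the sketch above; the dummy inputs and step count are only for illustration.)

```python
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

# Drop-in replacement for torch.cuda.amp.GradScaler that is aware of the
# gradient sharding done by OSS / ShardedDataParallel.
scaler = ShardedGradScaler()

for _ in range(10):  # a few dummy steps, just for illustration
    inputs = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(inputs).sum()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```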
Yes, we're passing that scaler :-) The issue was with AMP, not Apex. It looks like there is a problem with one of the models, with or without FP16.
What does this PR do?
This PR adds support for FairScale's sharded DDP training to save GPU memory when training models in a distributed setting. Initial tests indeed show a nice reduction in GPU memory used!
This follows the steps of the main example provided on the FairScale repo, integrating them into our Trainer API. To activate training with sharded DDP, one must pass along the flag `--sharded_ddp` in a distributed launch command.

Benchmarks tried:
- `bert_base_uncased`: goes from 5GB per GPU to 4GB per GPU, with no loss of accuracy.
- `xlnet_large-cased`: goes from 11.5GB per GPU to 8GB per GPU (didn't run to the end, so didn't check whether the final accuracy was the same; the training loss seemed equivalent).