do not scale gradient in bf16 mode #21428
Conversation
The documentation is not available anymore as the PR was closed or merged.
stas00
left a comment
Thank you, Kashif. This has been long overdue!
sgugger
left a comment
Thanks for working on this! I think we can clean up the code a tiny bit more, but this is the crux of the issue.
src/transformers/trainer.py
Outdated
    else:
        self.do_grad_scaling = False
        self.use_cuda_amp = False
        self.amp_dtype = None
Just realized there is this else block here. Clearly self.do_grad_scaling = False is not necessary, but you might need to keep the other two lines somewhere else.
@pacman100 FSDP doesn't handle bfloat16 at all?
Hello @sgugger, similar to DeepSpeed, FSDP also manages its own half-precision; however, for FP16 it needs ShardedGradScaler. Here's an example notebook from the PyTorch team regarding FSDP MixedPrecision: https://github.com/lessw2020/transformer_central/blob/main/mixed_precision/mixed_precision_fsdp.ipynb
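To make the scaler-selection rule above concrete, here is a minimal, framework-free sketch of the decision logic (the function name and string return values are illustrative, not the actual Trainer API): a gradient scaler is only needed for fp16, and under FSDP the sharded variant is required because gradients are sharded across ranks.

```python
from typing import Optional


def pick_grad_scaler(amp_dtype: str, use_fsdp: bool) -> Optional[str]:
    """Illustrative sketch: which scaler class a trainer would use.

    Returns the name of the scaler class, or None when no scaling
    is needed (bf16 has the same exponent range as fp32, so fp16-style
    gradient underflow is not a concern).
    """
    if amp_dtype != "float16":
        # bf16 (or full precision): no gradient scaling at all
        return None
    # fp16 under FSDP needs the sharded scaler; otherwise the plain one
    return "ShardedGradScaler" if use_fsdp else "GradScaler"
```

In the real code the returned names would correspond to torch.cuda.amp.GradScaler and the FSDP ShardedGradScaler; the sketch only captures the branching.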
What does this PR do?
Turn off gradient scaling in the trainer when bf16 mode is selected. Only use gradient scaling in float16 mode.
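The PR's core condition can be sketched as a small predicate (a hedged illustration, not the exact Trainer attributes beyond those quoted in the diff above): scaling is enabled only when AMP is active with float16.

```python
def should_scale_gradients(use_amp: bool, amp_dtype: str) -> bool:
    """Illustrative sketch of the bf16 fix in this PR.

    Gradient scaling exists to prevent fp16 gradient underflow; bf16
    keeps fp32's exponent range, so it never needs a scaler.
    """
    return use_amp and amp_dtype == "float16"
```

With this predicate, bf16 runs proceed with do_grad_scaling effectively False, matching the PR description.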
Who can review?
@sgugger and @stas00