[BUG] bf16 incorrectly configured in src/transformers/deepspeed.py #16596

@michaelroyzen

Description

Environment info

  • transformers version: 4.17.0
  • Platform: Ubuntu
  • Python version: 3.8
  • PyTorch version (GPU?): 8x A10
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?:

Who can help

@stas00

Models: T5

Information

I am trying to fine-tune T5 with the Hugging Face Trainer in bf16, using its built-in DeepSpeed integration. Although I set bf16=True in TrainingArguments and "bf16": { "enabled": true } in the DeepSpeed config, the flag makes no difference in GPU memory usage or training speed. Digging in, I found a typo in src/transformers/deepspeed.py at line 253:

```python
if self.is_true("bfoat16.enabled"):
    self._dtype = torch.bfloat16
```

Instead of bfoat16.enabled, the key should be bfloat16.enabled. But even the corrected spelling is outdated: the latest DeepSpeed docs name the config section bf16, not bfloat16, so the check never matches the recommended config either way.
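To illustrate, here is a minimal, self-contained sketch of the dtype detection (detect_dtype and its inline is_true helper are hypothetical stand-ins for the logic in transformers' DeepSpeed config wrapper; the real code sets torch dtypes, which are represented as strings here to keep the sketch dependency-free). It accepts both the current bf16 section name and the legacy bfloat16 spelling:

```python
def detect_dtype(config: dict) -> str:
    """Pick the training dtype from a DeepSpeed config dict.

    Hypothetical helper illustrating the fix: the original code only
    checked the misspelled "bfoat16" key, so bf16 configs were silently
    ignored and training fell back to fp32.
    """
    def is_true(path: str) -> bool:
        # Resolve a dotted "section.key" path in the config dict.
        section, key = path.split(".")
        return bool(config.get(section, {}).get(key, False))

    if is_true("fp16.enabled"):
        return "float16"
    # Accept both the current "bf16" and the legacy "bfloat16" section names.
    if is_true("bf16.enabled") or is_true("bfloat16.enabled"):
        return "bfloat16"
    return "float32"


# With the config from this report, bf16 is now detected:
print(detect_dtype({"bf16": {"enabled": True}}))       # bfloat16
# The misspelled key from the buggy code matches nothing:
print(detect_dtype({"bfoat16": {"enabled": True}}))    # float32
```

With the current code, the second case is effectively what happens for every user: no real config contains a "bfoat16" section, so the bf16 branch is dead code.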
