[BUG] overflow warning needs to be different for fp16 and non-fp16 #2911

@stas00

Describe the bug

The following code produces a misleading warning when run in a non-fp16 regime:

https://github.com/microsoft/DeepSpeed/blob/da84e60d98d2e90f6f2094a219c98c8b41582eb9/deepspeed/runtime/zero/stage3.py#L1837-L1842

There are no loss scalers under bf16/fp32, so this warning is alarming to see. We rushed to check whether the config was somehow broken, but it wasn't.

The message should only include the "Attempted loss scale: ..." part under fp16.

Most likely the same applies to its counterpart in stage 1/2.
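To illustrate the request, here is a minimal sketch of a precision-aware message builder. The function name `overflow_message`, its parameters, and the exact wording are hypothetical and are not taken from the DeepSpeed code; the only point is that the loss-scale part is appended only when fp16 is enabled.

```python
def overflow_message(rank, fp16_enabled, prev_scale=None, new_scale=None):
    """Build a skipped-step warning (hypothetical helper, not DeepSpeed API).

    The loss-scale details are appended only under fp16, since bf16/fp32
    runs have no loss scaler to report on.
    """
    msg = f"Rank {rank}: non-finite gradients detected, skipping step."
    if fp16_enabled:
        msg += f" Attempted loss scale: {prev_scale}, reducing to {new_scale}"
    return msg


# fp16 run: the loss-scale part is included.
print(overflow_message(0, fp16_enabled=True, prev_scale=65536, new_scale=32768))
# bf16/fp32 run: only the skipped-step part is shown.
print(overflow_message(0, fp16_enabled=False))
```

With something along these lines, a bf16 or fp32 user would still learn that the step was skipped, without being told about a loss scale that does not exist in their setup.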

Also, do you think it would be helpful to tell the user specifically whether it's Inf or NaN? NaN isn't really an overflow, or is it? Perhaps one of you with a more rigorous math background knows better. My understanding is that overflow is only one of many ways to end up with a NaN, so a NaN doesn't always indicate an overflow. Please correct me if I'm wrong.

The reason I'm asking is to help the user know what to look for: NaNs, infinities, or something else.
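As a rough sketch of what such a report could look like, the snippet below classifies a list of plain Python floats as containing Inf, NaN, or both. The helper `classify_non_finite` is hypothetical and uses only the standard `math` module; on real gradient tensors one would presumably use `torch.isinf` and `torch.isnan` instead.

```python
import math


def classify_non_finite(values):
    """Report which kinds of non-finite values occur (hypothetical helper)."""
    has_inf = any(math.isinf(v) for v in values)
    has_nan = any(math.isnan(v) for v in values)
    if has_inf and has_nan:
        return "Inf and NaN"
    if has_inf:
        return "Inf"
    if has_nan:
        return "NaN"
    return "finite"


inf = float("inf")
# An overflowed value shows up as Inf.
print(classify_non_finite([1.0, inf]))
# Inf - Inf is an invalid operation and yields NaN rather than Inf.
print(classify_non_finite([inf - inf]))
```

The second call shows one way a NaN can arise after an overflow has already happened, which is why reporting the two cases separately could point the user in different debugging directions.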

@tjruwase

Labels: bug, training
