Detect split brain issues w.r.t. user code and deepspeed batch sizes #84

@jeffra

Description

User code often defines its own batch size, while the DeepSpeed config JSON specifies a separate one. When gradient accumulation is used, this can cause bugs where DeepSpeed assumes a different number of gradient accumulation steps than the user code is actually performing.

If the user is using the default collate_fn, DeepSpeed should be able to detect these cases and throw an exception. We can determine the batch size actually being passed to the forward pass by inspecting the first dimension of the input tensors; a sketch of such a check follows.
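A minimal sketch of what this check might look like. The helper name `check_micro_batch_size` and the way the expected size is passed in are assumptions for illustration, not existing DeepSpeed APIs; the expected value would come from the engine's parsed config (train batch size divided by gradient accumulation steps and world size).

```python
import torch


def check_micro_batch_size(inputs, expected_micro_batch, strict=True):
    """Hypothetical helper: compare the batch dimension of the forward-pass
    inputs against the micro batch size implied by the DeepSpeed config.
    """
    # Only tensors produced by the default collate_fn are inspected; a custom
    # collate_fn may batch along a different dimension, so those cases are skipped.
    tensors = [x for x in inputs if torch.is_tensor(x)]
    if not tensors:
        return  # nothing we can safely inspect

    observed = tensors[0].shape[0]
    if observed != expected_micro_batch:
        msg = (
            f"DeepSpeed config expects a micro batch size of "
            f"{expected_micro_batch}, but the forward pass received "
            f"{observed}. Check that the dataloader batch size and "
            f"gradient accumulation steps agree with the DeepSpeed config."
        )
        if strict:
            raise RuntimeError(msg)
        print(f"Warning: {msg}")
```

This would be called once per forward pass from inside the engine, with the expected value taken from the engine's config (e.g. the configured per-GPU micro batch size).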

Lastly, we probably want to add an error suppression flag to the DeepSpeed config so users can turn off this error when they know what they are doing and their batch alignment is non-standard.
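For illustration, the flag might look like the sketch below for a single-GPU run; the key name `disable_batch_size_check` is purely hypothetical and not an existing DeepSpeed config option.

```python
# Hypothetical DeepSpeed config (single GPU): 4 * 8 = 32, so the sizes are
# consistent, and the assumed "disable_batch_size_check" key would let the
# user opt out of the new mismatch error.
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "disable_batch_size_check": True,
}
```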

Labels: enhancement (New feature or request)