Detect split brain issues w.r.t. user code and deepspeed batch sizes #84

@jeffra

Description

User code often defines its own batch size, while the DeepSpeed config JSON specifies a separate one. When gradient accumulation is used, this can cause bugs where DeepSpeed assumes a different number of gradient accumulation steps than the user code is actually performing.

If the user is using the default collate_fn, DeepSpeed should be able to detect these cases and throw an exception. We can determine the batch size actually being passed to the forward pass by inspecting the first dimension of the input tensors; a sketch of such a check follows.
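A minimal sketch of what this check might look like. The helper name `check_micro_batch_size` and the way the expected size is passed in are assumptions for illustration, not existing DeepSpeed APIs; the expected value would come from the engine's parsed config (train batch size divided by gradient accumulation steps and world size).

```python
import torch


def check_micro_batch_size(inputs, expected_micro_batch, strict=True):
    """Hypothetical helper: compare the batch dimension of the forward-pass
    inputs against the micro batch size implied by the DeepSpeed config.
    """
    # Only tensors produced by the default collate_fn are inspected; a custom
    # collate_fn may batch along a different dimension, so those cases are skipped.
    tensors = [x for x in inputs if torch.is_tensor(x)]
    if not tensors:
        return  # nothing we can safely inspect

    observed = tensors[0].shape[0]
    if observed != expected_micro_batch:
        msg = (
            f"DeepSpeed config expects a micro batch size of "
            f"{expected_micro_batch}, but the forward pass received "
            f"{observed}. Check that the dataloader batch size and "
            f"gradient accumulation steps agree with the DeepSpeed config."
        )
        if strict:
            raise RuntimeError(msg)
        print(f"Warning: {msg}")
```

This would be called once per forward pass from inside the engine, with the expected value taken from the engine's config (e.g. the configured per-GPU micro batch size).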

Lastly, we probably want to add an error suppression flag to the DeepSpeed config so users can turn off this error when they know what they are doing and their batch alignment is non-standard.
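For illustration, the flag might look like the sketch below for a single-GPU run; the key name `disable_batch_size_check` is purely hypothetical and not an existing DeepSpeed config option.

```python
# Hypothetical DeepSpeed config (single GPU): 4 * 8 = 32, so the sizes are
# consistent, and the assumed "disable_batch_size_check" key would let the
# user opt out of the new mismatch error.
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "disable_batch_size_check": True,
}
```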

Labels: enhancement (New feature or request)