Catch spawned process failures and terminate #82

@ShadenSmith

Description

The DeepSpeed launcher should detect when a spawned process fails and then make sure the remaining children are joined with a timeout, so that one failure does not leave the others hanging. The distributed_test decorator already does this. We should evaluate that approach more rigorously and decide whether it is appropriate for deepspeed_run.
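
To make the request concrete, here is a minimal sketch of the behavior described above. It is not the DeepSpeed launcher's code or the distributed_test implementation. It assumes the launcher holds a list of `subprocess.Popen` handles, and the names `wait_or_terminate`, `poll_interval`, and `join_timeout` are hypothetical.

```python
import subprocess
import time


def wait_or_terminate(procs, poll_interval=0.1, join_timeout=5.0):
    """Wait for all children; on the first failure, stop the rest.

    Hypothetical helper: returns 0 if every child exited cleanly,
    otherwise the first non-zero return code observed.
    """
    alive = list(procs)
    while alive:
        for proc in list(alive):
            code = proc.poll()  # None while the child is still running
            if code is None:
                continue
            alive.remove(proc)
            if code != 0:
                # One child failed: ask the others to exit, then join
                # them against a shared deadline and kill any that are
                # still running when it expires.
                for other in alive:
                    other.terminate()
                deadline = time.monotonic() + join_timeout
                for other in alive:
                    remaining = max(0.0, deadline - time.monotonic())
                    try:
                        other.wait(timeout=remaining)
                    except subprocess.TimeoutExpired:
                        other.kill()
                        other.wait()
                return code
        time.sleep(poll_interval)
    return 0
```

A launcher could call `sys.exit(wait_or_terminate(children))` after spawning its per-rank processes. Whether a shared deadline followed by a hard kill is the right policy for deepspeed_run is exactly the part that needs evaluation.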

Labels: enhancement (New feature or request)