The DeepSpeed launcher should detect when a child process fails and then join the remaining children with a timeout, so that one crashed rank cannot leave the others hanging indefinitely. The distributed_test decorator already does this. We should evaluate that approach more rigorously and decide whether it is appropriate for deepspeed_run.
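
For discussion, here is a minimal sketch of the detect-then-join-with-timeout behavior using only the Python standard library. It is not taken from the DeepSpeed launcher or from distributed_test. The names `wait_for_children` and `_stop_remaining`, the use of `subprocess.Popen` children, and the default timeouts are all assumptions made for illustration.

```python
import subprocess
import time


def wait_for_children(procs, poll_interval=0.1, join_timeout=5.0):
    """Hypothetical helper: watch child processes until all of them exit.

    Returns 0 if every child exits cleanly. If any child exits with a
    non-zero code, the remaining children are stopped and that code is
    returned.
    """
    alive = list(procs)
    while alive:
        for proc in list(alive):
            code = proc.poll()
            if code is None:
                continue  # child is still running
            alive.remove(proc)
            if code != 0:
                # One child failed: do not wait forever on its peers.
                _stop_remaining(alive, join_timeout)
                return code
        time.sleep(poll_interval)
    return 0


def _stop_remaining(procs, join_timeout):
    """Ask the surviving children to exit, then join them with a shared deadline."""
    for proc in procs:
        proc.terminate()  # polite request first (SIGTERM on POSIX)

    deadline = time.monotonic() + join_timeout
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # The child ignored the request within the timeout: force it.
            proc.kill()
            proc.wait()
```

A launcher could call `wait_for_children` on its list of spawned ranks and exit with the returned code. Whether terminating peers immediately is the right policy (as opposed to giving them a grace period to finish on their own) is one of the things the evaluation would need to settle.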