Revert "Set all_reduce_token to None when exiting"#6321
Conversation
|
Let's wait and manually verify all GPU tests.
|
The GPU CI fails, but the CI log didn't show any failing test. I can see the error in the log, though. Will try to reproduce it on the last run test.
|
Running the last run test from the CI gave me the same error: the test reports OK, but it actually failed. The error also appeared for other tests.
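For reference, a minimal sketch of how a test can print OK while the run is actually broken, assuming the error is emitted by a teardown hook after unittest has already reported (the message and names below are made up, not from this PR):

```python
import atexit
import sys
import unittest


def _simulated_runtime_teardown():
    # Stands in for a C++ runtime destructor logging a failure after the
    # Python test harness has already printed its summary. The process
    # still exits with status 0, so a CI job that only checks the exit
    # code (or greps for "OK") treats the run as passing.
    print("Check failed: (hypothetical) runtime client torn down while "
          "an all_reduce token was still live", file=sys.stderr)


atexit.register(_simulated_runtime_teardown)


class SmokeTest(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()  # prints "OK"; the error line only shows up afterwards
```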
|
I also checked the GPU CI log. Before this PR, the GPU CI actually fails with the same error. With this PR, I no longer see that error message. So after this PR is merged, the next step is to fix the GPU CI so the error won't go unnoticed.
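One possible shape for that follow-up, as a sketch: scan the test log for runtime error patterns and fail the job even when the exit code is 0. The patterns and entry point here are assumptions, not the actual CI change:

```python
import re
import sys

# Illustrative patterns only; the real check would match whatever the
# runtime actually logs (e.g. glog "Check failed" / FATAL lines).
ERROR_PATTERNS = [
    re.compile(r"Check failed"),
    re.compile(r"^F\d{4} "),  # glog FATAL lines such as "F0131 12:34:56 ..."
]


def log_is_clean(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return not any(p.search(line) for line in f for p in ERROR_PATTERNS)


if __name__ == "__main__":
    # Usage: python check_log.py <test-log-path>
    sys.exit(0 if log_is_clean(sys.argv[1]) else 1)
```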
|
Note: I also need to revert #6268 to make the CI pass. #6268 was backported to r2.2 as well, @ManfeiBai @JackCaoG, so we may want to revert it in the r2.2 release branch too.
|
@vanbasten23 sorry for #6247 breaking the CI. I think the error is due to the ComputationClient having already exited. #6247 fixes the issue in #6246, which is a common scenario during testing.
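A miniature of the suspected ordering problem, with made-up names. The reverted PR set all_reduce_token to None on exit; this sketch only shows the shutdown race and a check-before-use guard, not the actual patch:

```python
import atexit


class FakeComputationClient:
    """Stand-in for the XLA runtime client; everything here is hypothetical."""
    alive = True

    def release_token(self):
        if not FakeComputationClient.alive:
            raise RuntimeError("ComputationClient already shut down")


_client = FakeComputationClient()


def _drop_all_reduce_token():
    # Runs at interpreter shutdown. Without the guard, this would raise
    # whenever the client was torn down first, i.e. the crash above.
    if FakeComputationClient.alive:
        _client.release_token()


atexit.register(_drop_all_reduce_token)

# Simulate the client exiting before the atexit hook fires.
FakeComputationClient.alive = False
```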
|
BTW, when using xla_multiprocessing, #6246 will not arise, so mp_test_early_exit.py is not needed there. This test should be placed in
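For context, the early-exit shape under xla_multiprocessing presumably looks like the sketch below (I don't have mp_test_early_exit.py at hand, so the function names are guesses); per the above, this path should not hit #6246:

```python
import sys

import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    if index == 0:
        sys.exit(0)  # one worker leaves early; the others keep running
    # ... remaining workers would run their collectives here ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```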
Reverts #6247
See #6320 and #6247 (comment) for details. Also, we observed an error in the GPU CI result in #6247 even though the GPU CI didn't catch it; we need to investigate why.