[core] Add Global State Test for multi-node setting #3239
pschafhalter merged 28 commits into ray-project:master
Conversation
## What do these changes do?

This PR exposes the command-line option as a config parameter. This is important for certain tests (e.g., fault-tolerance tests that remove nodes) to run quickly. Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible. #3239 depends on this.

TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add a test to verify this works.

## Related issue number
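A minimal sketch of the config-override pattern the PR describes. All names here (`DEFAULT_HEARTBEAT_TIMEOUT_MS`, `effective_timeout_ms`, the config key) are hypothetical and not Ray's actual API; the point is only that a test can pass a config value to shorten a default that would otherwise make node-removal tests slow.

```python
# Hypothetical sketch: a config dict overrides an internal default,
# so fault-tolerance tests that remove nodes can detect failure quickly.
DEFAULT_HEARTBEAT_TIMEOUT_MS = 30_000  # hypothetical production default


def effective_timeout_ms(config):
    """Return the timeout, letting the config parameter override the default."""
    return config.get("heartbeat_timeout_ms", DEFAULT_HEARTBEAT_TIMEOUT_MS)


# Production run: no override, use the default.
print(effective_timeout_ms({}))
# Fault-tolerance test: short timeout so node removal is noticed quickly.
print(effective_timeout_ms({"heartbeat_timeout_ms": 100}))
```

As the PR notes, threading a raw config dict through like this is a stopgap; a flag library such as GFLAGS gives the same override with validation and documentation built in.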
ray/python/ray/test/test_global_state.py, lines 62 to 63 in af0fe9c

ray/python/ray/test/test_global_state.py, lines 82 to 83 in af0fe9c

Nitpick: use parentheses instead of breaking lines with `\`. Otherwise LGTM!
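The nitpick above refers to a standard Python style point (the expressions below are illustrative, not the lines from the test file): implicit line continuation inside parentheses is preferred over backslash continuation, which is fragile (a trailing space after `\` is a syntax error).

```python
# Discouraged: explicit backslash continuation.
total = 1 + 2 + \
        3 + 4

# Preferred: parentheses allow breaking the line with no backslash.
total = (1 + 2 +
         3 + 4)

print(total)  # -> 10 either way; only the formatting differs
```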
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
- Actors will be restarted without checking whether resources are available. This can lead to problems if we lose resources.

## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don't wait or block for a trial that isn't running.

Remaining questions:
- Should `last_result` be consistent during restore? Yes, but not for trials that have yet to be checkpointed.
- Waiting for some PRs to merge first (#3239).

Closes #2851.
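The recovery policy described above can be sketched as a simple resource check; this is a hypothetical illustration of the decision, not Tune's actual implementation (`recover_trial` and the resource-dict shape are assumptions).

```python
# Hypothetical sketch of best-effort recovery: a RUNNING trial on a failed
# node resumes elsewhere if the cluster has room, otherwise becomes PENDING.
PENDING, RUNNING = "PENDING", "RUNNING"


def recover_trial(trial_resources, available_resources):
    """Decide a failed trial's new status on a best-effort basis."""
    has_room = all(
        available_resources.get(key, 0) >= amount
        for key, amount in trial_resources.items()
    )
    return RUNNING if has_room else PENDING


print(recover_trial({"cpu": 4}, {"cpu": 8}))  # RUNNING: resources available
print(recover_trial({"cpu": 4}, {"cpu": 2}))  # PENDING: cluster saturated
```

Notifying the scheduler via `trial_runner.stop_trial` before re-queueing matters because otherwise a TrialScheduler or SearchAlgorithm could block indefinitely waiting on results from a trial that no longer exists.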
Adds global state tests when using multiple machines.