Skip to content

[core] Add Global State Test for multi-node setting#3239

Merged
pschafhalter merged 28 commits intoray-project:masterfrom
richardliaw:global_state_multinode
Nov 13, 2018
Merged

[core] Add Global State Test for multi-node setting#3239
pschafhalter merged 28 commits intoray-project:masterfrom
richardliaw:global_state_multinode

Conversation

@richardliaw
Copy link
Copy Markdown
Contributor

Adds global state tests when using multiple machines.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9083/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9105/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9109/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9191/
Test FAILed.

richardliaw added a commit that referenced this pull request Nov 8, 2018
## What do these changes do?

This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
@pschafhalter
Copy link
Copy Markdown
Contributor

@pytest.mark.skipif(pytest_timeout==None, reason="Timeout package"\
" not installed; skipping test that may hang.")

@pytest.mark.skipif(pytest_timeout==None, reason="Timeout package"\
" not installed; skipping test that may hang.")

Nitpick: Use parentheses instead of breaking lines with \.

Otherwise LGTM!

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9209/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9214/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9212/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9309/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9336/
Test FAILed.

Copy link
Copy Markdown
Contributor

@pschafhalter pschafhalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pschafhalter pschafhalter merged commit c0423db into ray-project:master Nov 13, 2018
richardliaw added a commit that referenced this pull request Nov 21, 2018
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). 
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants