[tune] Node Fault Tolerance #3238

Merged
richardliaw merged 60 commits into ray-project:master from richardliaw:tune_cluster
Nov 21, 2018

@richardliaw
Contributor

@richardliaw richardliaw commented Nov 5, 2018

This PR introduces single-node fault tolerance for Tune.

Previous behavior:

  • Actors are restarted without checking whether resources are available. This can cause problems if the failure also removes resources from the cluster.

New behavior:

  • RUNNING trials will be resumed on another node on a best-effort basis (that is, they will run only if resources are available).
  • If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
  • During recovery, TrialSchedulers and SearchAlgorithms should be notified of this (via trial_runner.stop_trial) so that they don’t wait/block for a trial that isn’t running.
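
The new behavior listed above can be illustrated with a minimal, self-contained sketch. It does not use Tune's actual classes: `Trial`, `recover_from_node_failure`, the CPU-only resource model, and the `notify_stopped` callback (standing in for the notification that the PR routes through trial_runner.stop_trial) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

PENDING, RUNNING = "PENDING", "RUNNING"


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; not the real Trial class.
    name: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


def recover_from_node_failure(
    trials: List[Trial],
    free_cpus: Dict[str, int],
    failed_node: str,
    notify_stopped: Callable[[Trial], None],
) -> None:
    """Resume or requeue every RUNNING trial that lived on `failed_node`.

    `free_cpus` maps each surviving node to its free CPUs and is updated
    in place as trials are placed.
    """
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        # Tell schedulers/search algorithms the trial is no longer running,
        # so they do not block waiting for its next result.
        notify_stopped(trial)
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = PENDING
            trial.node = None
        else:
            # Best effort: resume on a surviving node with enough resources.
            free_cpus[target] -= trial.cpus
            trial.node = target
```

In this sketch, a trial that fits on a surviving node keeps its RUNNING status and moves there, while a trial that does not fit becomes PENDING with no node assigned.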

Remaining questions:

Closes #2851.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9081/
Test FAILed.

@ericl
Contributor

ericl commented Nov 6, 2018

Is this intended to test surviving a cluster restart as well? There's a nice ballistic test you can write for that:

while True:
   `ray start`
   run_experiments(...)
   after a random period of time, `ray stop`

Eventually, it should finish and you should have results for all trials.
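
The shape of that loop can be sketched without a real cluster. The simulation below is only an illustration under stated assumptions: `run_until_interrupted` is a hypothetical stand-in for `run_experiments(...)`, the random `budget` models how many trials finish before `ray stop` hits, and the `done` set models results that persist across restarts. It does not call Ray at all.

```python
import random
from typing import Set, Tuple


def run_until_interrupted(done: Set[int], total_trials: int, budget: int) -> bool:
    """Hypothetical stand-in for run_experiments(...).

    Finishes at most `budget` unfinished trials, then is "killed".
    Returns True only if every trial has a result.
    """
    for trial_id in range(total_trials):
        if trial_id in done:
            continue  # result survived an earlier restart
        if budget == 0:
            return False  # simulated `ray stop` before this trial finished
        done.add(trial_id)
        budget -= 1
    return True


def ballistic_test(
    total_trials: int = 10, seed: int = 0, max_restarts: int = 1000
) -> Tuple[int, Set[int]]:
    """Restart the simulated cluster until all trials have results."""
    rng = random.Random(seed)
    done: Set[int] = set()  # models results persisted across restarts
    for restarts in range(1, max_restarts + 1):
        budget = rng.randint(0, 3)  # random period before the cluster stops
        if run_until_interrupted(done, total_trials, budget):
            return restarts, done
    raise RuntimeError("experiment never finished")
```

A real version of this test would replace the simulated pieces with actual `ray start` / `ray stop` invocations and a real experiment, and would assert on the results written to disk.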

@richardliaw
Contributor Author

Oh, no, not yet... This is intended to cover just single-node removal tests (which ideally should work, but I don't actually think they do right now).

        logger.exception("Error recovering trial from checkpoint, abort.")
        self.stop_trial(trial, error=True, error_msg=error_msg)
    else:
        trial.status = Trial.PENDING
Contributor Author

@joyyoj I removed the try-catch here because this method is only invoked by the trial_runner and that already has a try-catch for this.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9313/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9314/
Test FAILed.

@richardliaw richardliaw mentioned this pull request Nov 12, 2018
2 tasks
@ericl ericl self-assigned this Nov 13, 2018

"""
if self.checkpoint_freq > 0:
    # Edge case of beginning trial
    if (self.checkpoint_freq > self.last_result[TRAINING_ITERATION]
Contributor

@ericl ericl Nov 13, 2018

Why not self.checkpoint_freq > 0 for the entire function?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9353/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9357/
Test FAILed.

@richardliaw
Contributor Author

richardliaw commented Nov 15, 2018 via email

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9384/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9385/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9425/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9511/
Test FAILed.

@richardliaw
Contributor Author

jenkins retest this please

@richardliaw
Contributor Author

Tests are dying on SGD, but the relevant tests are passing.

@richardliaw richardliaw merged commit 784a639 into ray-project:master Nov 21, 2018
@richardliaw richardliaw deleted the tune_cluster branch November 21, 2018 20:38
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9518/
Test FAILed.
