[tune] Node Fault Tolerance by richardliaw · Pull Request #3238 · ray-project/ray

richardliaw · 2018-11-05T08:05:22Z

This PR introduces single-node fault tolerance for Tune.

Previous behavior:

Actors are restarted without checking whether resources are available. This can lead to problems when a node, and the resources it provided, is lost.

New behavior:

RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run only if resources are available).
If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via trial_runner.stop_trial) so that they don’t wait/block for a trial that isn’t running.
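
A rough sketch of the recovery rule described above, using a toy model rather than Tune's actual classes. `ToyTrial`, `recover_trials`, the `free_cpus` mapping, and the `on_stop` callback are all hypothetical names introduced for illustration; the real change lives in the trial runner and executor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

PENDING, RUNNING = "PENDING", "RUNNING"


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a Tune trial; not the real Trial class."""
    name: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


def recover_trials(
    trials: List[ToyTrial],
    failed_node: str,
    free_cpus: Dict[str, int],
    on_stop: Callable[[ToyTrial], None] = lambda trial: None,
) -> None:
    """Move RUNNING trials off `failed_node`, or queue them as PENDING."""
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        # Tell schedulers/search algorithms the trial is no longer running.
        on_stop(trial)
        # Best effort: pick the first surviving node with enough free CPUs.
        target = next(
            (node for node, free in free_cpus.items()
             if node != failed_node and free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status, trial.node = PENDING, None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target


trials = [ToyTrial("t1", 2, "node_b"), ToyTrial("t2", 2, "node_b")]
recover_trials(trials, failed_node="node_b", free_cpus={"node_a": 2})
print([(t.name, t.status, t.node) for t in trials])
# [('t1', 'RUNNING', 'node_a'), ('t2', 'PENDING', None)]
```

In this toy version the first trial takes the only free slot and the second is queued, which mirrors the "best effort, otherwise PENDING" behavior in the description.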

Remaining questions:

Should last_result be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).
Waiting for some PRs to merge first ([core] Add Global State Test for multi-node setting #3239)

Closes #2851.

AmplabJenkins · 2018-11-05T09:20:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9081/
Test FAILed.

ericl · 2018-11-06T06:19:51Z

Is this intended to test surviving a cluster restart as well? There's a nice ballistic test you can write for that:

while True:
   `ray start`
   run_experiments(...)
   after a random period of time, `ray stop`

Eventually, it should finish and you should have results for all trials.
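
The loop above could be sketched in Python roughly as follows. This is only an illustration: `run_until_complete` and its parameters are hypothetical, `run_once` stands in for a call such as `run_experiments(...)` that returns True once every trial has results, and the `ray start --head` / `ray stop` invocations assume a single-machine cluster.

```python
import random
import subprocess
import threading


def start_cluster():
    # Assumption: a single-node cluster started on the local machine.
    subprocess.check_call(["ray", "start", "--head"])


def stop_cluster():
    subprocess.check_call(["ray", "stop"])


def run_until_complete(run_once, start=start_cluster, stop=stop_cluster,
                       max_uptime_s=60.0, max_rounds=100, rng=random):
    """Restart the cluster at random times until `run_once` returns True.

    Returns the number of rounds that were needed.
    """
    for round_no in range(1, max_rounds + 1):
        start()
        # Kill the cluster after a random period while the experiment runs.
        killer = threading.Timer(rng.uniform(0, max_uptime_s), stop)
        killer.start()
        try:
            finished = bool(run_once())
        except Exception:
            # The experiment died with the cluster; try again next round.
            finished = False
        finally:
            killer.cancel()
            killer.join()
            # May run a second time if the timer already fired.
            stop()
        if finished:
            return round_no
    raise RuntimeError("experiment did not finish within max_rounds")
```

Passing `start` and `stop` as arguments keeps the loop testable without a real cluster; whether a run that survives a mid-flight `ray stop` should count as finished is left to `run_once`.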

richardliaw · 2018-11-06T06:22:09Z

Oh, no, not yet... This is intended to cover just single-node removal tests (which ideally should work, but I don't think they actually do right now)

richardliaw · 2018-11-12T21:48:19Z

python/ray/tune/trial_executor.py

-            logger.exception("Error recovering trial from checkpoint, abort.")
-            self.stop_trial(trial, error=True, error_msg=error_msg)
+        else:
+            trial.status = Trial.PENDING


@joyyoj I removed the try-catch here because this method is only invoked by the trial_runner and that already has a try-catch for this.
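
To illustrate the reasoning, here is a minimal sketch of handling the error once, in the caller. `ToyExecutor`, `ToyRunner`, and the dict-based trial are hypothetical and do not reflect the real `trial_executor` or `trial_runner` code.

```python
class ToyExecutor:
    """Hypothetical executor: restores a trial and lets errors propagate."""

    def restore(self, trial):
        if trial.get("corrupt_checkpoint"):
            raise ValueError("could not restore from checkpoint")
        trial["status"] = "RUNNING"


class ToyRunner:
    """Hypothetical runner: the single place where restore errors are caught."""

    def __init__(self, executor):
        self.executor = executor

    def try_recover(self, trial):
        try:
            self.executor.restore(trial)
        except Exception as exc:
            # One handler here replaces a duplicate one inside the executor.
            trial["status"] = "ERROR"
            trial["error_msg"] = str(exc)


runner = ToyRunner(ToyExecutor())
good, bad = {}, {"corrupt_checkpoint": True}
runner.try_recover(good)
runner.try_recover(bad)
print(good["status"], bad["status"])  # RUNNING ERROR
```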

AmplabJenkins · 2018-11-12T23:32:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9313/
Test FAILed.

AmplabJenkins · 2018-11-12T23:48:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9314/
Test FAILed.

ericl · 2018-11-13T21:27:50Z

python/ray/tune/trial.py

+        """
+        if self.checkpoint_freq > 0:
+            # Edge case of beginning trial
+            if (self.checkpoint_freq > self.last_result[TRAINING_ITERATION]


Why not self.checkpoint_freq > 0 for the entire function?

AmplabJenkins · 2018-11-14T22:07:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9353/
Test FAILed.

AmplabJenkins · 2018-11-15T01:43:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9357/
Test FAILed.

richardliaw · 2018-11-15T01:46:30Z

Looks like cluster_tests.py is timing out...

AmplabJenkins · 2018-11-15T22:23:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9384/
Test FAILed.

AmplabJenkins · 2018-11-16T01:29:01Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9385/
Test FAILed.

AmplabJenkins · 2018-11-18T01:57:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9425/
Test FAILed.

AmplabJenkins · 2018-11-21T09:00:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9511/
Test FAILed.

richardliaw · 2018-11-21T18:15:42Z

jenkins retest this please

richardliaw · 2018-11-21T20:37:50Z

Tests dying on SGD, but relevant tests passing..

AmplabJenkins · 2018-11-21T21:16:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9518/
Test FAILed.

richardliaw added 4 commits November 4, 2018 21:44

[tune] Throw on overstepping

e557b6a

Add Tune Multi-Node Tests

b755785

Add cluster bookkeeping code

32d1242

add test for adding node

9ec3a60

richardliaw added 8 commits November 5, 2018 15:12

multinode test fixes

44fe1e2

First pass at allowing updatable values

d9c9e3b

Fix compilation issues

d6cade1

Merge branch 'config_updating' into global_state_multinode

ac74520

Add config file parsing

a95c718

Full initialization

5814655

Merge branch 'config_updating' into global_state_multinode

f63df3f

Wrote a good test

2824836

richardliaw and others added 15 commits November 6, 2018 15:34

Merge branch 'config_updating' into tune_cluster

6e7bd6a

configuration parsing and stuff

4842481

docs

8e52103

write some tests, make it good

83d6947

Merge branch 'master' into config_updating

4349adf

fixed init

8078967

Add all config options and bring back stress tests.

2db9f18

Merge branch 'config_updating' into tune_cluster

cc8fca2

Update python/ray/worker.py

59480dc

Update python/ray/worker.py

6fa9d7c

TEMP

856547c

Fix internalization

25e45cd

Merge branch 'config_updating' of github.com:richardliaw/ray into config_updating

2e2b8b0

some last changes

d3fa8f0

Merge branch 'config_updating' into tune_cluster

233f3ee

richardliaw commented Nov 12, 2018

richardliaw added 3 commits November 12, 2018 14:16

Track last result

a1a05f0

Merge branch 'global_state_multinode' into tune_cluster

5abf9d1

note

6ecc2bb

richardliaw mentioned this pull request Nov 12, 2018

[tune] Cluster Fault Tolerance #3309

Merged

2 tasks

ericl self-assigned this Nov 13, 2018

ericl reviewed Nov 13, 2018

ericl approved these changes Nov 13, 2018

richardliaw added 2 commits November 14, 2018 12:50

Merge branch 'master' into tune_cluster

1239c1a

fix up tests and checkpointing

7aab84f

import error

1e8a33d

richardliaw added 2 commits November 15, 2018 14:20

timeout?

637e707

lint

162b308

Merge branch 'master' into tune_cluster

a65fc45

lint

0541f92

richardliaw merged commit 784a639 into ray-project:master Nov 21, 2018

richardliaw deleted the tune_cluster branch November 21, 2018 20:38

richardliaw mentioned this pull request Dec 11, 2018

[tune] Cluster cannot handle lost machine #2682

Closed


3 participants
