[tune] [wip] Application-level Fault Tolerance #3165
richardliaw wants to merge 19 commits into ray-project:master
Conversation
Test FAILed.
ericl left a comment:
Overall I think this change is too invasive. Why modify the trial execution state at all?
It would also be good to have a design doc. For example, what needs to be in the state of the scheduler checkpoint? I can think of
- Hash of the experiment configuration.
- List of all generated trials and their experiment tag.
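The two items ericl lists could be sketched roughly as follows. `SchedulerCheckpoint` and `hash_experiment_config` are hypothetical names invented for illustration, not part of Tune's API; hashing a deterministic serialization of the config lets a restore verify it is resuming the same experiment.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchedulerCheckpoint:
    """Hypothetical scheduler checkpoint contents (names are illustrative)."""
    config_hash: str  # hash of the experiment configuration
    trial_tags: List[str] = field(default_factory=list)  # tags of all generated trials


def hash_experiment_config(config: Dict) -> str:
    # Serialize deterministically (sorted keys) so the same config
    # always produces the same hash, regardless of dict ordering.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


ckpt = SchedulerCheckpoint(
    config_hash=hash_experiment_config({"lr": 0.01, "num_samples": 4}),
    trial_tags=["0_lr=0.01", "1_lr=0.01"],
)
```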
        logger.debug("progress.csv found; appending to this file.")
    except FileNotFoundError:
        logger.debug("progress.csv not found.")
        labels = None
You don't need any of this right? Just open the file for append.
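A minimal sketch of the append-only approach suggested here, assuming a CSV logger; `open_progress_for_append` is a hypothetical helper, not Tune's real logger API:

```python
import csv
import os


def open_progress_for_append(logdir, fieldnames):
    # Hypothetical helper: open progress.csv in append mode, writing the
    # CSV header only when the file does not exist yet. No existence
    # check / try-except dance is needed beyond deciding on the header.
    path = os.path.join(logdir, "progress.csv")
    write_header = not os.path.exists(path)
    f = open(path, "a", newline="")
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    return f, writer
```

On resume, reopening the same file simply continues appending rows after the ones written before the interruption.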
    # Since a trial resumed after being paused should not run
    # trial.train.remote(), no new remote object ID is generated.
    # We use self._paused to store paused trials here.
    self._paused = {}
Do we need to change this? We can just restore these from last checkpoint right?
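The alternative being suggested, restoring a paused trial from its last checkpoint rather than caching it in `self._paused`, might look roughly like this. The class and method names are illustrative, not Tune's trial executor API:

```python
class PausedTrialRestorer:
    # Illustrative sketch: keep only the last checkpoint per trial and
    # rebuild trial state from it on unpause, instead of caching live
    # objects in a self._paused dict.

    def __init__(self):
        self._last_checkpoint = {}  # trial_id -> checkpointed state

    def checkpoint(self, trial_id, state):
        # Store a copy so later mutations of `state` don't leak in.
        self._last_checkpoint[trial_id] = dict(state)

    def unpause(self, trial_id):
        # Restore purely from the last checkpoint; no paused cache needed.
        return dict(self._last_checkpoint[trial_id])
```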
    logger.info("Restoring result from in-flight trial.")
    # If the trial was in flight when paused, we restore its result.
    self._running[ray.put(trial.next_result)] = trial
    trial.next_result = None
I don't think we need to make changes to the restore logic of the trial.
Some of these things are orthogonal to this change; I'll separate them into a subsequent PR.
I don't know what you're referring to by "trial execution state." If you mean changing the trial executor logic, sure, those changes are somewhat separate and can be done in a subsequent PR. The current trial checkpointing works without scheduler and search algorithm checkpointing; scheduler and search algorithm checkpoints can be done separately from this PR.
Closing this PR and will reopen when revised (soon).
Adds trial metadata to disk-based trial checkpoints. This allows trials
to be recovered exactly as if nothing had happened.
The only oddity is logging, since we don't do any rollback of the result
logging, but perhaps we can fix that later.
TODO:
tune.resume(LOGDIR)?
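The `tune.resume(LOGDIR)` TODO might amount to scanning the log directory for saved trial metadata and rebuilding trials from it. A rough sketch under that assumption; `resume` and the `trial_metadata.json` file name are hypothetical, taken only from the TODO above:

```python
import glob
import json
import os


def resume(logdir):
    # Hypothetical sketch of a resume entry point: collect the saved
    # metadata of every trial under `logdir` so trials can be rebuilt.
    trials = []
    pattern = os.path.join(logdir, "*", "trial_metadata.json")
    for meta_path in sorted(glob.glob(pattern)):
        with open(meta_path) as f:
            trials.append(json.load(f))
    return trials
```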