[tune] [wip] Application-level Fault Tolerance #3165
richardliaw wants to merge 19 commits into ray-project:master
Conversation
Test FAILed.
ericl left a comment:
Overall I think this change is too invasive. Why modify the trial execution state at all?
It would also be good to have a design doc. For example, what needs to be in the state of the scheduler checkpoint? I can think of
- Hash of the experiment configuration.
- List of all generated trials and their experiment tag.
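The two items ericl lists could be sketched roughly as follows. `SchedulerCheckpoint` and `hash_experiment_config` are hypothetical names invented for illustration, not part of Tune's API; hashing a deterministic serialization of the config lets a restore verify it is resuming the same experiment.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchedulerCheckpoint:
    """Hypothetical scheduler checkpoint contents (names are illustrative)."""
    config_hash: str  # hash of the experiment configuration
    trial_tags: List[str] = field(default_factory=list)  # tags of all generated trials


def hash_experiment_config(config: Dict) -> str:
    # Serialize deterministically (sorted keys) so the same config
    # always produces the same hash, regardless of dict ordering.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


ckpt = SchedulerCheckpoint(
    config_hash=hash_experiment_config({"lr": 0.01, "num_samples": 4}),
    trial_tags=["0_lr=0.01", "1_lr=0.01"],
)
```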
        logger.debug("progress.csv found; appending to this file.")
    except FileNotFoundError:
        logger.debug("progress.csv not found.")
        labels = None
You don't need any of this right? Just open the file for append.
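A minimal sketch of the append-only approach suggested here, assuming a CSV logger; `open_progress_for_append` is a hypothetical helper, not Tune's real logger API:

```python
import csv
import os


def open_progress_for_append(logdir, fieldnames):
    # Hypothetical helper: open progress.csv in append mode, writing the
    # CSV header only when the file does not exist yet. No existence
    # check / try-except dance is needed beyond deciding on the header.
    path = os.path.join(logdir, "progress.csv")
    write_header = not os.path.exists(path)
    f = open(path, "a", newline="")
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    return f, writer
```

On resume, reopening the same file simply continues appending rows after the ones written before the interruption.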
    # Since a trial resumed after being paused should not run
    # trial.train.remote(), no new remote object ID is generated.
    # We use self._paused to store paused trials here.
    self._paused = {}
Do we need to change this? We can just restore these from last checkpoint right?
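The alternative being suggested, restoring a paused trial from its last checkpoint rather than caching it in `self._paused`, might look roughly like this. The class and method names are illustrative, not Tune's trial executor API:

```python
class PausedTrialRestorer:
    # Illustrative sketch: keep only the last checkpoint per trial and
    # rebuild trial state from it on unpause, instead of caching live
    # objects in a self._paused dict.

    def __init__(self):
        self._last_checkpoint = {}  # trial_id -> checkpointed state

    def checkpoint(self, trial_id, state):
        # Store a copy so later mutations of `state` don't leak in.
        self._last_checkpoint[trial_id] = dict(state)

    def unpause(self, trial_id):
        # Restore purely from the last checkpoint; no paused cache needed.
        return dict(self._last_checkpoint[trial_id])
```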
    logger.info("Restoring result from in-flight trial.")
    # If the trial was in flight when paused, we restore its result.
    self._running[ray.put(trial.next_result)] = trial
    trial.next_result = None
I don't think we need to make changes to the restore logic of the trial.
Some of these things are orthogonal to this change; I'll separate them into a subsequent PR.
I don't know what you're referring to by "trial execution state." If you mean changing the trial executor logic, sure, those changes are somewhat separate and can be done in a subsequent PR. The current trial checkpointing works without scheduler and search algorithm checkpointing; scheduler and search algorithm checkpoints can be done separately from this PR.
Closing this PR and will reopen when revised (soon).
Adds trial metadata to disk-based trial checkpoints. This allows trials
to be recovered exactly as if nothing had happened.
The only oddity is logging, since we don't do any rollback of the result
logging, but perhaps we can fix that later.
TODO:
tune.resume(LOGDIR)?
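The `tune.resume(LOGDIR)` TODO might amount to scanning the log directory for saved trial metadata and rebuilding trials from it. A rough sketch under that assumption; `resume` and the `trial_metadata.json` file name are hypothetical, taken only from the TODO above:

```python
import glob
import json
import os


def resume(logdir):
    # Hypothetical sketch of a resume entry point: collect the saved
    # metadata of every trial under `logdir` so trials can be rebuilt.
    trials = []
    pattern = os.path.join(logdir, "*", "trial_metadata.json")
    for meta_path in sorted(glob.glob(pattern)):
        with open(meta_path) as f:
            trials.append(json.load(f))
    return trials
```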