[tune] Node Fault Tolerance #3238
|
Test FAILed. |
|
Is this intended to test surviving a cluster restart as well? There's a nice ballistic test you can write for that: Eventually, it should finish and you should have results for all trials. |
|
Oh, no, not yet... This is intended to cover just single-node removal tests (which ideally should work, but I don't think they actually do right now) |
| logger.exception("Error recovering trial from checkpoint, abort.") | ||
| self.stop_trial(trial, error=True, error_msg=error_msg) | ||
| else: | ||
| trial.status = Trial.PENDING |
@joyyoj I removed the try-catch here because this method is only invoked by the trial_runner and that already has a try-catch for this.
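To illustrate the idea of keeping the error handling in one place, here is a minimal sketch. It is not the actual Tune code: `SketchTrial`, `SketchRunner`, `recover_trial`, `process_failure`, and the `corrupt_checkpoint` flag are hypothetical names made up for this example. The recovery method lets exceptions propagate, and the caller's single try/except marks the trial as errored.

```python
import logging

logger = logging.getLogger(__name__)

PENDING = "PENDING"
ERROR = "ERROR"


class SketchTrial:
    """Hypothetical stand-in for a trial; not Tune's Trial class."""

    def __init__(self, name, corrupt_checkpoint=False):
        self.name = name
        # Assumption: this flag simulates a restore that fails.
        self.corrupt_checkpoint = corrupt_checkpoint
        self.status = "RUNNING"
        self.error_msg = None


class SketchRunner:
    """Hypothetical runner that owns the only try/except."""

    def recover_trial(self, trial):
        # No try/except here: any failure propagates to the caller.
        if trial.corrupt_checkpoint:
            raise RuntimeError("could not restore " + trial.name)
        trial.status = PENDING

    def process_failure(self, trial):
        # The single place where recovery errors are handled.
        try:
            self.recover_trial(trial)
        except Exception as exc:
            logger.exception("Error recovering trial from checkpoint, abort.")
            trial.status = ERROR
            trial.error_msg = str(exc)
```

With this layout, a second try/except inside `recover_trial` would only duplicate what `process_failure` already does.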
|
python/ray/tune/trial.py
Outdated
| """ | ||
| if self.checkpoint_freq > 0: | ||
| # Edge case of beginning trial | ||
| if (self.checkpoint_freq > self.last_result[TRAINING_ITERATION] |
Why not self.checkpoint_freq > 0 for the entire function?
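For reference, a minimal sketch of what a single guard for the entire function could look like. `should_checkpoint` and its modulo policy are assumptions made for illustration; the sketch does not reproduce the PR's actual handling of the beginning-of-trial edge case.

```python
def should_checkpoint(checkpoint_freq, training_iteration):
    """Return True when a checkpoint is due (hypothetical helper)."""
    # One guard at the top: a non-positive frequency disables checkpointing.
    if checkpoint_freq <= 0:
        return False
    # Assumed policy: checkpoint every `checkpoint_freq` iterations,
    # and never before the first iteration has completed.
    return training_iteration > 0 and training_iteration % checkpoint_freq == 0
```

Putting the check first means the rest of the function never has to consider the disabled case, including the modulo by zero it would otherwise risk.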
|
Looks like cluster_tests.py is timing out...
…On Wed, Nov 14, 2018 at 5:43 PM UCB AMPLab ***@***.***> wrote:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9357/
Test FAILed.
|
jenkins retest this please |
|
Tests are dying on SGD, but the relevant tests are passing. |
|
This PR introduces single-node fault tolerance for Tune.
Previous behavior:
New behavior:
trial_runner.stop_trial) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
Should last_result be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed).
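One possible reading of the answer above, as a toy sketch rather than the PR's implementation: `SketchTrial`, `save`, and `requeue_after_node_failure` are hypothetical names, and the rollback policy is an assumption. A trial that has a checkpoint comes back with the `last_result` recorded at checkpoint time, while a trial that has not been checkpointed yet starts over with an empty result.

```python
import copy

PENDING = "PENDING"


class SketchTrial:
    """Hypothetical stand-in for a trial; not Tune's Trial class."""

    def __init__(self, name):
        self.name = name
        self.status = "RUNNING"
        self.last_result = {}
        self.checkpoint = None  # snapshot of last_result at checkpoint time

    def save(self):
        # Record the result that belongs to this checkpoint.
        self.checkpoint = copy.deepcopy(self.last_result)


def requeue_after_node_failure(trial):
    """Assumed policy: roll back to the checkpoint, then mark as PENDING."""
    if trial.checkpoint is not None:
        # Consistent case: last_result matches the restored checkpoint.
        trial.last_result = copy.deepcopy(trial.checkpoint)
    else:
        # Not yet checkpointed: nothing to restore, so start from scratch.
        trial.last_result = {}
    trial.status = PENDING
```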
Waiting for some PRs to merge first ([core] Add Global State Test for multi-node setting #3239)
Closes #2851.