[tune] Fast Node Recovery#5053
Conversation
Changes LogSync into a mixin, and adds tests for different functionalities.
|
Test FAILed. |
|
@hartikainen I've been running this for the last day or so on 8 worker nodes and have seen 9 successful restarts thus far. Thank you so much, it seems to work! |
|
Test FAILed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test FAILed. |
hartikainen
left a comment
There was a problem hiding this comment.
Apart from the couple of comments, I think this looks good.
| else: | ||
| worker_ip = ray.get(trial.runner.current_ip.remote()) | ||
| trial.sync_logger_to_new_location(worker_ip) | ||
| # This can be very slow - a better fix would |
There was a problem hiding this comment.
Maybe add a TODO here and clarification when this would be very slow? Also might want to create an issue if this can cause problems.
python/ray/tune/trial_runner.py
Outdated
| self._scheduler_alg.on_trial_error(self, trial) | ||
| self.trial_executor.set_status(trial, Trial.PENDING) | ||
|
|
||
| # Right now, this requeues the trial to the end of the queue. This is |
There was a problem hiding this comment.
This should also have a TODO and possibly an issue for fixing this?
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test PASSed. |
What do these changes do?
Pre-empting nodes will result in actors sometimes being "lost". This
ends up taking book-keeping space and also resources that aren't filled.
This PR aims to address that.
Related issue number
Linter
scripts/format.shto lint the changes in this PR.