Skip to content

[tune] Fast Node Recovery#5053

Merged
richardliaw merged 67 commits intoray-project:masterfrom
richardliaw:fix_autoscaling_tune
Jul 12, 2019
Merged

[tune] Fast Node Recovery#5053
richardliaw merged 67 commits intoray-project:masterfrom
richardliaw:fix_autoscaling_tune

Conversation

@richardliaw
Copy link
Copy Markdown
Contributor

@richardliaw richardliaw commented Jun 28, 2019

What do these changes do?

Pre-empting nodes will result in actors sometimes being "lost". This
ends up taking book-keeping space and also resources that aren't filled.

This PR aims to address that.

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1450/
Test FAILed.

@SudeepDasari
Copy link
Copy Markdown

@hartikainen I've been running this for the last day or so on 8 worker nodes and have seen 9 successful restarts thus far. Thank you so much, it seems to work!

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15126/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1473/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15263/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1573/
Test FAILed.

Copy link
Copy Markdown
Contributor

@hartikainen hartikainen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the couple of comments, I think this looks good.

else:
worker_ip = ray.get(trial.runner.current_ip.remote())
trial.sync_logger_to_new_location(worker_ip)
# This can be very slow - a better fix would
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a TODO here and clarification when this would be very slow? Also might want to create an issue if this can cause problems.

self._scheduler_alg.on_trial_error(self, trial)
self.trial_executor.set_status(trial, Trial.PENDING)

# Right now, this requeues the trial to the end of the queue. This is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also have a TODO and possibly an issue for fixing this?

richardliaw and others added 4 commits July 10, 2019 18:39
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1602/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15291/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15293/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1608/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15299/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1626/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15317/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1629/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15320/
Test PASSed.

@richardliaw richardliaw merged commit 1530389 into ray-project:master Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants