Reconstruct failed actors without sending tasks. #5161

Merged
stephanie-wang merged 8 commits into ray-project:master from antgroup:fast_reconstruct
Jul 15, 2019

Conversation

@raulchen
Contributor

@raulchen raulchen commented Jul 10, 2019

What do these changes do?

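Previously, reconstruction of a failed actor was triggered only when a task was sent to it. This causes problems in some cases. For example, an actor that only reads data from an external DB never receives tasks, so it would never be reconstructed after a failure. With this PR, failed actors are reconstructed without a task having to be sent.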
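
To make the difference concrete, here is a minimal toy model of the two behaviors. It is not Ray's implementation: `ToyActorManager`, its methods, and the `eager` flag are hypothetical names invented for this sketch. It only illustrates why an actor that never receives tasks stays dead when reconstruction is tied to task submission.

```python
class ToyActorManager:
    """Toy bookkeeping for actors. Hypothetical; not Ray's actual code."""

    def __init__(self, eager):
        # eager=True: reconstruct as soon as a failure is observed.
        # eager=False: reconstruct only when a task is submitted.
        self.eager = eager
        self.state = {}            # actor_id -> "ALIVE" or "DEAD"
        self.reconstructions = {}  # actor_id -> number of reconstructions

    def create_actor(self, actor_id):
        self.state[actor_id] = "ALIVE"
        self.reconstructions[actor_id] = 0

    def _reconstruct(self, actor_id):
        self.state[actor_id] = "ALIVE"
        self.reconstructions[actor_id] += 1

    def on_actor_failure(self, actor_id):
        self.state[actor_id] = "DEAD"
        if self.eager:
            self._reconstruct(actor_id)

    def submit_task(self, actor_id):
        # In the lazy model, this is the only place a dead actor comes back.
        if self.state[actor_id] == "DEAD":
            self._reconstruct(actor_id)
        return "task sent to " + actor_id


# An actor that only pulls from an external DB and is never sent a task.
lazy = ToyActorManager(eager=False)
lazy.create_actor("db_reader")
lazy.on_actor_failure("db_reader")
print(lazy.state["db_reader"])   # DEAD

eager = ToyActorManager(eager=True)
eager.create_actor("db_reader")
eager.on_actor_failure("db_reader")
print(eager.state["db_reader"])  # ALIVE
```

In the lazy variant the actor comes back only after a `submit_task` call, which the DB-reading actor in the example never gets.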

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@raulchen raulchen requested a review from stephanie-wang July 10, 2019 14:00
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15279/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1589/

@stephanie-wang stephanie-wang self-assigned this Jul 10, 2019
Contributor

@zhijunfu zhijunfu left a comment

looks good to me. thanks

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks, this looks good!

raise Exception("Timing out of wait.")


def wait_for_contition(condition_predictor,

Suggested change
- def wait_for_contition(condition_predictor,
+ def wait_for_condition(condition_predictor,

def wait_for_contition(condition_predictor,
                       timeout_ms=1000,
                       retry_interval_ms=100):
    """A helper function that wait until a conition is met.

Suggested change
-     """A helper function that wait until a conition is met.
+     """A helper function that waits until a condition is met.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1636/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15328/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15332/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1640/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1641/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15333/

@stephanie-wang
Contributor

Looks like the unit test that was added failed on one of the Travis runs: https://travis-ci.com/ray-project/ray/jobs/215417720. We should increase the timeout for that test.

@raulchen
Contributor Author

thanks, increased to 5s
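
For context, a polling helper with the signature quoted in the review above might look like the sketch below. The parameter names and defaults (`condition_predictor`, `timeout_ms=1000`, `retry_interval_ms=100`) come from the reviewed snippet; the function body, the boolean return convention, and passing the 5s timeout as `timeout_ms=5000` are assumptions, since the page does not show the rest of the helper or the test.

```python
import time


def wait_for_condition(condition_predictor,
                       timeout_ms=1000,
                       retry_interval_ms=100):
    """Poll condition_predictor until it returns True or the timeout expires.

    Returns True if the condition was met in time, False otherwise.
    (The return convention is an assumption of this sketch.)
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        if condition_predictor():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(retry_interval_ms / 1000.0)


# Hypothetical test usage: wait up to 5 seconds for a flag to flip.
# Here the condition becomes true on the third poll.
polls = []


def actor_is_back():
    polls.append(1)
    return len(polls) >= 3


result = wait_for_condition(actor_is_back, timeout_ms=5000, retry_interval_ms=10)
```

A longer timeout only affects the failure path: the helper returns as soon as the condition holds, so raising the limit makes a test less flaky on slow CI machines without slowing down passing runs.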

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15366/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1670/

@raulchen
Contributor Author

@stephanie-wang Tests have passed. Can you give a stamp?

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks! :)

@stephanie-wang stephanie-wang merged commit ea6aa64 into ray-project:master Jul 15, 2019
@raulchen raulchen deleted the fast_reconstruct branch July 16, 2019 03:34

4 participants