
Increase timeout before reconstruction is triggered #3217

Merged
stephanie-wang merged 6 commits into ray-project:master from stephanie-wang:increase-task-lease on Nov 6, 2018

Conversation

@stephanie-wang
Contributor

What do these changes do?

This increases the time that each node waits before reconstruction can be triggered for an object that the node needs. This is a temporary solution to reduce the number of spurious reconstructions, but in the future we should figure out a more permanent solution for #3214.

Related issue number

This seems to be at least part of the issue behind #3170.
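To illustrate the idea behind this change (a rough sketch only, not Ray's actual internals; the class and constant names here are hypothetical), a node that needs an object can hold off on triggering reconstruction until a lease timeout elapses, so objects that are merely slow to arrive are not spuriously reconstructed:

```python
import time

# Hypothetical illustration of the timeout-before-reconstruction idea.
# RECONSTRUCTION_TIMEOUT_MS and ReconstructionPolicy are illustrative names,
# not part of Ray's real API.
RECONSTRUCTION_TIMEOUT_MS = 10000


class ReconstructionPolicy:
    def __init__(self, timeout_ms=RECONSTRUCTION_TIMEOUT_MS):
        self.timeout_s = timeout_ms / 1000.0
        # object_id -> monotonic time when the node first required the object
        self.first_needed = {}

    def object_required(self, object_id, now=None):
        """Record that this node needs object_id; return True only once the
        node has been waiting longer than the timeout, meaning
        reconstruction should be triggered."""
        now = time.monotonic() if now is None else now
        start = self.first_needed.setdefault(object_id, now)
        return (now - start) >= self.timeout_s

    def object_arrived(self, object_id):
        """The object showed up; cancel any pending reconstruction."""
        self.first_needed.pop(object_id, None)


policy = ReconstructionPolicy(timeout_ms=10000)
assert not policy.object_required("obj1", now=0.0)   # just started waiting
assert not policy.object_required("obj1", now=5.0)   # still within the lease
assert policy.object_required("obj1", now=10.0)      # timeout elapsed: trigger
```

A longer timeout trades slower recovery from genuine failures for fewer wasted re-executions when an object is simply in transit, which is the trade-off this PR adjusts.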

Contributor

Should this be set to something more conservative like 10000 until the bug is fixed?

Contributor

Could we also add the reproduction script as a stress test?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9026/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9023/

assert ray.services.all_processes_alive()


def test_submitting_many_actors_to_one(ray_start_sharded):
Collaborator

Does this test fail before this PR?

Contributor Author

It didn't fail on my laptop, but it did on EC2, which was how we initially found this issue. Travis is failing on this test right now, which unfortunately might be because we're trying to start too many processes.

Contributor Author

I'll try lowering the number of processes, and if Travis is still failing, I'll remove the test.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9040/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9050/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9069/

@robertnishihara
Collaborator

jenkins, retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9074/

@stephanie-wang stephanie-wang merged commit bf88aa5 into ray-project:master Nov 6, 2018
@stephanie-wang stephanie-wang deleted the increase-task-lease branch on November 6, 2018 02:03
@stephanie-wang
Contributor Author

Jenkins error looks unrelated.
