Increase timeout before reconstruction is triggered #3217
stephanie-wang merged 6 commits into ray-project:master
src/ray/ray_config.h
Outdated
Should this be set to something more conservative like 10000 until the bug is fixed?
test/stress_tests.py
Outdated
Could we also add the reproduction script as a stress test?
f6d4c05 to 431ca8e
Test PASSed.
Test PASSed.
    assert ray.services.all_processes_alive()


def test_submitting_many_actors_to_one(ray_start_sharded):
Does this test fail before this PR?
It didn't fail on my laptop, but it did on EC2, which is how we initially found this issue. Travis is failing on this test right now, which unfortunately might be because we're trying to start too many processes.
I'll try lowering the number of processes, and if Travis is still failing, I'll remove the test.
Test PASSed.
Test PASSed.
Test FAILed.
jenkins, retest this please
Test FAILed.
Jenkins error looks unrelated.
What do these changes do?
This increases the time that each node waits before reconstruction can be triggered for an object that the node needs. This is a temporary solution to reduce the number of spurious reconstructions, but in the future we should figure out a more permanent solution for #3214.
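As a rough illustration of the idea only (this is not Ray's actual implementation), a node could record when it first started waiting for an object and allow reconstruction only once a configurable timeout has elapsed. The class name `ReconstructionGate`, its methods, and the millisecond unit are all hypothetical and chosen for this sketch.

```python
import time


class ReconstructionGate:
    """Hypothetical helper: permits reconstruction only after a wait timeout."""

    def __init__(self, timeout_ms, clock=time.monotonic):
        # Assumption: the timeout is expressed in milliseconds.
        self.timeout_s = timeout_ms / 1000.0
        self.clock = clock
        self.waiting_since = {}  # object ID -> time the wait started

    def start_waiting(self, object_id):
        # Record the first time the object was needed; repeat calls keep it.
        self.waiting_since.setdefault(object_id, self.clock())

    def object_arrived(self, object_id):
        # The object became available, so reconstruction is no longer needed.
        self.waiting_since.pop(object_id, None)

    def should_reconstruct(self, object_id):
        # Reconstruct only if we are waiting and the timeout has elapsed.
        started = self.waiting_since.get(object_id)
        if started is None:
            return False
        return self.clock() - started >= self.timeout_s
```

With a gate like this, a larger timeout gives slow but healthy object transfers more time to finish before they are treated as lost, at the cost of slower recovery when an object really is gone.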
Related issue number
This seems to be at least part of the issue behind #3170.