[Serve] [CI] Add health check grace period to deflake single_deployment_1k_noop_replica #22651
Merged
edoakes merged 1 commit into ray-project:master on Feb 25, 2022
test_client and rllib:learning_tests_pendulum_sac are flaky on master and unrelated to this PR.
edoakes reviewed on Feb 25, 2022 and left a comment:
Looks good, did you validate it works manually?
Comment on lines +86 to +87:
    except RuntimeError:
        time.sleep(1)
log something here for transparency please
edoakes approved these changes on Feb 25, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request on Feb 27, 2022
Why are these changes needed?
Related issue number
In #22297, a change was made that requires a deployment to be healthy (all replicas healthy) before .deploy() returns, and raises an exception if this doesn't happen. In the 1k replica release test, 5-10 replicas would crash about 30% of the time (the reasons are still unknown), which caused the test to fail.
This PR adds a 10-minute grace period, which allows enough time for any crashed actors to restart.
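For illustration only, the grace-period idea could be sketched roughly as below. This is not the code from the PR: `wait_until_healthy`, `check_healthy`, and the `clock`/`sleep` parameters are hypothetical names, and `check_healthy` stands in for whatever call raises `RuntimeError` while replicas are still unhealthy. The log line in the `except` branch reflects the reviewer's request for transparency.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_until_healthy(
    check_healthy,
    grace_period_s=600.0,  # 10 minutes, matching the grace period described above
    poll_interval_s=1.0,
    clock=time.monotonic,  # injectable so the loop can be tested without waiting
    sleep=time.sleep,
):
    """Retry check_healthy() until it succeeds or the grace period runs out.

    check_healthy is a hypothetical callable (not a Ray Serve API) that raises
    RuntimeError while any replica is unhealthy. Returns the number of attempts
    made; re-raises the last RuntimeError once the grace period has elapsed.
    """
    deadline = clock() + grace_period_s
    attempts = 0
    while True:
        attempts += 1
        try:
            check_healthy()
            return attempts
        except RuntimeError as e:
            remaining = deadline - clock()
            if remaining <= 0:
                # Grace period exhausted: surface the failure to the test.
                raise
            logger.info(
                "Deployment not healthy yet (%s); retrying with %.0fs of "
                "grace period left.",
                e,
                remaining,
            )
            sleep(poll_interval_s)
```

A test would pass its own health check as `check_healthy`; using a deadline rather than a fixed retry count keeps the total wait bounded regardless of how long each check takes.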
Checks
Ran scripts/format.sh to lint the changes in this PR.
Ran python ~/ray/release/e2e.py --test-config ~/ray/release/serve_tests/serve_tests.yaml --test-name single_deployment_1k_noop_replica. Also ran a test where I raised an exception in this test, to double-check that the manual run was testing my local file and not a test file from a nightly Ray wheel.