Skip to content

[Serve] [CI] Add health check grace period to deflake single_deployment_1k_noop_replica#22651

Merged
edoakes merged 1 commit intoray-project:masterfrom
architkulkarni:deflaky-serve-1k
Feb 25, 2022
Merged

[Serve] [CI] Add health check grace period to deflake single_deployment_1k_noop_replica#22651
edoakes merged 1 commit intoray-project:masterfrom
architkulkarni:deflaky-serve-1k

Conversation

@architkulkarni
Copy link
Copy Markdown
Contributor

@architkulkarni architkulkarni commented Feb 25, 2022

Why are these changes needed?

Related issue number

In #22297 a change was made that requires a deployment to be healthy (all replicas healthy) before .deploy() returns, and raises an exception if this doesn't happen. In the 1k replica release test, about 30% of the tim, 5-10 replicas would crash (the reasons are still unknown). This caused the test to fail.

This PR adds a 10 minute grace period, which allows enough time for any crashed actors to restart.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
    • Manually tested with python ~/ray/release/e2e.py --test-config ~/ray/release/serve_tests/serve_tests.yaml --test-name single_deployment_1k_noop_replica. Also ran a test where I raised an exception in this test, just to double-check that the manual test was actually testing my local file correctly and not a test file from some nightly Ray wheel.

@architkulkarni architkulkarni changed the title [WIP] [Serve] [CI] Add health check grace period to deflake single_deployment_1k_noop_replica [Serve] [CI] Add health check grace period to deflake single_deployment_1k_noop_replica Feb 25, 2022
@architkulkarni
Copy link
Copy Markdown
Contributor Author

test_client and rllib:learning_tests_pendulum_sac flaky on master and unrelated to this PR.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 25, 2022
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, did you validate it works manually?

Comment on lines +86 to +87
except RuntimeError:
time.sleep(1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log something here for transparency please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants