[Serve] Serve failure test hotfix. #8928

Closed
rkooo567 wants to merge 1 commit into ray-project:master from rkooo567:serve-failure-test-investigation
@rkooo567
Contributor

Why are these changes needed?

When serve.create_backend calls ray.get(master_actor.create_backend.remote(backend_tag, backend_config, replica_config)), the call raises a RayActorError because master_actor has just been killed. The release test doesn't catch this exception, so the uncaught exception crashes the driver. That in turn kills the Raylet and the GCS server, which introduces confusing errors there.

I don't know what the right behavior is in this case, but it seems that wrapping serve.create_backend, serve.create_endpoint, and serve.delete_endpoint in

try:
    ...  # the serve.* call being wrapped
except ray.exceptions.RayActorError:
    ...  # handling left unspecified here

resolves the issue.
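A minimal sketch of the wrapping described above, not the code from this PR. The helper name `call_tolerating_actor_death` and its return convention are hypothetical, and the fallback exception class exists only so the sketch runs where Ray is not installed.

```python
try:
    from ray.exceptions import RayActorError
except ImportError:
    # Assumption: stand-in class so this sketch runs without Ray installed.
    class RayActorError(Exception):
        pass


def call_tolerating_actor_death(fn, *args, **kwargs):
    """Call fn and report whether it completed.

    Returns (True, result) on success, or (False, None) if the call
    raised RayActorError, e.g. because the target actor was just killed.
    Any other exception propagates unchanged.
    """
    try:
        return True, fn(*args, **kwargs)
    except RayActorError:
        # Swallow only the actor-death error so the driver keeps running.
        return False, None
```

In a release test, each serve.* call could be routed through such a helper (passing the function and its arguments) so that a killed master_actor does not take down the driver. Whether silently skipping the failed call is the desired behavior is exactly the open question raised here.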

Conclusion: the problem is that we no longer catch RayActorError, and the uncaught exception crashes the driver. In the past, we manually caught RayActorError to retry actor tasks, but we stopped doing so once we began relying on Ray's actor retry mechanism (related PR: 4155d58).
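For illustration, manually catching RayActorError to retry a call might look roughly like the sketch below. This is not the code that the referenced commit removed; `retry_on_actor_error`, `max_attempts`, and `delay_s` are hypothetical names, and the fallback exception class is only there so the sketch runs without Ray installed.

```python
import time

try:
    from ray.exceptions import RayActorError
except ImportError:
    # Assumption: stand-in class so this sketch runs without Ray installed.
    class RayActorError(Exception):
        pass


def retry_on_actor_error(fn, max_attempts=3, delay_s=0.0):
    """Call fn(), retrying when it raises RayActorError.

    Re-raises the last RayActorError once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RayActorError:
            if attempt == max_attempts:
                raise
            # Give the actor a moment to restart before trying again.
            time.sleep(delay_s)
```

A driver-side loop like this keeps the retry policy visible in the test itself, at the cost of duplicating what Ray's built-in actor retries are meant to handle.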

NOTE

This is a hotfix, and I am not sure it is the behavior you want. Feel free to reject the PR if a different solution is needed.

Related issue number

Closes #8915

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

rkooo567 requested a review from edoakes on June 14, 2020 05:13
@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27117/
