Description
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
TL;DR
When an actor task fails because the actor died, it throws an ActorTaskException even though max_retries is set to -1. It should not throw an exception in this case.
Story
- "[Release] Serve failure test check fails; Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress" #8915 discovered that the serve_failure test consistently fails with a random GCS error.
- It turns out the issue was that serve.create_backend throws an ActorTaskException because the master actor was dead. I had an initial hotfix here: [Serve] Serve failure test hotfix. #8928
- Edward pointed out that if max_retries=-1, it should not throw ActorTaskException, and that's why the Serve code didn't catch ActorTaskException when it calls ray.get(master.create_backend.remote()).
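The expected semantics can be sketched in plain Python, without Ray. This is only an illustration of the behavior the issue asks for, not Ray's implementation: ActorDiedError, call_with_retries, and make_flaky_task are hypothetical names, and the assumption is that max_retries=-1 means "retry indefinitely", so the caller never sees the actor-death error.

```python
import itertools


class ActorDiedError(Exception):
    """Hypothetical stand-in for the error raised when the target actor is dead."""


def call_with_retries(task, max_retries):
    """Run task(), retrying whenever it raises ActorDiedError.

    max_retries == -1 is assumed to mean "retry forever", so the caller
    never observes ActorDiedError. Other exceptions propagate immediately.
    """
    for attempt in itertools.count():
        try:
            return task()
        except ActorDiedError:
            # Give up only when a finite retry budget is exhausted.
            if max_retries != -1 and attempt >= max_retries:
                raise


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ActorDiedError("actor died")
        return "ok"

    return task
```

With a finite budget such as max_retries=2, a task that fails five times still surfaces ActorDiedError to the caller; with max_retries=-1 the same task eventually returns its result.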
Question
- Should we have a timeout warning for the case where the actor task keeps failing because of an application error?
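One possible shape for such a warning, as a rough sketch only: keep retrying indefinitely, but emit a single warning once the task has been failing for longer than a threshold. The function name retry_forever_with_warning and its parameters (warn_after_s, clock, sleep) are hypothetical and not part of Ray; the clock and sleep arguments are injectable only to make the sketch easy to exercise.

```python
import time
import warnings


def retry_forever_with_warning(task, warn_after_s,
                               clock=time.monotonic, sleep=time.sleep):
    """Retry task() until it succeeds, warning once if it takes too long.

    Hypothetical helper: after warn_after_s seconds of consecutive
    failures, emit one warning that includes the most recent error,
    then keep retrying.
    """
    start = clock()
    warned = False
    while True:
        try:
            return task()
        except Exception as exc:
            elapsed = clock() - start
            if not warned and elapsed >= warn_after_s:
                warnings.warn(
                    f"task still failing after {elapsed:.1f}s; "
                    f"last error: {exc!r}"
                )
                warned = True
            # Short pause between attempts to avoid a busy loop.
            sleep(0.01)
```

The warning does not stop the retries; it only makes a task that is stuck on an application error visible to the user instead of hanging silently.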
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
Run ./ci/long_running_test/workloads/serve_failure.py. serve_failure.py will fail within one minute with the error described in #8915. Note that the error message is not accurate: it appears because the ActorTaskException is uncaught, which breaks the driver and in turn causes the Raylet/GCS server to exit.
The error occurs because serve_failure.py kills the master actor with some probability, and then these calls raise an ActorTaskError exception, which crashes the driver:
Line 242 in 1583cd1: ray.get(
Line 262 in 1583cd1: ray.get(master_actor.delete_backend.remote(backend_tag))
Desired End Result:
serve_failure.py should not fail.
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.