[Serve] Failure test uses ray.kill(no_restart=True) #8949

@rkooo567

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

TL;DR
When an actor task fails because the actor died, it throws an ActorTaskException even though max_retries is set to -1. It should not throw an exception in this case.

Story

Question

  • Should we emit a timeout warning when an actor task fails because of an application error?
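
One possible reading of this question is "warn the caller when a task has been pending for longer than expected, instead of hanging silently". The sketch below illustrates that idea with the standard library only. It is not Ray code: `get_with_warning` and `warn_after_s` are hypothetical names introduced for illustration, and a worker thread stands in for the pending actor task.

```python
import concurrent.futures
import time
import warnings


def get_with_warning(fn, warn_after_s=1.0):
    """Run fn() in a worker thread and warn once if it is still pending
    after warn_after_s seconds, then keep waiting for the result.

    Hypothetical helper: a thread stands in for a pending actor task.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=warn_after_s)
        except concurrent.futures.TimeoutError:
            if future.done():
                # fn itself raised a timeout error; do not mask it.
                raise
            warnings.warn(
                f"task still pending after {warn_after_s}s",
                RuntimeWarning,
            )
            # Block until the task finishes (or re-raise its exception).
            return future.result()


def slow_task():
    time.sleep(0.2)
    return "done"


if __name__ == "__main__":
    # Emits a RuntimeWarning, then prints the result once the task finishes.
    print(get_with_warning(slow_task, warn_after_s=0.01))
```

The warning does not change the retry behavior; it only makes a long wait visible to the user.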

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Run ./ci/long_running_test/workloads/serve_failure.py. It fails within one minute with the error described in #8915. Note that the error message there is not accurate: the failure happens because ActorTaskException is uncaught, which breaks the driver and in turn causes the Raylet/GCS server to exit.

The error occurs because serve_failure.py kills the master actor with some probability, and then this

ray.get(
or this
ray.get(master_actor.delete_backend.remote(backend_tag))
throws an ActorTaskError exception, which crashes the driver.
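
Until the retry behavior is fixed, a driver-side workaround could be to catch the actor-death exception and retry the call. The following is a minimal sketch of that pattern, not the actual fix. `ActorDiedError`, `call_with_retry`, and `FlakyMaster` are hypothetical stand-ins so the example runs without Ray; in the workload, the wrapped call would be the `ray.get(...)` line above and the caught type would be whichever exception Ray actually raises.

```python
import time


class ActorDiedError(Exception):
    """Hypothetical stand-in for the actor-death exception described above."""


def call_with_retry(call, retriable=(ActorDiedError,), max_attempts=5, delay_s=0.0):
    """Run call() and retry when it raises one of the retriable exceptions.

    Re-raises the last error if every attempt fails.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return call()
        except retriable as err:
            last_error = err
            time.sleep(delay_s)  # give the actor time to restart
    raise last_error


class FlakyMaster:
    """Fake master actor whose first `failures` calls raise ActorDiedError."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def delete_backend(self, backend_tag):
        self.calls += 1
        if self.calls <= self.failures:
            raise ActorDiedError("master actor died")
        return f"deleted {backend_tag}"


master = FlakyMaster(failures=2)
# The first two calls fail, the third succeeds, so the driver keeps running.
result = call_with_retry(lambda: master.delete_backend("backend-1"))
print(result)
```

A bounded `max_attempts` is used here so that a permanently dead actor still surfaces as an error rather than looping forever.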

Desired End Result:

serve_failure.py should not fail.

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
