[Serve] Failure test uses ray.kill(no_restart=True) #8949

### What is the problem?

*Ray version and other system information (Python version, TensorFlow version, OS):*

# TL;DR
When an actor task fails because the actor died, it throws an `ActorTaskException` even though `max_retries` is set to `-1`. It should not throw an exception in this case.

# Story
- https://github.com/ray-project/ray/issues/8915 discovered that the `serve_failure` test consistently fails with a random GCS error.
- It turned out that `serve.create_backend` threw an `ActorTaskException` because the master actor was dead. My initial hotfix is here: https://github.com/ray-project/ray/pull/8928
- Edward pointed out that if `max_retries=-1`, it should not throw `ActorTaskException`, which is why the serve code doesn't catch `ActorTaskException` when it calls `ray.get(master.create_backend.remote())`.
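
To make the expected semantics concrete, here is a minimal, Ray-free sketch of what "retry forever when `max_retries` is `-1`" could look like. `ActorDiedError`, `call_with_retries`, and `make_flaky_task` are hypothetical names made up for illustration; this is not Ray's actual retry code.

```python
import itertools


class ActorDiedError(Exception):
    """Hypothetical stand-in for the error this issue calls ActorTaskException."""


def call_with_retries(task, max_retries):
    """Run task() and retry it whenever it raises ActorDiedError.

    max_retries == -1 means retry without limit (the behavior expected here);
    max_retries >= 0 allows that many retries after the first attempt.
    """
    for attempt in itertools.count():
        try:
            return task()
        except ActorDiedError:
            # Only surface the error once a finite retry budget is used up.
            if max_retries != -1 and attempt >= max_retries:
                raise


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then returns "ok"."""
    remaining = {"count": failures_before_success}

    def task():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ActorDiedError("simulated: the actor died")
        return "ok"

    return task
```

With `max_retries=-1` the caller never sees `ActorDiedError`, no matter how many times the simulated actor dies before the call succeeds. With a finite budget the error surfaces once the budget is exhausted.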

# Question
- Should we emit a timeout warning in case the actor task failed because of an application error?
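
One possible shape for such a warning, sketched with the standard library only: block on the result, but log a warning every time a fixed interval passes without one. `get_with_warning` and its parameters are hypothetical, and `concurrent.futures` stands in for actor tasks here. In Ray, `ray.wait` with a `timeout` could presumably play the same role as `concurrent.futures.wait`.

```python
import concurrent.futures
import logging
import time

logger = logging.getLogger("actor_task_watchdog")


def get_with_warning(future, warn_after_s):
    """Wait for the future's result, logging a warning per elapsed interval.

    Returns (result, number_of_warnings_logged). It never gives up by itself,
    so the caller still blocks until the task finishes; the warnings only make
    a long wait visible.
    """
    warnings_logged = 0
    while True:
        done, _ = concurrent.futures.wait([future], timeout=warn_after_s)
        if done:
            # result() re-raises any exception raised inside the task.
            return future.result(), warnings_logged
        warnings_logged += 1
        logger.warning(
            "Task still pending after %.2f s; it may be stuck retrying.",
            warnings_logged * warn_after_s,
        )


def slow_task(duration_s):
    """Simulated long-running task."""
    time.sleep(duration_s)
    return "done"
```

For example, submitting `slow_task(0.3)` to a `ThreadPoolExecutor` and calling `get_with_warning(future, warn_after_s=0.05)` logs a few warnings and then returns `"done"`.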


### Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have **no external library dependencies** (i.e., use fake or mock data / environments):

Run `./ci/long_running_test/workloads/serve_failure.py`. `serve_failure.py` will fail within one minute with the error described in https://github.com/ray-project/ray/issues/8915. Note that the error message is misleading: it appears because the `ActorTaskException` is uncaught, which crashes the driver, which in turn causes the Raylet/GCS server to exit.

The error occurs because `serve_failure.py` kills the master actor with some probability, and then either this call https://github.com/ray-project/ray/blob/1583cd14ef14e8aac19ce38f80e25feeed278a39/python/ray/serve/api.py#L242 or this one https://github.com/ray-project/ray/blob/1583cd14ef14e8aac19ce38f80e25feeed278a39/python/ray/serve/api.py#L262 throws an `ActorTaskError` exception, which crashes the driver.
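
The failure mode can be imitated without Ray. The sketch below is a simplified simulation, not the real workload: `FakeMaster`, `ActorDiedError`, and `run_workload` are hypothetical, and the simulated master stays dead once killed, whereas the real test depends on Ray's restart behavior.

```python
import random


class ActorDiedError(Exception):
    """Hypothetical stand-in for the exception raised when the master is dead."""


class FakeMaster:
    """Simulated master actor that rejects calls once it has been killed."""

    def __init__(self):
        self.alive = True

    def create_backend(self, name):
        if not self.alive:
            raise ActorDiedError("simulated: master actor is dead")
        return name


def run_workload(master, iterations, kill_probability, rng):
    """Randomly kill the master, then call it without any guard.

    Mirrors the unguarded driver: the first ActorDiedError propagates and
    ends the loop. Returns the number of calls that completed.
    """
    completed = 0
    for i in range(iterations):
        if rng.random() < kill_probability:
            master.alive = False  # simulated kill, never restarted here
        master.create_backend(f"backend-{i}")  # raises once the master is dead
        completed += 1
    return completed
```

With `kill_probability=0.0` every iteration completes. With any positive probability the loop eventually stops at the first call made after a kill, which is the analogue of the driver crash described above.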

## Desired End Result:
`serve_failure.py` should not fail.

If we cannot run your script, we cannot fix your issue.

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/latest/installation.html).
