[Serve] Improve ray serve gRPC support #58851

@jpatra72

Description

Possible Improvement 1: The gRPC status code and detail message that the deployment sets on the gRPC context seem to be discarded here and replaced by the INTERNAL status code and the exception stack trace. The status code carries semantics that the client needs in order to behave correctly (e.g. whether a request is retryable or not). In particular, if the deployment has set INVALID_ARGUMENT (not retryable) or RESOURCE_EXHAUSTED (retryable), that distinction is lost when the Ray Serve code replaces it with the generic INTERNAL status code. See the list of gRPC status codes here. This could be a case of user error, but I poked around for some time and couldn't figure out how to get the gRPC status code and details passed along from a streaming generator to the proxy properly.
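To make the request concrete, here is a minimal sketch of the behavior I would expect. It does not use Ray Serve or the `grpc` package: `FakeContext`, `final_status`, and the two handlers are hypothetical stand-ins, and the `StatusCode` enum only mirrors a few of the canonical gRPC codes. The idea is that a code set explicitly by the deployment survives, and INTERNAL is only a fallback.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class StatusCode(Enum):
    # Subset of the canonical gRPC status codes, with their standard numbers.
    OK = 0
    INVALID_ARGUMENT = 3
    RESOURCE_EXHAUSTED = 8
    INTERNAL = 13


class FakeContext:
    """Hypothetical stand-in for the gRPC context a deployment writes to."""

    def __init__(self) -> None:
        self.code: Optional[StatusCode] = None
        self.details: Optional[str] = None

    def set_code(self, code: StatusCode) -> None:
        self.code = code

    def set_details(self, details: str) -> None:
        self.details = details


def final_status(handler: Callable[[FakeContext], None]) -> Tuple[StatusCode, str]:
    """Run the handler and decide which status the client should see."""
    context = FakeContext()
    try:
        handler(context)
    except Exception as exc:
        if context.code is not None and context.code is not StatusCode.OK:
            # The deployment made an explicit choice: keep it.
            return context.code, context.details or str(exc)
        # Nothing was set: fall back to the generic code.
        return StatusCode.INTERNAL, str(exc)
    code = context.code if context.code is not None else StatusCode.OK
    return code, context.details or ""


def rejecting_handler(context: FakeContext) -> None:
    # Sets a non-retryable code before failing.
    context.set_code(StatusCode.INVALID_ARGUMENT)
    context.set_details("field 'prompt' must not be empty")
    raise ValueError("empty prompt")


def crashing_handler(context: FakeContext) -> None:
    # Fails without touching the context.
    raise RuntimeError("unexpected failure")


print(final_status(rejecting_handler))
print(final_status(crashing_handler))
```

With this logic the first handler reports INVALID_ARGUMENT with its own detail message, and only the second one ends up as INTERNAL.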

Possible Improvement 2: The current implementation packs the full stack trace into the status detail string here. This proto is sent as part of the HTTP/2 trailer, which has a fixed size limit. If the stack trace is long, the limit is hit and a new status code indicating the trailer length issue is raised instead. This seems to cause the loss of the original status code.
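One possible mitigation, sketched below under assumptions: put only the one-line exception summary into the status details and cap it at a byte budget, leaving the full stack trace for the server logs. `compact_details`, `truncate_utf8`, and the 1024-byte default are hypothetical names and values chosen for illustration; the actual trailer limit depends on the gRPC configuration and is not something I have verified.

```python
import traceback


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding is at most max_bytes long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    marker = "...[truncated]"  # ASCII, so len(marker) is also its byte length
    if max_bytes < len(marker):
        # Budget too small for the marker: hard cut only.
        return encoded[:max_bytes].decode("utf-8", errors="ignore")
    keep = max_bytes - len(marker)
    # errors="ignore" drops a multi-byte character split at the cut point.
    return encoded[:keep].decode("utf-8", errors="ignore") + marker


def compact_details(exc: BaseException, max_bytes: int = 1024) -> str:
    """Build a short status detail from the exception summary line only."""
    summary = "".join(traceback.format_exception_only(type(exc), exc)).strip()
    return truncate_utf8(summary, max_bytes)


try:
    raise ValueError("x" * 5000)
except ValueError as err:
    details = compact_details(err, max_bytes=256)

print(len(details.encode("utf-8")))
```

Because the details stay within the budget, the original status code would not be displaced by a trailer-size error.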

Use case

Just to highlight one practical impact of not being able to influence the gRPC status code:
I've noticed I have to keep adding logic on the client side to classify each INTERNAL error as "retryable" or "not retryable". This is very brittle because it relies on string parsing of the error details, and also very disruptive because I have to publish a new client Python package and have all users update each time a new exception type is discovered. If I could influence the gRPC status code on the server side, I could just make the update there to map the internal exception type to a gRPC code that implies either "retryable" or "not retryable". That would make the client much simpler and would not require my users to update to get the client to behave correctly with respect to retries.
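A rough sketch of the split I have in mind, with no Ray Serve or `grpc` imports: the server owns a table from exception type to status code, and the client only looks at the code. The table contents, `code_for_exception`, `is_retryable`, and the choice of which codes count as retryable are all assumptions for illustration; the enum mirrors a few canonical gRPC codes.

```python
from enum import Enum


class StatusCode(Enum):
    # Subset of the canonical gRPC status codes, with their standard numbers.
    INVALID_ARGUMENT = 3
    RESOURCE_EXHAUSTED = 8
    INTERNAL = 13
    UNAVAILABLE = 14


# Server side: hypothetical mapping, updated without shipping a new client.
EXCEPTION_TO_CODE = {
    ValueError: StatusCode.INVALID_ARGUMENT,
    MemoryError: StatusCode.RESOURCE_EXHAUSTED,
    ConnectionError: StatusCode.UNAVAILABLE,
}


def code_for_exception(exc: BaseException) -> StatusCode:
    """Pick a status code for an exception, falling back to INTERNAL."""
    # Walk the MRO so subclasses inherit their parent's mapping.
    for cls in type(exc).__mro__:
        if cls in EXCEPTION_TO_CODE:
            return EXCEPTION_TO_CODE[cls]
    return StatusCode.INTERNAL


# Client side: the retry decision depends on the code alone (assumed policy).
RETRYABLE_CODES = frozenset({StatusCode.RESOURCE_EXHAUSTED, StatusCode.UNAVAILABLE})


def is_retryable(code: StatusCode) -> bool:
    return code in RETRYABLE_CODES


for error in (ValueError("bad input"), MemoryError(), ConnectionResetError(), KeyError("k")):
    code = code_for_exception(error)
    print(type(error).__name__, code.name, is_retryable(code))
```

When a new exception type shows up, only `EXCEPTION_TO_CODE` on the server changes; the client needs no string parsing and no new release.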

Metadata

Labels

- P1: Issue that should be fixed within a few weeks
- enhancement: Request for new feature and/or capability
- serve: Ray Serve Related Issue
- usability
