Ray client connect retries not sufficient: sometimes get 502 error #13446

@ericl

Description

I don't believe channel_ready_future() is sufficient to guarantee that the connection is actually usable: even after it resolves, I still sometimes see a 502 error from the data client, as in the log below.

How about we issue a health-check RPC once the channel reports ready, and destroy and recreate the channel if that RPC fails?

cc @barakmich

Got Error from data channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Received http2 header with status: 502"
	debug_error_string = "{"created":"@1610606646.739755640","description":"Received http2 :status header with non-200 OK status","file":"src/core/ext/filters/http/client/http_client_filter.cc","file_line":129,"grpc_message":"Received http2 header with status: 502","grpc_status":14,"value":"502"}"
>
Got Error from logger channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Received http2 header with status: 502"
	debug_error_string = "{"created":"@1610606646.741104491","description":"Received http2 :status header with non-200 OK status","file":"src/core/ext/filters/http/client/http_client_filter.cc","file_line":129,"grpc_message":"Received http2 header with status: 502","grpc_status":14,"value":"502"}"
>
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/eric/Desktop/ray/python/ray/util/client/dataclient.py", line 76, in _data_main
    raise e
  File "/home/eric/Desktop/ray/python/ray/util/client/dataclient.py", line 61, in _data_main
    for response in resp_stream:
  File "/home/eric/.local/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/home/eric/.local/lib/python3.6/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Received http2 header with status: 502"
	debug_error_string = "{"created":"@1610606646.739755640","description":"Received http2 :status header with non-200 OK status","file":"src/core/ext/filters/http/client/http_client_filter.cc","file_line":129,"grpc_message":"Received http2 header with status: 502","grpc_status":14,"value":"502"}"
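The health-check-and-recreate proposal could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not Ray's implementation: `connect_with_health_check` and `grpc_health_check` are hypothetical helpers, and the probe assumes the server exposes the standard gRPC health service (`grpc.health.v1`, from the `grpcio-health-checking` package), which the Ray client server may not. Any cheap unary RPC could serve as the probe instead. Readiness from `channel_ready_future()` is treated only as a precondition. Each attempt builds a fresh channel, and a channel that fails the probe is closed rather than reused.

```python
import time
from typing import Callable, Optional, TypeVar

C = TypeVar("C")  # channel type, e.g. grpc.Channel


def connect_with_health_check(
    create_channel: Callable[[], C],
    health_check: Callable[[C], None],
    close_channel: Callable[[C], None],
    max_attempts: int = 5,
    initial_backoff_s: float = 0.5,
    max_backoff_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> C:
    """Return a channel that has passed one health-check probe.

    Hypothetical helper: every attempt creates a new channel, probes it,
    and closes it if the probe raises, so a bad channel is never reused.
    """
    backoff = initial_backoff_s
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        channel = create_channel()
        try:
            health_check(channel)  # raises on failure (e.g. a 502 from a proxy)
            return channel
        except Exception as e:  # a sketch; real code may narrow this
            last_error = e
            close_channel(channel)  # destroy the bad channel before retrying
            if attempt < max_attempts:
                sleep(backoff)  # exponential backoff between attempts
                backoff = min(backoff * 2, max_backoff_s)
    raise ConnectionError(
        f"channel still unhealthy after {max_attempts} attempts"
    ) from last_error


def grpc_health_check(channel, timeout_s: float = 5.0) -> None:
    """Example probe for a grpc.Channel.

    Assumes grpcio and grpcio-health-checking are installed and that the
    server implements the grpc.health.v1 Health service (an assumption).
    """
    import grpc
    from grpc_health.v1 import health_pb2, health_pb2_grpc

    # Readiness first; raises grpc.FutureTimeoutError if it never arrives.
    grpc.channel_ready_future(channel).result(timeout=timeout_s)
    # Then a real round trip, which would surface a proxy's 502 as an RpcError.
    stub = health_pb2_grpc.HealthStub(channel)
    resp = stub.Check(health_pb2.HealthCheckRequest(), timeout=timeout_s)
    if resp.status != health_pb2.HealthCheckResponse.SERVING:
        raise RuntimeError(f"server not serving: {resp.status}")
```

Wiring it up might look like `connect_with_health_check(lambda: grpc.insecure_channel(address), grpc_health_check, lambda ch: ch.close())`, where `address` is a placeholder for the client's server address. This only hardens the initial connect. A proxy could still return a 502 on a later streaming call, so the data and logger channels would presumably need their own handling for UNAVAILABLE.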

Metadata

Labels: P1 (Issue that should be fixed within a few weeks), bug (Something that is supposed to be working, but isn't)
