Skip to content

[Release] Serve failure test check fails; Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress #8915

@rkooo567

Description

@rkooo567

What is the problem?

Serve failure test fails with a check failure Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress. This didn't happen before, and we could succeed to run it for many days.

Redis failed to start, retrying now.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0612 08:00:30.902362   570   570 global_state_accessor.cc:25] Redis server address = 172.31.27.64:6379, is test flag = 0
I0612 08:00:30.903203   570   570 redis_client.cc:141] RedisClient connected.
I0612 08:00:30.903295   570   570 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0612 08:00:30.904644   570   570 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
I0612 08:00:31.630206   570   570 global_state_accessor.cc:25] Redis server address = 172.31.27.64:6379, is test flag = 0
I0612 08:00:31.642829   570   570 redis_client.cc:141] RedisClient connected.
I0612 08:00:31.643057   570   570 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0612 08:00:31.643580   570   570 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
E0612 08:00:35.038355   570   772 task_manager.cc:301] infinite retries left for task 44ee453cd1e8e28345b95b1c0100, attempting to resubmit.
E0612 08:00:35.038854   570   772 core_worker.cc:383] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve.master, class_name=ServeMaster, function_name=create_backend, function_hash=}, task_id=44ee453cd1e8e28345b95b1c0100, job_id=0100, num_args=6, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
E0612 08:00:40.664220   570   772 task_manager.cc:320] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve.master, class_name=ServeMaster, function_name=create_backend, function_hash=}, task_id=44ee453cd1e8e28345b95b1c0100, job_id=0100, num_args=6, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Traceback (most recent call last):
  File "workloads/serve_failure.py", line 122, in <module>
    RandomTest(max_endpoints=(num_nodes * cpus_per_node) - 4).run()
  File "workloads/serve_failure.py", line 68, in __init__
    self.create_endpoint()
  File "workloads/serve_failure.py", line 82, in create_endpoint
    serve.create_backend(new_endpoint, handler)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/serve/api.py", line 33, in check
    return f(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/serve/api.py", line 243, in create_backend
    replica_config))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 1476, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2020-06-12 08:00:41,418 ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-06-12 08:00:41,418 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
E0612 08:00:44.189813   570   570 raylet_client.cc:90] IOError: [RayletClient] Connection closed unexpectedly. [RayletClient] Failed to disconnect from raylet.
F0612 08:00:46.667949   570   772 service_based_gcs_client.cc:104]  Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
    @     0x7fdc627230cd  google::LogMessage::Fail()
    @     0x7fdc6272453c  google::LogMessage::SendToLog()
    @     0x7fdc62722da9  google::LogMessage::Flush()
    @     0x7fdc62722fc1  google::LogMessage::~LogMessage()
    @     0x7fdc62700899  ray::RayLog::~RayLog()
    @     0x7fdc624dc79a  ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
    @     0x7fdc624dc9e7  _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7fdc62512a67  ray::rpc::GcsRpcClient::Reconnect()
    @     0x7fdc62514858  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7fdc624e4fcd  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7fdc62439330  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7fdc6295ab9f  boost::asio::detail::scheduler::do_run_one()
    @     0x7fdc6295b7b1  boost::asio::detail::scheduler::run()
    @     0x7fdc6295c702  boost::asio::io_context::run()
    @     0x7fdc624205d0  ray::CoreWorker::RunIOService()
    @     0x7fdc6203c678  execute_native_thread_routine_compat
    @     0x7fdc67fb56ba  start_thread
    @     0x7fdc67ceb41d  clone
/tmp/ray_command_file_11981: line 4:   570 Aborted                 (core dumped) python workloads/serve_failure.py

Reproduction (REQUIRED)

This happens within a minute at a m5.2xlarge instance
Run ./ci/long_running_test/workload/serve_failure.py

Metadata

Metadata

Assignees

Labels

P0Issues that should be fixed in short orderbugSomething that is supposed to be working; but isn'tserveRay Serve Related Issue

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions