-
Notifications
You must be signed in to change notification settings - Fork 7.4k
Closed
Labels
P0Issues that should be fixed in short orderIssues that should be fixed in short orderbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn'tserveRay Serve Related IssueRay Serve Related Issue
Description
What is the problem?
Serve failure test fails with a check failure Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress. This didn't happen before, and we could succeed to run it for many days.
Redis failed to start, retrying now.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0612 08:00:30.902362 570 570 global_state_accessor.cc:25] Redis server address = 172.31.27.64:6379, is test flag = 0
I0612 08:00:30.903203 570 570 redis_client.cc:141] RedisClient connected.
I0612 08:00:30.903295 570 570 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0612 08:00:30.904644 570 570 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
I0612 08:00:31.630206 570 570 global_state_accessor.cc:25] Redis server address = 172.31.27.64:6379, is test flag = 0
I0612 08:00:31.642829 570 570 redis_client.cc:141] RedisClient connected.
I0612 08:00:31.643057 570 570 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0612 08:00:31.643580 570 570 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
E0612 08:00:35.038355 570 772 task_manager.cc:301] infinite retries left for task 44ee453cd1e8e28345b95b1c0100, attempting to resubmit.
E0612 08:00:35.038854 570 772 core_worker.cc:383] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve.master, class_name=ServeMaster, function_name=create_backend, function_hash=}, task_id=44ee453cd1e8e28345b95b1c0100, job_id=0100, num_args=6, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
E0612 08:00:40.664220 570 772 task_manager.cc:320] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve.master, class_name=ServeMaster, function_name=create_backend, function_hash=}, task_id=44ee453cd1e8e28345b95b1c0100, job_id=0100, num_args=6, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Traceback (most recent call last):
File "workloads/serve_failure.py", line 122, in <module>
RandomTest(max_endpoints=(num_nodes * cpus_per_node) - 4).run()
File "workloads/serve_failure.py", line 68, in __init__
self.create_endpoint()
File "workloads/serve_failure.py", line 82, in create_endpoint
serve.create_backend(new_endpoint, handler)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/serve/api.py", line 33, in check
return f(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/serve/api.py", line 243, in create_backend
replica_config))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 1476, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2020-06-12 08:00:41,418 ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-06-12 08:00:41,418 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
E0612 08:00:44.189813 570 570 raylet_client.cc:90] IOError: [RayletClient] Connection closed unexpectedly. [RayletClient] Failed to disconnect from raylet.
F0612 08:00:46.667949 570 772 service_based_gcs_client.cc:104] Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
@ 0x7fdc627230cd google::LogMessage::Fail()
@ 0x7fdc6272453c google::LogMessage::SendToLog()
@ 0x7fdc62722da9 google::LogMessage::Flush()
@ 0x7fdc62722fc1 google::LogMessage::~LogMessage()
@ 0x7fdc62700899 ray::RayLog::~RayLog()
@ 0x7fdc624dc79a ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
@ 0x7fdc624dc9e7 _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fdc62512a67 ray::rpc::GcsRpcClient::Reconnect()
@ 0x7fdc62514858 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7fdc624e4fcd ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7fdc62439330 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7fdc6295ab9f boost::asio::detail::scheduler::do_run_one()
@ 0x7fdc6295b7b1 boost::asio::detail::scheduler::run()
@ 0x7fdc6295c702 boost::asio::io_context::run()
@ 0x7fdc624205d0 ray::CoreWorker::RunIOService()
@ 0x7fdc6203c678 execute_native_thread_routine_compat
@ 0x7fdc67fb56ba start_thread
@ 0x7fdc67ceb41d clone
/tmp/ray_command_file_11981: line 4: 570 Aborted (core dumped) python workloads/serve_failure.pyReproduction (REQUIRED)
This happens within a minute at a m5.2xlarge instance
Run ./ci/long_running_test/workload/serve_failure.py
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P0Issues that should be fixed in short orderIssues that should be fixed in short orderbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn'tserveRay Serve Related IssueRay Serve Related Issue