Closed
Labels
bug (Something that is supposed to be working; but isn't)
Description
What is the problem?
Nodes occasionally crash in HandleGetCoreWorkerStats. This RPC endpoint is called when the dashboard collects raylet stats from the node manager. We have received reports that large clusters fail when the dashboard is turned on, which could be related.
(pid=3123) *** Aborted at 1588212450 (unix time) try "date -d @1588212450" if you are using GNU date ***
(pid=3123) PC: @ 0x0 (unknown)
(pid=3123) *** SIGSEGV (@0xfffffffffffffff8) received by PID 3123 (TID 0x7fa4c9801700) from PID 18446744073709551608; stack trace: ***
(pid=3123) @ 0x7fa4d5bd1890 (unknown)
(pid=3123) @ 0x7fa4d3035ac2 std::string::_Rep::_S_empty_rep()
(pid=3123) @ 0x7fa4d3036856 std::string::_Rep::_M_grab()
(pid=3123) @ 0x7fa4d303689d std::string::string()
(pid=3123) @ 0x7fa4d35e98ef ray::CoreWorker::HandleGetCoreWorkerStats()
(pid=3123) @ 0x7fa4d35f758c _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc14ServerCallImplINS4_24CoreWorkerServiceHandlerENS4_25GetCoreWorkerStatsRequestENS4_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=3123) @ 0x7fa4d3a84fef boost::asio::detail::scheduler::do_run_one()
(pid=3123) @ 0x7fa4d3a85b71 boost::asio::detail::scheduler::run()
(pid=3123) @ 0x7fa4d3a86932 boost::asio::io_context::run()
(pid=3123) @ 0x7fa4d35cebb0 ray::CoreWorker::RunIOService()
(pid=3123) @ 0x7fa4d302bc5c execute_native_thread_routine_compat
(pid=3123) @ 0x7fa4d5bc66db start_thread
(pid=3123) @ 0x7fa4d58ef88f clone
(pid=raylet) E0430 02:07:30.277542 3059 node_manager.cc:3533] Failed to send get core worker stats request: IOError: 14: Socket closed
(pid=3153) E0430 02:07:30.294281 7976 task_manager.cc:288] 3 retries left for task df0096a15a940274ffffffff0100, attempting to resubmit.
(pid=3153) E0430 02:07:30.294970 7976 core_worker.cc:373] Will resubmit task after a 5000ms delay: Type=NORMAL_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.ray_perf, class_name=, function_name=small_value, function_hash=f888646772c4f36f030d4726c9acaf8d15462753}, task_id=df0096a15a940274ffffffff0100, job_id=0100, num_args=0, num_returns=1
2020-04-30 02:07:31,851 WARNING worker.py:1090 -- A worker died or was killed while executing task 444dbb1944f09a60ffffffff0100.
Ray version and other system information (Python version, TensorFlow version, OS):
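As a reading aid for the crash log above, here is a minimal Python sketch that decodes three raw values copied from it: the abort timestamp, the SIGSEGV fault address, and the "from PID" value. The constant and function names are illustrative and not part of Ray. Treating the fault address as a signed 64-bit integer is an assumption made only to show how far it lies from address zero.

```python
from datetime import datetime, timezone

# Values copied verbatim from the crash log; the names are illustrative.
ABORT_UNIX_TIME = 1588212450
FAULT_ADDRESS = 0xFFFFFFFFFFFFFFF8
REPORTED_SENDER_PID = 18446744073709551608


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Same conversion that `date -d @1588212450` performs, pinned to UTC.
abort_time = datetime.fromtimestamp(ABORT_UNIX_TIME, tz=timezone.utc)
signed_fault_address = to_signed64(FAULT_ADDRESS)

print(abort_time.isoformat())                # 2020-04-30T02:07:30+00:00
print(signed_fault_address)                  # -8
print(REPORTED_SENDER_PID == FAULT_ADDRESS)  # True
```

The decoded time matches the 02:07:30 raylet and worker log lines that follow the stack trace. The "from PID" value is the same number as the fault address, so it is probably not a real sender PID. A fault address of -8 is consistent with a read at a small negative offset from a null pointer, although the trace alone does not establish the root cause.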
Reproduction (REQUIRED)
Unfortunately, this is very hard to reproduce. It happens occasionally in long-running tests and has been seen twice while running a batch of release tests. I suspect the error is most likely to occur when running a large cluster with the dashboard on.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.