Skip to content

[Core][P0][Release Blocker] string op segfault in HandleCoreWorkerStats #8239

@rkooo567

Description

@rkooo567

What is the problem?

Nodes are rarely crashed at HandleGetCoreWorkerStats. This RPC endpoint is called when dashboard collects rayletstats from node manager. We got some report that large clusters are failed when dashboard is turned on. This could be related.

(pid=3123) *** Aborted at 1588212450 (unix time) try "date -d @1588212450" if you are using GNU date ***
(pid=3123) PC: @                0x0 (unknown)
(pid=3123) *** SIGSEGV (@0xfffffffffffffff8) received by PID 3123 (TID 0x7fa4c9801700) from PID 18446744073709551608; stack trace: ***
(pid=3123)     @     0x7fa4d5bd1890 (unknown)
(pid=3123)     @     0x7fa4d3035ac2 std::string::_Rep::_S_empty_rep()
(pid=3123)     @     0x7fa4d3036856 std::string::_Rep::_M_grab()
(pid=3123)     @     0x7fa4d303689d std::string::string()
(pid=3123)     @     0x7fa4d35e98ef ray::CoreWorker::HandleGetCoreWorkerStats()
(pid=3123)     @     0x7fa4d35f758c _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc14ServerCallImplINS4_24CoreWorkerServiceHandlerENS4_25GetCoreWorkerStatsRequestENS4_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=3123)     @     0x7fa4d3a84fef boost::asio::detail::scheduler::do_run_one()
(pid=3123)     @     0x7fa4d3a85b71 boost::asio::detail::scheduler::run()
(pid=3123)     @     0x7fa4d3a86932 boost::asio::io_context::run()
(pid=3123)     @     0x7fa4d35cebb0 ray::CoreWorker::RunIOService()
(pid=3123)     @     0x7fa4d302bc5c execute_native_thread_routine_compat
(pid=3123)     @     0x7fa4d5bc66db start_thread
(pid=3123)     @     0x7fa4d58ef88f clone
(pid=raylet) E0430 02:07:30.277542  3059 node_manager.cc:3533] Failed to send get core worker stats request: IOError: 14: Socket closed
(pid=3153) E0430 02:07:30.294281  7976 task_manager.cc:288] 3 retries left for task df0096a15a940274ffffffff0100, attempting to resubmit.
(pid=3153) E0430 02:07:30.294970  7976 core_worker.cc:373] Will resubmit task after a 5000ms delay: Type=NORMAL_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.ray_perf, class_name=, function_name=small_value, function_hash=f888646772c4f36f030d4726c9acaf8d15462753}, task_id=df0096a15a940274ffffffff0100, job_id=0100, num_args=0, num_returns=1
2020-04-30 02:07:31,851 WARNING worker.py:1090 -- A worker died or was killed while executing task 444dbb1944f09a60ffffffff0100.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Unfortunately, reproduction is very hard. It happens sometimes at long running tests. It has been discovered twice while doing bunch of release tests. I assume this error occurs highly likely when we run a large cluster with dashboard on.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Metadata

Labels

bugSomething that is supposed to be working; but isn't

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions