[Core][P0][Release Blocker] string op segfault in HandleCoreWorkerStats

### What is the problem?
Nodes occasionally crash in `HandleGetCoreWorkerStats`. This RPC endpoint is called when the dashboard collects raylet stats from the node manager. We have received reports of large clusters failing when the dashboard is turned on, which could be related.

```bash
(pid=3123) *** Aborted at 1588212450 (unix time) try "date -d @1588212450" if you are using GNU date ***
(pid=3123) PC: @                0x0 (unknown)
(pid=3123) *** SIGSEGV (@0xfffffffffffffff8) received by PID 3123 (TID 0x7fa4c9801700) from PID 18446744073709551608; stack trace: ***
(pid=3123)     @     0x7fa4d5bd1890 (unknown)
(pid=3123)     @     0x7fa4d3035ac2 std::string::_Rep::_S_empty_rep()
(pid=3123)     @     0x7fa4d3036856 std::string::_Rep::_M_grab()
(pid=3123)     @     0x7fa4d303689d std::string::string()
(pid=3123)     @     0x7fa4d35e98ef ray::CoreWorker::HandleGetCoreWorkerStats()
(pid=3123)     @     0x7fa4d35f758c _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc14ServerCallImplINS4_24CoreWorkerServiceHandlerENS4_25GetCoreWorkerStatsRequestENS4_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=3123)     @     0x7fa4d3a84fef boost::asio::detail::scheduler::do_run_one()
(pid=3123)     @     0x7fa4d3a85b71 boost::asio::detail::scheduler::run()
(pid=3123)     @     0x7fa4d3a86932 boost::asio::io_context::run()
(pid=3123)     @     0x7fa4d35cebb0 ray::CoreWorker::RunIOService()
(pid=3123)     @     0x7fa4d302bc5c execute_native_thread_routine_compat
(pid=3123)     @     0x7fa4d5bc66db start_thread
(pid=3123)     @     0x7fa4d58ef88f clone
(pid=raylet) E0430 02:07:30.277542  3059 node_manager.cc:3533] Failed to send get core worker stats request: IOError: 14: Socket closed
(pid=3153) E0430 02:07:30.294281  7976 task_manager.cc:288] 3 retries left for task df0096a15a940274ffffffff0100, attempting to resubmit.
(pid=3153) E0430 02:07:30.294970  7976 core_worker.cc:373] Will resubmit task after a 5000ms delay: Type=NORMAL_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.ray_perf, class_name=, function_name=small_value, function_hash=f888646772c4f36f030d4726c9acaf8d15462753}, task_id=df0096a15a940274ffffffff0100, job_id=0100, num_args=0, num_returns=1
2020-04-30 02:07:31,851 WARNING worker.py:1090 -- A worker died or was killed while executing task 444dbb1944f09a60ffffffff0100.
```
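As a reading aid for the log above, here is a minimal sketch that decodes three raw numbers from it: the fault address, the bogus "from PID" value, and the abort timestamp. The helper name `to_signed64` is made up for illustration. The interpretation is an assumption, not a confirmed root cause: a fault address of -8 is consistent with a read at a small negative offset from a null pointer, which would fit the `std::string` copy shown in the trace.

```python
from datetime import datetime, timezone


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Values copied verbatim from the crash log.
fault_addr = 0xFFFFFFFFFFFFFFF8       # "SIGSEGV (@0xfffffffffffffff8)"
reported_pid = 18446744073709551608   # "from PID 18446744073709551608"
abort_time = 1588212450               # "Aborted at 1588212450 (unix time)"

signed_addr = to_signed64(fault_addr)
crash_utc = datetime.fromtimestamp(abort_time, tz=timezone.utc)

print(signed_addr)                 # -8
print(fault_addr == reported_pid)  # True: the "PID" is just the fault address again
print(crash_utc.isoformat())       # 2020-04-30T02:07:30+00:00
```

The decoded time matches the `E0430 02:07:30` raylet line that follows the stack trace, so the "Socket closed" error appears to be a direct consequence of this worker crash.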
*Ray version and other system information (Python version, TensorFlow version, OS):*

### Reproduction (REQUIRED)

Unfortunately, this is very hard to reproduce. It happens occasionally in long-running tests and has been observed twice while running a batch of release tests. I suspect it is most likely to occur when running a large cluster with the dashboard on.

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/latest/installation.html).

[Core][P0][Release Blocker] string op segfault in HandleCoreWorkerStats #8239
