
[Core] Intermittent SIGSEGV reported for Plasma in Ray Core (NOTE: should add to nightly tests once resolved) #16342

@waleedkadous


Ray version and other system information (Python version, TensorFlow version, OS):

Ray version: 1.4.0
Anyscale version: 0.4.1
MacOS version: 11.4
Python version: 3.7.9

When I run the code at https://github.com/waleedkadous/ray-decision-tree/blob/load-data-in-cluster/cart_with_tree.py 20 times, 17 runs succeed and 3 fail with a SIGSEGV in Plasma:

(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,480 E 113 132] logging.cc:441: *** Aborted at 1623182445 (unix time) try "date -d @1623182445" if you are using GNU date ***
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,480 E 113 132] logging.cc:441: PC: @                0x0 (unknown)
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,641 E 113 132] logging.cc:441: *** SIGSEGV (@0x0) received by PID 113 (TID 0x7fb67f7fe700) from PID 0; stack trace: ***
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,871 E 113 132] logging.cc:441:     @     0x55b88cc6bc4f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,872 E 113 132] logging.cc:441:     @     0x7fb6994c1980 (unknown)
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,947 E 113 132] logging.cc:441:     @     0x55b88c96bac8 dlmalloc
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,948 E 113 132] logging.cc:441:     @     0x55b88c96caf0 plasma::internal_memalign()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,949 E 113 132] logging.cc:441:     @     0x55b88c953b41 plasma::PlasmaAllocator::Memalign()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,950 E 113 132] logging.cc:441:     @     0x55b88c96076e plasma::PlasmaStore::AllocateMemory()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,950 E 113 132] logging.cc:441:     @     0x55b88c960d0b plasma::PlasmaStore::CreateObject()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,951 E 113 132] logging.cc:441:     @     0x55b88c96116d plasma::PlasmaStore::HandleCreateObjectRequest()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,952 E 113 132] logging.cc:441:     @     0x55b88c965579 plasma::CreateRequestQueue::ProcessRequest()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,953 E 113 132] logging.cc:441:     @     0x55b88c966ee6 plasma::CreateRequestQueue::ProcessRequests()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,953 E 113 132] logging.cc:441:     @     0x55b88c95a626 plasma::PlasmaStore::ProcessCreateRequests()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,954 E 113 132] logging.cc:441:     @     0x55b88c96317a plasma::PlasmaStore::ProcessMessage()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,955 E 113 132] logging.cc:441:     @     0x55b88c9551ff std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,955 E 113 132] logging.cc:441:     @     0x55b88c974d86 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZN6plasma6Client6CreateESt8functionIFNS1_6StatusES0_ISB_ENSA_7flatbuf11MessageTypeES8_EEON5boost4asio19basic_stream_socketINSK_7generic15stream_protocolENSK_8executorEEEEUlS3_lS8_E_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,127 E 113 132] logging.cc:441:     @     0x55b88cc2e672 ray::ClientConnection::ProcessMessage()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,128 E 113 132] logging.cc:441:     @     0x55b88cc2b1a8 boost::asio::detail::read_op<>::operator()()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,129 E 113 132] logging.cc:441:     @     0x55b88cc2b55b boost::asio::detail::executor_function<>::do_complete()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,229 E 113 132] logging.cc:441:     @     0x55b88c8005a0 boost::asio::io_context::executor_type::dispatch<>()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,230 E 113 132] logging.cc:441:     @     0x55b88cc2bf53 boost::asio::executor::dispatch<>()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,231 E 113 132] logging.cc:441:     @     0x55b88cc2c148 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,233 E 113 132] logging.cc:441:     @     0x55b88cfe1011 boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,235 E 113 132] logging.cc:441:     @     0x55b88cfe1141 boost::asio::detail::scheduler::run()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,236 E 113 132] logging.cc:441:     @     0x55b88cfe3340 boost::asio::io_context::run()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,236 E 113 132] logging.cc:441:     @     0x55b88c9543fd plasma::PlasmaStoreRunner::Start()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,237 E 113 132] logging.cc:441:     @     0x55b88c901697 std::thread::_State_impl<>::_M_run()
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,238 E 113 132] logging.cc:441:     @     0x55b88d01e020 execute_native_thread_routine
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,238 E 113 132] logging.cc:441:     @     0x7fb6994b66db start_thread
(raylet, ip=172.31.103.6) [2021-06-08 13:00:46,238 E 113 132] logging.cc:441:     @     0x7fb69869871f clone
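
To make the crash path easier to compare across failed runs, the frames can be pulled out of the raylet log lines. The following is a minimal sketch that assumes the glog-style line format shown in the trace above; `extract_frames` and `FRAME_RE` are hypothetical names for illustration and are not part of Ray.

```python
import re

# Matches frame lines of the form seen in the trace above, e.g.
#   "... logging.cc:441:     @     0x55b88c96bac8 dlmalloc"
# The "(@0x0)" in the SIGSEGV line does not match because no whitespace
# follows the "@" there.
FRAME_RE = re.compile(r"@\s+(0x[0-9a-fA-F]+)\s+(.+?)\s*$")


def extract_frames(log_text):
    """Return (address, symbol) pairs in log order, innermost frame first."""
    frames = []
    for line in log_text.splitlines():
        # The "PC: @ 0x0 (unknown)" header is not a stack frame; skip it.
        if "PC: @" in line:
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


# Two lines copied from the trace above.
sample = (
    "(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,947 E 113 132] "
    "logging.cc:441:     @     0x55b88c96bac8 dlmalloc\n"
    "(raylet, ip=172.31.103.6) [2021-06-08 13:00:45,948 E 113 132] "
    "logging.cc:441:     @     0x55b88c96caf0 plasma::internal_memalign()\n"
)

for address, symbol in extract_frames(sample):
    print(address, symbol)
```

Run over the full trace, this would list the chain from `dlmalloc` up through `plasma::PlasmaStore::CreateObject()`, which is the part worth comparing between the failing runs.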

Reproduction (REQUIRED)

  1. Check out the repo on the "load-data-in-cluster" branch (https://github.com/waleedkadous/ray-decision-tree/tree/load-data-in-cluster).
  2. Set RAY_ADDRESS as appropriate.
  3. Run python cart_with_tree.py 20 times; at least one run should produce the above error. For example, this shell command launches 20 runs in the background:
for x in {1..20}; do python cart_with_tree.py -n $x >& log-cart-$x.txt & done
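
Once the runs finish, the logs written by the loop above can be checked for the crash automatically. This is a rough sketch, not part of the original reproduction: `find_crashed_runs` and `chance_of_at_least_one_crash` are hypothetical helpers, the `*** SIGSEGV` marker is taken from the trace in this report, and the probability estimate assumes runs fail independently, which may not hold when all 20 run concurrently against one cluster.

```python
from pathlib import Path


def find_crashed_runs(log_dir=".", pattern="log-cart-*.txt", marker="*** SIGSEGV"):
    """Return the names (sorted lexicographically) of logs containing the marker."""
    crashed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        text = path.read_text(errors="replace")
        if marker in text:
            crashed.append(path.name)
    return crashed


def chance_of_at_least_one_crash(per_run_rate, runs):
    """P(at least one crash in `runs` runs), ASSUMING independent runs."""
    return 1.0 - (1.0 - per_run_rate) ** runs
```

Calling `find_crashed_runs()` in the directory where the loop was run lists the failing logs. With the observed rate of 3 failures in 20 runs, `chance_of_at_least_one_crash(3 / 20, 20)` comes out at roughly 0.96 under the independence assumption, which is consistent with expecting at least one failure per batch of 20.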

Labels: P1 (Issue that should be fixed within a few weeks), bug (Something that is supposed to be working; but isn't)
