[Core] Raylet breaks when many actor tasks are submitted #10585

@wuisawesome

Description

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):
Tested on a 16-core MacBook Pro.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

This is caused by a low ulimit, but we should have a better error message.

Note that ActorPool ensures that there is at most one in-flight task per actor.

import ray
from ray.util import ActorPool

# num_cpus=0 means the number of actors is not capped by the CPU count.
@ray.remote(num_cpus=0)
class DummyActor:

    def __init__(self):
        pass

    def do_stuff(self):
        pass

ray.init()

things = list(range(10000))

# Start four actors per CPU; each actor runs in its own worker process.
nworkers = int(ray.cluster_resources()['CPU']) * 4
actors = [DummyActor.remote() for _ in range(nworkers)]

pool = ActorPool(actors)

# The value v is ignored; each call just runs a no-op actor task.
res = pool.map(lambda a, v: a.do_stuff.remote(), things)

for i, x in enumerate(res):
    if i % 100 == 0:
        print(x)

Output:

(pid=49596) F0904 12:07:30.034916 49596 377699776 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49596) *** Check failure stack trace: ***
(pid=raylet) F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @        0x1083e0112  google::LogMessage::~LogMessage()
(pid=raylet)     @        0x10837cdc5  ray::RayLog::~RayLog()
(pid=raylet)     @        0x107f6f96e  ray::raylet::WorkerPool::StartProcess()
(pid=raylet)     @        0x107f6d04f  ray::raylet::WorkerPool::StartWorkerProcess()
(pid=raylet)     @        0x107f73707  ray::raylet::WorkerPool::PopWorker()
(pid=raylet)     @        0x107ec6923  ray::raylet::NodeManager::DispatchTasks()
(pid=raylet)     @        0x107ed8b09  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x107ed15cf  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x107ecfa86  ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet)     @        0x107f3817a  std::__1::__function::__func<>::operator()()
(pid=raylet)     @        0x1083558ee  ray::ClientConnection::ProcessMessage()
(pid=raylet)     @        0x10835cdb0  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet)     @        0x1087e830e  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @        0x1087dbca1  boost::asio::detail::scheduler::run()
(pid=raylet)     @        0x1087dbb2c  boost::asio::io_context::run()
(pid=raylet)     @        0x107ea7d8a  main
(pid=raylet)     @     0x7fff6a420cc9  start
(pid=49566) F0904 12:07:30.043242 49566 287137216 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49566) *** Check failure stack trace: ***
(pid=49566)     @        0x10c6f44e2  google::LogMessage::~LogMessage()
(pid=49566)     @        0x10c691745  ray::RayLog::~RayLog()
(pid=49566)     @        0x10c2a1b99  ray::raylet::RayletClient::RayletClient()
(pid=49566)     @        0x10c1d1e6a  ray::CoreWorker::CoreWorker()
(pid=49566)     @        0x10c1cfbdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49566)     @        0x10c1ce913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49566)     @        0x10c1cdab7  ray::CoreWorkerProcess::Initialize()
(pid=49566)     @        0x10c13c275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49566)     @        0x10b85ca8f  type_call
(pid=49566)     @        0x10b7d14f3  _PyObject_FastCallKeywords
(pid=49566)     @        0x10b90ee75  call_function
(pid=49566)     @        0x10b90bb92  _PyEval_EvalFrameDefault
(pid=49566)     @        0x10b90046e  _PyEval_EvalCodeWithName
(pid=49566)     @        0x10b7d1a03  _PyFunction_FastCallKeywords
(pid=49566)     @        0x10b90ed67  call_function
(pid=49566)     @        0x10b90cb8d  _PyEval_EvalFrameDefault
(pid=49566)     @        0x10b90046e  _PyEval_EvalCodeWithName
(pid=49566)     @        0x10b963ce0  PyRun_FileExFlags
(pid=49566)     @        0x10b963157  PyRun_SimpleFileExFlags
(pid=49566)     @        0x10b990dc3  pymain_main
(pid=49566)     @        0x10b7a3f2d  main
(pid=49566)     @     0x7fff6a420cc9  start
(pid=49566)     @                0xb  (unknown)
(pid=49563) F0904 12:07:30.037027 49563 245689792 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49563) *** Check failure stack trace: ***
(pid=49563)     @        0x10f06e4e2  google::LogMessage::~LogMessage()
(pid=49563)     @        0x10f00b745  ray::RayLog::~RayLog()
(pid=49563)     @        0x10ec1bb99  ray::raylet::RayletClient::RayletClient()
(pid=49563)     @        0x10eb4be6a  ray::CoreWorker::CoreWorker()
(pid=49563)     @        0x10eb49bdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49563)     @        0x10eb48913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49563)     @        0x10eb47ab7  ray::CoreWorkerProcess::Initialize()
(pid=49563)     @        0x10eab6275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49563)     @        0x10ded3a8f  type_call
(pid=49563)     @        0x10de484f3  _PyObject_FastCallKeywords
(pid=49563)     @        0x10df85e75  call_function
(pid=49563)     @        0x10df82b92  _PyEval_EvalFrameDefault
(pid=49563)     @        0x10df7746e  _PyEval_EvalCodeWithName
(pid=49563)     @        0x10de48a03  _PyFunction_FastCallKeywords
(pid=49563)     @        0x10df85d67  call_function
(pid=49563)     @        0x10df83b8d  _PyEval_EvalFrameDefault
(pid=49563)     @        0x10df7746e  _PyEval_EvalCodeWithName
(pid=49563)     @        0x10dfdace0  PyRun_FileExFlags
(pid=49563)     @        0x10dfda157  PyRun_SimpleFileExFlags
(pid=49563)     @        0x10e007dc3  pymain_main
(pid=49563)     @        0x10de1af2d  main
(pid=49563)     @     0x7fff6a420cc9  start
(pid=49512) E0904 12:07:30.084020 49512 218574848 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49519) E0904 12:07:30.083940 49519 101613568 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49521) E0904 12:07:30.083511 49521 5890048 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49562) F0904 12:07:30.097999 49562 180841920 core_worker.cc:330]  Check failed: _s.ok() Bad status: IOError: Broken pipe
(pid=49562) *** Check failure stack trace: ***
(pid=49562)     @        0x101a884e2  google::LogMessage::~LogMessage()
(pid=49562)     @        0x101a25745  ray::RayLog::~RayLog()
(pid=49562)     @        0x1015661df  ray::CoreWorker::CoreWorker()
(pid=49562)     @        0x101563bdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49562)     @        0x101562913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49562)     @        0x101561ab7  ray::CoreWorkerProcess::Initialize()
(pid=49562)     @        0x1014d0275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49562)     @        0x100bf0a8f  type_call
(pid=49562)     @        0x100b654f3  _PyObject_FastCallKeywords
(pid=49562)     @        0x100ca2e75  call_function
(pid=49562)     @        0x100c9fb92  _PyEval_EvalFrameDefault
(pid=49562)     @        0x100c9446e  _PyEval_EvalCodeWithName
(pid=49562)     @        0x100b65a03  _PyFunction_FastCallKeywords
(pid=49562)     @        0x100ca2d67  call_function
(pid=49562)     @        0x100ca0b8d  _PyEval_EvalFrameDefault
(pid=49562)     @        0x100c9446e  _PyEval_EvalCodeWithName
(pid=49562)     @        0x100cf7ce0  PyRun_FileExFlags
(pid=49562)     @        0x100cf7157  PyRun_SimpleFileExFlags
(pid=49562)     @        0x100d24dc3  pymain_main
(pid=49562)     @        0x100b37f2d  main
(pid=49562)     @     0x7fff6a420cc9  start
(pid=49562)     @                0xb  (unknown)

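The raylet line in the output above reports `system:24: Too many open files`, which is errno 24 (`EMFILE`) and matches the low ulimit mentioned earlier. As a rough sketch of what a friendlier diagnostic could look like, the snippet below uses only Python's standard `resource`, `errno`, and `os` modules to inspect the per-process open-file limit, attempt to raise the soft limit, and turn the errno into an actionable hint. The helper names (`describe_nofile_limit`, `try_raise_nofile_limit`, `explain_errno`) are hypothetical and are not part of Ray; the `resource` module is only available on Unix-like systems.

```python
import errno
import os
import resource


def describe_nofile_limit():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def try_raise_nofile_limit(target):
    """Best-effort attempt to raise the soft open-file limit to `target`.

    Hypothetical helper, not a Ray API. Never lowers the limit and
    returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    # The soft limit cannot exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some platforms reject values above a kernel-level cap;
        # in that case the limit is left unchanged.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def explain_errno(code):
    """Hypothetical helper: map an errno to a more actionable message."""
    message = os.strerror(code)
    if code == errno.EMFILE:
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return (
            f"{message}: the soft open-file limit is {soft} "
            f"(hard limit {hard}); try raising it, e.g. with `ulimit -n`."
        )
    return message


if __name__ == "__main__":
    print("open-file limits (soft, hard):", describe_nofile_limit())
    print(explain_errno(errno.EMFILE))
```

Running a check like this in the shell that launches the script, before `ray.init()`, would show whether the soft limit is small relative to the number of worker processes being started. Raising the limit only affects the current process and the children it spawns afterwards.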
If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Assignees: No one assigned

Labels: P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't)