[Multitenancy] Job ID not found in Raylet for Detached Worker #11437

@simon-mo

Description

What is the problem?

@stephanie-wang and I discovered this problem while debugging a tutorial on Serve.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

  1. RAY_BACKEND_LOG_LEVEL=debug ray start --head
  2. Run this file as a Python script and make sure it exits:
import ray

ray.init(address="auto")

@ray.remote
class Controller:
    def __init__(self):
        self.workers = []

    def list(self):
        return self.workers

    def start(self, actor_class):
        worker = actor_class.options(lifetime="detached").remote()
        ray.get(worker.ping.remote())
        self.workers.append(worker)

ctl = Controller.options(lifetime="detached", name="controller").remote()
print(f"Worker handles {ray.get(ctl.list.remote())}")
  3. Run this file as a separate script. It will block forever:
import ray

ray.init(address="auto")

ctl = ray.get_actor("controller")

@ray.remote
class Worker:
    def __init__(self):
        print("Worker started!")

    def ping(self):
        return "pong"

ray.get(ctl.start.remote(Worker))
print("worker", ray.get(ctl.list.remote()))
  4. Once it hangs, press Ctrl+C, go to the Ray logs, and inspect raylet.out: grep --ignore-case job raylet.out
I1016 08:54:55.621362 70924 193117632 service_based_accessor.cc:92] Reestablishing subscription for job info.
I1016 08:54:55.638834 70924 193117632 service_based_accessor.cc:110] Getting all job info.
I1016 08:54:55.639565 70924 193117632 service_based_accessor.cc:117] Finished getting all job info.
I1016 08:55:01.539261 70924 193117632 worker.cc:123] Assigned worker 01000000ffffffffffffffffffffffffffffffff to job 01000000
I1016 08:55:01.539319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 01000000, driver pid = 71004
I1016 08:55:01.539764 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 01000000, driver pid = 71004
I1016 08:55:01.539840 70924 193117632 node_manager.cc:336] HandleJobStarted 01000000
I1016 08:55:01.546902 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547286 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.547652 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.548303 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 4ea01d51bed991a789781a50e0e8dbfd00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = e0e8dbfd
I1016 08:55:01.556632 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.556875 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Controller, function_name=__init__, function_hash=8ba8d4fb-0588-4001-a7fe-2772faf99d8c}, task_id=ffffffffffffffffdf5a1a8201000000, task_name=controller:Controller.__init__, job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=df5a1a8201000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:01.558889 70924 193117632 worker_pool.cc:325] Worker process 71007 is bound to job 01000000
I1016 08:55:01.559113 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffffdf5a1a8201000000, job id = 01000000
I1016 08:55:01.974562 70924 193117632 worker.cc:123] Assigned worker e32a2b267032d7c9485c534d7f32e1dce8eacb4d to job 01000000
I1016 08:55:01.983006 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983310 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.983757 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:01.984107 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = eb75d8974441f778201a0155bf38150000000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = bf381500
I1016 08:55:02.493301 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 01000000
I1016 08:55:02.493429 70924 193117632 node_manager.cc:1474] Driver (pid=71004) is disconnected. job_id: 01000000
I1016 08:55:02.493961 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 01000000
I1016 08:55:02.494050 70924 193117632 node_manager.cc:353] HandleJobFinished 01000000
I1016 08:55:07.481276 70924 193117632 worker.cc:123] Assigned worker 02000000ffffffffffffffffffffffffffffffff to job 02000000
I1016 08:55:07.481319 70924 193117632 service_based_accessor.cc:29] Adding job, job id = 02000000, driver pid = 71081
I1016 08:55:07.481721 70924 193117632 service_based_accessor.cc:39] Finished adding job, status = OK, job id = 02000000, driver pid = 71081
I1016 08:55:07.481789 70924 193117632 node_manager.cc:336] HandleJobStarted 02000000
I1016 08:55:07.489056 70924 193117632 service_based_accessor.cc:1150] Adding object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489295 70924 193117632 service_based_accessor.cc:1205] Removing object location, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.489627 70924 193117632 service_based_accessor.cc:1166] Finished adding object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.490015 70924 193117632 service_based_accessor.cc:1220] Finished removing object location, status = OK, object id = 6b427e06817f752e052c96de5df7830e00000000, node id = 16895b921f968d2a16359ac3e9de9711c51a20a1, job id = 5df7830e
I1016 08:55:07.500048 70924 193117632 service_based_accessor.cc:850] Adding task, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:07.500170 70924 193117632 node_manager.cc:2256] Submitting task: task_spec={Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Worker, function_name=__init__, function_hash=c66af86d-070e-4c79-a178-29e8ac074eb6}, task_id=ffffffffffffffff7fffa98101000000, task_name=Worker.__init__(), job_id=01000000, num_args=0, num_returns=1, actor_creation_task_spec={actor_id=7fffa98101000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=1}}, task_execution_spec={num_forwards=0}
I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.
I1016 08:55:07.500522 70924 193117632 service_based_accessor.cc:859] Finished adding task, status = OK, task id = ffffffffffffffff7fffa98101000000, job id = 01000000
I1016 08:55:15.822947 70924 193117632 service_based_accessor.cc:1378] Publishing job error, job id = 01000000
I1016 08:55:15.822988 70924 193117632 service_based_accessor.cc:1381] Finished publishing job error, job id = 01000000
I1016 08:59:19.770591 70924 193117632 service_based_accessor.cc:48] Marking job state, job id = 02000000
I1016 08:59:19.770685 70924 193117632 node_manager.cc:1474] Driver (pid=71081) is disconnected. job_id: 02000000
I1016 08:59:19.771344 70924 193117632 service_based_accessor.cc:57] Finished marking job state, status = OK, job id = 02000000
I1016 08:59:19.771391 70924 193117632 node_manager.cc:353] HandleJobFinished 02000000

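Notice this line, logged right after the Worker actor creation task is submitted:

I1016 08:55:07.500252 70924 193117632 worker_pool.cc:180] Job config of job 01000000 are not local yet.

The Worker task carries job_id=01000000, the job of the first script's driver, even though it was triggered by the second driver (job 02000000). Job 01000000 had already gone through HandleJobFinished at 08:55:02, so the raylet apparently no longer has its job config, which would explain why the second script blocks.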
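To spot this pattern without reading the whole log, a small helper can scan raylet.out for tasks submitted under a job that has already finished. This is only a sketch: the function name `tasks_after_job_finished` is made up for illustration, and the two regular expressions assume the exact log wording shown in the excerpt above (`HandleJobFinished <job id>` and `Submitting task: ... class_name=..., ... job_id=...`), which other Ray versions may not use.

```python
import re

# Patterns copied from the wording of the raylet.out excerpt above.
# Assumption: other Ray versions may phrase these log lines differently.
JOB_FINISHED = re.compile(r"HandleJobFinished (\w+)")
TASK_SUBMITTED = re.compile(r"Submitting task: .*?class_name=(\w+).*?job_id=(\w+)")


def tasks_after_job_finished(lines):
    """Return (class_name, job_id) for each task submitted after its job finished."""
    finished_jobs = set()
    late_tasks = []
    for line in lines:
        finished = JOB_FINISHED.search(line)
        if finished:
            # Remember every job the raylet has already torn down.
            finished_jobs.add(finished.group(1))
            continue
        submitted = TASK_SUBMITTED.search(line)
        if submitted and submitted.group(2) in finished_jobs:
            late_tasks.append((submitted.group(1), submitted.group(2)))
    return late_tasks
```

Passing an open file handle for raylet.out as `lines` should be enough, since the function only iterates over lines. On the log above it would be expected to flag the Worker actor creation task under job 01000000 and not the earlier Controller one.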

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels

P0: Issues that should be fixed in short order
bug: Something that is supposed to be working; but isn't
release-blocker: P0 Issue that blocks the release
