Skip to content

Duplicate tasks get queued at the same raylet #4765

@stephanie-wang

Description

@stephanie-wang

Describe the problem

A common pattern in the backend is for the raylet to remove a task from the queue, schedule an async callback that has the task in its closure, then re-queue the task in the callback. This means that the task will temporarily appear to not be in the local SchedulingQueue. This can cause problems if the task gets resubmitted to the same node, since then the raylet will accept a duplicate copy of the task. There are many CHECKs elsewhere in the code that protect against duplicate tasks at the same node, so this problem often manifests as a raylet crash.

Source code / logs

This is likely causing some of the travis test timeouts for python/ray/tests/test_component_failures.py::test_raylet_failed[ray_start_cluster0]

Here is an example raylet.err log from running that test with RAY_BACKEND_LOG_LEVEL=debug. Note that task 00000000ecaca2002e7da8c800dd48fb69733872 gets forwarded to this node twice.

I0511 00:43:07.503075 10918 node_manager.cc:1155] Received forwarded task 00000000ecaca2002e7da8c800dd48fb69733872 on node 3acbb299ebb58c710779d41104dcbe19f3794bba spillback=10
I0511 00:43:07.503096 10918 node_manager.cc:1361] Submitting task: task_id=00000000ecaca2002e7da8c800dd48fb69733872, actor_id=NIL_ID, actor_creation_id=NIL_ID, actor_handle_id=NIL_ID, actor_counter=0, task_descriptor=ray.tests.test_component_failures,,g on node 3acbb299ebb58c710779d41104dcbe19f3794bba
I0511 00:43:07.503108 10918 lineage_cache.cc:226] Add waiting task 00000000ecaca2002e7da8c800dd48fb69733872 on 3acbb299ebb58c710779d41104dcbe19f3794bba
I0511 00:43:07.503129 10918 scheduling_queue.cc:324] Added task 00000000ecaca2002e7da8c800dd48fb69733872 to ready queue
I0511 00:43:07.503155 10918 node_manager.cc:1642] Assigning task 00000000ecaca2002e7da8c800dd48fb69733872 to worker with pid 10928
I0511 00:43:07.503250 10918 scheduling_queue.cc:218] Removed task 00000000ecaca2002e7da8c800dd48fb69733872 from ready queue
I0511 00:43:07.503307 10918 node_manager.cc:1140] [NodeManager] Message ForwardTaskRequest(15) from node manager
I0511 00:43:07.503331 10918 node_manager.cc:1155] Received forwarded task 00000000ecaca2002e7da8c800dd48fb69733872 on node 3acbb299ebb58c710779d41104dcbe19f3794bba spillback=8
I0511 00:43:07.503350 10918 node_manager.cc:1361] Submitting task: task_id=00000000ecaca2002e7da8c800dd48fb69733872, actor_id=NIL_ID, actor_creation_id=NIL_ID, actor_handle_id=NIL_ID, actor_counter=0, task_descriptor=ray.tests.test_component_failures,,g on node 3acbb299ebb58c710779d41104dcbe19f3794bba
I0511 00:43:07.503361 10918 lineage_cache.cc:226] Add waiting task 00000000ecaca2002e7da8c800dd48fb69733872 on 3acbb299ebb58c710779d41104dcbe19f3794bba
W0511 00:43:07.503371 10918 node_manager.cc:1378] Task 00000000ecaca2002e7da8c800dd48fb69733872 already in lineage cache. This is most likely due to reconstruction.
I0511 00:43:07.503382 10918 scheduling_queue.cc:324] Added task 00000000ecaca2002e7da8c800dd48fb69733872 to ready queue
I0511 00:43:07.503403 10918 node_manager.cc:1642] Assigning task 00000000ecaca2002e7da8c800dd48fb69733872 to worker with pid 10927
I0511 00:43:07.504166 10918 scheduling_queue.cc:218] Removed task 00000000ecaca2002e7da8c800dd48fb69733872 from ready queue
I0511 00:43:07.504207 10918 node_manager.cc:683] [Worker] Message PushProfileEventsRequest(19) from worker with PID 11024
I0511 00:43:07.505379 10918 lineage_cache.cc:272] Remove waiting task 0000000085d00d793d0998157d81ee0d05968167 on 3acbb299ebb58c710779d41104dcbe19f3794bba
I0511 00:43:07.505475 10918 object_manager.cc:68] Object added 010000005864d33cad183d5631ffafef05e5f6e1
I0511 00:43:07.505493 10918 object_directory.cc:81] Reporting object added to GCS 010000005864d33cad183d5631ffafef05e5f6e1
I0511 00:43:07.505547 10918 node_manager.cc:1998] Object local 010000005864d33cad183d5631ffafef05e5f6e1,  on 3acbb299ebb58c710779d41104dcbe19f3794bba0 tasks ready
I0511 00:43:07.505575 10918 node_manager.cc:683] [Worker] Message PushProfileEventsRequest(19) from worker with PID 10928
I0511 00:43:07.537077 10918 object_manager.cc:743] [ObjectManager] Message PullRequest(4) from object manager
I0511 00:43:07.537104 10918 object_manager.cc:345] Push on 3acbb299ebb58c710779d41104dcbe19f3794bba to 01b529e6a8730a3fc0707e78415a635c4c600ba5 of object 01000000ec679c11c8ab00fb10e45659754db323
I0511 00:43:07.537138 10924 object_manager.cc:444] ExecuteSendObject on 3acbb299ebb58c710779d41104dcbe19f3794bba to 01b529e6a8730a3fc0707e78415a635c4c600ba5 of object 01000000ec679c11c8ab00fb10e45659754db323 chunk 0
I0511 00:43:07.537652 10918 lineage_cache.cc:257] Add ready task 00000000ecaca2002e7da8c800dd48fb69733872 on 3acbb299ebb58c710779d41104dcbe19f3794bba
I0511 00:43:07.537700 10918 scheduling_queue.cc:324] Added task 00000000ecaca2002e7da8c800dd48fb69733872 to running queue
I0511 00:43:07.537726 10918 lineage_cache.cc:257] Add ready task 00000000ecaca2002e7da8c800dd48fb69733872 on 3acbb299ebb58c710779d41104dcbe19f3794bba
W0511 00:43:07.537771 10918 node_manager.cc:1727] Task 00000000ecaca2002e7da8c800dd48fb69733872 already in lineage cache. This is most likely due to reconstruction.
I0511 00:43:07.537786 10918 scheduling_queue.cc:324] Added task 00000000ecaca2002e7da8c800dd48fb69733872 to running queue
F0511 00:43:07.537799 10918 scheduling_queue.cc:55]  Check failed: task_map_.find(task_id) == task_map_.end() 
*** Check failure stack trace: ***
*** Aborted at 1557535387 (unix time) try "date -d @1557535387" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7d000002aa6) received by PID 10918 (TID 0x7efd7ac4d780) from PID 10918; stack trace: ***
    @     0x7efd7a833330 (unknown)
    @     0x7efd79a68c37 gsignal
    @     0x7efd79a6c028 abort
    @           0x4e6f09 google::logging_fail()
    @           0x4e8c3a google::LogMessage::Fail()
    @           0x4e9f43 google::LogMessage::SendToLog()
    @           0x4e8962 google::LogMessage::Flush()
    @           0x4e8b51 google::LogMessage::~LogMessage()
    @           0x4e61f2 ray::RayLog::~RayLog()
    @           0x450907 ray::raylet::TaskQueue::AppendTask()
    @           0x44f520 ray::raylet::SchedulingQueue::QueueTasks()
    @           0x46e118 _ZZN3ray6raylet11NodeManager10AssignTaskERKNS0_4TaskEENUlNS_6StatusEE_clES5_
    @           0x46e5fd _ZNSt17_Function_handlerIFvRKN3ray6StatusEEZNS0_6raylet11NodeManager10AssignTaskERKNS5_4TaskEEUlS1_E_E9_M_invokeERKSt9_Any_dataS3_
    @           0x4dfd45 _ZZN3ray16ServerConnectionIN5boost4asio5local15stream_protocolEE13DoAsyncWritesEvENKUlRKNS_6StatusEiE_clES8_i
    @           0x4dfe4f _ZZN3ray16ServerConnectionIN5boost4asio5local15stream_protocolEE13DoAsyncWritesEvENKUlRKNS1_6system10error_codeEmE0_clES9_m
    @           0x4e04ad _ZN5boost4asio6detail8write_opINS0_19basic_stream_socketINS0_5local15stream_protocolEEESt6vectorINS0_12const_bufferESaIS8_EEN9__gnu_cxx17__normal_iteratorIPKS8_SA_EENS1_14transfer_all_tEZN3ray16ServerConnectionIS5_E13DoAsyncWritesEvEUlRKNS_6system10error_codeEmE0_EclESN_mi
    @           0x4e0780 _ZN5boost4asio6detail23reactive_socket_send_opINS1_16prepared_buffersINS0_12const_bufferELm64EEENS1_8write_opINS0_19basic_stream_socketINS0_5local15stream_protocolEEESt6vectorIS4_SaIS4_EEN9__gnu_cxx17__normal_iteratorIPKS4_SD_EENS1_14transfer_all_tEZN3ray16ServerConnectionIS9_E13DoAsyncWritesEvEUlRKNS_6system10error_codeEmE0_EEE11do_completeEPvPNS1_19scheduler_operationESQ_m
    @           0x424bed boost::asio::detail::scheduler::run()
    @           0x40e1b1 main
    @     0x7efd79a53f45 __libc_start_main
    @           0x41f751 (unknown)

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions