[core] Avoid resubmitted actor tasks from hanging indefinitely#51904
jjyao merged 25 commits into ray-project:master from
Conversation
src/ray/core_worker/task_manager.cc
Why do we still keep this case instead of always updating seq_no?
I think both should be fine. If an actor died, the new actor should not reject the tasks with the same seq_no.
In the case of ACTOR_DIED, it would maintain the original order of execution in some cases (e.g. if multiple tasks need to get resubmitted).
A very, very minor edge case though -- I would be fine with dropping this as well to unify the failure handling.
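To make the failure mode concrete, here is a minimal sketch (hypothetical names, NOT Ray's actual task_manager logic) of an actor-side queue that deduplicates by sequence number. It shows why resubmitting a task with its original seq_no can hang the caller: the retry looks like a stale duplicate and is silently dropped.

```python
# Illustrative model only; Ray's real implementation lives in
# src/ray/core_worker/task_manager.cc and differs in detail.

class ActorTaskQueue:
    def __init__(self):
        self.next_seq_no = 0  # next sequence number this actor will accept

    def submit(self, seq_no, fn):
        if seq_no < self.next_seq_no:
            # Stale/duplicate seq_no: dropped. A caller blocked in
            # ray.wait on the retry's result would hang forever.
            return None
        self.next_seq_no = seq_no + 1
        return fn()

queue = ActorTaskQueue()
first = queue.submit(0, lambda: "task1")        # executes normally
retry = queue.submit(0, lambda: "task1 retry")  # rejected as a duplicate
```

The question above is whether this rejection should be skipped when the actor has been restarted, since a fresh actor has no duplicates to guard against.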
It doesn't seem to support dynamic configuration.
What dynamic configuration do you need? This is a simple chaos framework written by us, so we can enhance it if it's missing features.
There's another issue related to task retrying and actor restarts. I may create a separate PR to fix it, since the current PR description already contains too much information.
Force-pushed from 6e77fa6 to 55607b9
@kevin85421 @jjyao 'Driver resubmits task1 again with the same sequence number (i.e. seq_no=0)' Could the driver's resubmission conflict with an actor task's failure retry (triggered because the actor restarted)? After I used this commit and ran a very simple Ray Data job (where the worker pod frequently crashed due to preemption), an error was thrown:
Could you provide a repro? Which exact commit were you using?
What does "conflict" refer to? A reproduction would be helpful.
@kevin85421 It is a simple Ray Data job (read_parquet -> map_batch -> map_batch -> write_parquet) with workers continuously terminating (pods are forcibly preempted) and new workers joining. The stack trace is:
Could you tell me the exact commit? It seems the log line numbers don't match master.
Some irrelevant debug logs.
@jjyao @kevin85421
Why are these changes needed?
We observe an issue where `ray.wait` hangs indefinitely in the following case: the driver resubmits a task with its original sequence number (seq_no=0).

ray/src/ray/core_worker/task_manager.cc, lines 678 to 685 in c111d99
3 possible solutions

- Solution 1: Always increment seq_no during resubmission.
- Solution 2: Resubmit with the same seq_no first; if that fails, resubmit with a new seq_no.
- Solution 3: The actor caches the objects until the driver says the objects are unnecessary.
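Solution 1 can be sketched as follows (hypothetical names, not Ray internals): the submitter always draws a fresh, monotonically increasing seq_no when resubmitting, so the actor, restarted or not, never rejects the retry as stale.

```python
import itertools

class ActorTaskQueue:
    """Illustrative actor side: drops tasks whose seq_no is stale."""
    def __init__(self):
        self.next_seq_no = 0

    def submit(self, seq_no, fn):
        if seq_no < self.next_seq_no:
            return None  # stale duplicate, silently dropped
        self.next_seq_no = seq_no + 1
        return fn()

class Submitter:
    """Illustrative driver side implementing Solution 1."""
    def __init__(self, queue):
        self.queue = queue
        self._seq = itertools.count()  # driver-side sequence counter

    def submit(self, fn):
        return self.queue.submit(next(self._seq), fn)

    def resubmit(self, fn):
        # Solution 1: always increment seq_no instead of reusing the
        # original one, so the retry can never be mistaken for a dup.
        return self.queue.submit(next(self._seq), fn)

submitter = Submitter(ActorTaskQueue())
first = submitter.submit(lambda: "task1")          # seq_no=0, accepted
retry = submitter.resubmit(lambda: "task1 retry")  # seq_no=1, accepted
```

The trade-off, as noted in the conclusion, is that a fresh seq_no no longer encodes the original submission order, which matters once lineage reconstruction is involved.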
Conclusion
None of the 3 solutions guarantees the execution order if we take lineage reconstruction into consideration. We choose Solution 1 because it is the simplest.
Related issue number
Checks

To speed up the reproduction, you need to update the two configs:
The reproduction uses iptables to block the connections for HandlePushTask RPC between the driver and actor, so the driver can't receive the RPC reply from the actor after the network policies are applied.
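A rough sketch of that setup is below. The specific values and the port variable are assumptions for illustration, not the exact commands from the reproduction; the two keepalive configs are split here only so that they sum to the 45 s mentioned below.

```shell
# ASSUMED values: shrink the gRPC keepalive windows so the hang
# reproduces in ~45 s rather than with the defaults.
export RAY_grpc_client_keepalive_time_ms=30000
export RAY_grpc_client_keepalive_timeout_ms=15000

# Block replies from the actor worker's gRPC port so the driver never
# receives the HandlePushTask response. PORT is hypothetical; find the
# actual worker port in the worker's logs. Requires root.
iptables -A INPUT -p tcp --sport "$PORT" -j DROP
```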
The driver thinks the actor is "unavailable" because of the gRPC keepalive watchdog timeout. In the worst case, this should happen after grpc_client_keepalive_time_ms + grpc_client_keepalive_timeout_ms (i.e. 45 s in the above case).

Without this PR

`ray.wait` is stuck forever.

With this PR
The resubmitted task will be executed after the old one finishes, and the driver receives 0 ~ 7 from the first task and 8 ~ 19 from the resubmitted one.
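The observed behavior can be modeled with a plain generator (illustrative only, not Ray code): the first execution streams indices 0 through 7 before the network is blocked, and the resubmitted task, which runs only after the old one finishes, supplies 8 through 19.

```python
def stream_task(start, stop):
    # Stand-in for a streaming actor task yielding one result per index.
    for i in range(start, stop):
        yield i

received = list(stream_task(0, 8))    # 0 ~ 7 from the first execution
received += list(stream_task(8, 20))  # 8 ~ 19 from the resubmitted task
assert received == list(range(20))    # driver sees every result exactly once
```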