[core] Avoid resubmitted actor tasks from hanging indefinitely#51904

Merged
jjyao merged 25 commits into ray-project:master from kevin85421:20250329-devbox1-tmux4-ray
Apr 8, 2025

Conversation

@kevin85421
Member

@kevin85421 kevin85421 commented Apr 2, 2025

Why are these changes needed?

We observed an issue where ray.wait hangs indefinitely in the following case:

  • The driver sends a Ray task (task1) with a streaming generator enabled to actor A, and the task's sequence number is 0.
  • Actor A receives task1.
  • Actor A somehow becomes unavailable from the driver's perspective.
  • The driver resubmits task1 with the same sequence number (i.e., seq_no=0).
  • Actor A finishes the first task1 and sends a ReportGeneratorItemReturns RPC to the driver.
  • The driver replies with a Status::NotFound to actor A because the driver has already made another attempt, and the previous attempt is now outdated.

    ```cpp
    if (it->second.spec.AttemptNumber() > attempt_number) {
      // Generator task reports can arrive at any time. If the first attempt
      // fails, we may receive a report from the first executor after the
      // second attempt has started. In this case, we should ignore the first
      // attempt.
      execution_signal_callback(
          Status::NotFound("Stale object reports from the previous attempt."), -1);
      return false;
    }
    ```

  • Actor A receives the second task1. However, the actor will not execute the request because the sequence number is the same.
  • The driver keeps resubmitting task1 with seq_no=0, while the actor repeatedly cancels the tasks.
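
The retry loop above can be illustrated with a toy simulation. This is a hypothetical Python sketch of the two checks involved (the actor's seq_no de-duplication and the driver's stale-attempt rejection), not Ray's actual classes:

```python
class Actor:
    """Toy model: the actor drops any request whose seq_no it already saw."""

    def __init__(self):
        self.seen_seq_nos = set()

    def push_task(self, seq_no):
        if seq_no in self.seen_seq_nos:
            return "cancelled"        # duplicate: never executed
        self.seen_seq_nos.add(seq_no)
        return "executing"


class Driver:
    """Toy model: resubmission reuses seq_no=0 but bumps the attempt number."""

    def __init__(self):
        self.attempt_number = 0

    def resubmit(self):
        self.attempt_number += 1
        return 0                      # same seq_no every time (pre-fix behavior)

    def handle_generator_report(self, report_attempt):
        # Mirrors the AttemptNumber() > attempt_number check above.
        if report_attempt < self.attempt_number:
            return "NotFound"         # stale report from a previous attempt
        return "OK"


actor, driver = Actor(), Driver()
assert actor.push_task(0) == "executing"           # first submission runs

# Network blip: the driver resubmits; the actor keeps cancelling forever,
# and the first attempt's reports are now rejected as stale.
for _ in range(3):
    seq_no = driver.resubmit()
    assert actor.push_task(seq_no) == "cancelled"
    assert driver.handle_generator_report(0) == "NotFound"
```

Neither side makes progress: the actor never re-executes, and the driver never accepts the first attempt's results, which is the hang.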

Three possible solutions

Solution 1: Always increment seq_no during resubmissions

  • Pros:
    • Simplest implementation.
    • Solves a slightly narrower problem than Solution 2, but the implementation is much simpler.
  • Cons:
    • Doesn't guarantee the execution order of retryable actor tasks.
  • Examples
    • Case 1
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • Actor unavailable.
      • The actor doesn’t receive task A.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=4) and updates client_processed_up_to_.
      • The actor executes B, C, D, A.
    • Case 2
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • The actor receives task A.
      • Actor unavailable.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=4).
      • The actor executes A, B, C, D, A.
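
Solution 1's Case 2 can be sketched in the same toy model (hypothetical Python, not Ray's implementation): the driver draws a fresh seq_no from its send counter on every resubmission, so the actor's de-duplication no longer drops the retry, at the cost of A possibly running twice.

```python
class Actor:
    def __init__(self):
        self.seen_seq_nos = set()
        self.executed = []

    def push_task(self, name, seq_no):
        if seq_no in self.seen_seq_nos:
            return "cancelled"
        self.seen_seq_nos.add(seq_no)
        self.executed.append(name)
        return "executing"


class Driver:
    def __init__(self):
        self.next_seq_no = 0

    def submit(self, actor, name):
        # Solution 1: every (re)submission gets a fresh sequence number.
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        return actor.push_task(name, seq_no)


actor, driver = Actor(), Driver()
driver.submit(actor, "A")            # seq_no=0: actor receives and runs A
for t in ("B", "C", "D"):
    driver.submit(actor, t)          # seq_no=1..3: queued tasks run

# The driver believes the actor is unavailable and resubmits A with seq_no=4;
# the actor accepts it instead of cancelling it.
assert driver.submit(actor, "A") == "executing"
assert actor.executed == ["A", "B", "C", "D", "A"]   # A runs twice (Case 2)
```

This illustrates the trade-off named in the cons: liveness is restored, but ordering becomes at-least-once rather than exactly-in-order.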

Solution 2: Resubmit with the same seq_no first. If this fails, resubmit with a new seq_no

  • Pros:
    • If the actor hasn't received any of the tasks, resubmission preserves the original execution order.
  • Cons:
    • Doesn't guarantee the execution order of actor tasks if the actor has already received some tasks before the network issues.
    • The complexity is much higher than Solution 1. A previous PR attempted to fix this but was never merged (Introducing StaleTaskError #46705).
  • Examples
    • Case 1
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • Actor unavailable.
      • The actor doesn’t receive task A.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=0).
      • The actor executes A, B, C, D.
    • Case 2
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • The actor receives task A.
      • Actor unavailable.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=0).
      • The actor cancels the second task A.
      • The driver resubmits task A (seq_no=4).
      • The actor executes A, B, C, D, A.

Solution 3: The actor caches the results until the driver confirms they are no longer needed.

  • Pros
    • Guarantees the execution order.
  • Cons
    • Complex.
    • Increases the risk of OOM errors because actors need to cache the results. This could be mitigated with some kind of backpressure (e.g., if more than N results are pending due to network instability, pause generator execution), but that increases the complexity further.
  • Examples
    • Case 1
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • Actor unavailable.
      • The actor doesn’t receive task A.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=0).
      • The actor executes A, B, C, D.
    • Case 2
      • The driver submits task A (seq_no=0) to an actor, and B, C, D are in the queue.
      • The actor receives task A.
      • Actor unavailable.
      • The network becomes healthy.
      • The driver resubmits task A (seq_no=0).
      • The actor executes A, B, C, D. When the actor receives the second task A, it returns the cached result to the driver.
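
Solution 3's replay behavior can be sketched as follows (hypothetical Python; names like `CachingActor` and `ack` are illustrative, not Ray APIs): the actor executes a de-duplicated task once, caches the result by task id, and replays the cached result for any later duplicate until the driver acknowledges it.

```python
class CachingActor:
    def __init__(self):
        self.cache = {}                  # task_id -> cached result

    def push_task(self, task_id, fn):
        if task_id in self.cache:        # duplicate: replay instead of cancel
            return self.cache[task_id]
        result = fn()                    # first submission: execute once
        self.cache[task_id] = result
        return result

    def ack(self, task_id):
        # The driver confirms it received the result; the actor can free it.
        # This is the step whose backlog can cause the OOM risk noted above.
        self.cache.pop(task_id, None)


actor = CachingActor()
first = actor.push_task("task1", lambda: 42)    # executed
second = actor.push_task("task1", lambda: 99)   # not executed: cached replay
assert first == second == 42
actor.ack("task1")                              # driver releases the cache
assert "task1" not in actor.cache
```

The cache is exactly what grows without bound under prolonged network instability, which is why the cons mention backpressure on pending results.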

Conclusion

None of the three solutions guarantees the execution order once lineage reconstruction is taken into consideration. Solution 1 was chosen because it is the simplest.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

To speed up the reproduction, update these two configs:

```cpp
// old: 300s, new: 30s
RAY_CONFIG(int64_t, grpc_client_keepalive_time_ms, 30000)
// old: 120s, new: 15s
RAY_CONFIG(int64_t, grpc_client_keepalive_timeout_ms, 15000)
```

The reproduction uses iptables to block the connection for the HandlePushTask RPC between the driver and the actor, so the driver can't receive the RPC reply from the actor after the network policies are applied.

The driver thinks the actor is "unavailable" because of the gRPC keepalive watchdog timeout. In the worst case, this happens after grpc_client_keepalive_time_ms + grpc_client_keepalive_timeout_ms (i.e., 45 s in the configuration above).
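
As a sanity check on that bound (plain arithmetic, using the repro values above):

```python
# Worst case: the channel goes idle just after a ping, so the client waits a
# full keepalive interval before pinging, then a full timeout for the reply.
grpc_client_keepalive_time_ms = 30_000     # repro value (default 300 s)
grpc_client_keepalive_timeout_ms = 15_000  # repro value (default 120 s)

worst_case_ms = grpc_client_keepalive_time_ms + grpc_client_keepalive_timeout_ms
assert worst_case_ms == 45_000             # 45 s, as stated above
```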

```shell
RAY_BACKEND_LOG_LEVEL=debug python3 test.py
./network_policy.sh -A

# Wait until the actor is unavailable. Run the following command until the
# core-driver log file has some related logs.
grep -rn "UNAVAIL"

# Remove the network policy
./network_policy.sh -D

# Wait until the first actor task finishes and the io context starts to
# handle the resubmitted task.
```

Without this PR

ray.wait gets stuck forever.


With this PR

The resubmitted task is executed after the old one finishes, and the driver receives 0–7 from the first task and 8–19 from the resubmitted one.


@kevin85421 kevin85421 added the go (add ONLY when ready to merge, run all tests) label Apr 2, 2025
@kevin85421
Member Author

Next steps:

  1. Be more conservative about failing in-flight requests.
  2. Consider shutting down the running task if its output will no longer be consumed.
  3. How to add a test? Currently, it requires setting up iptables.

@kevin85421 kevin85421 marked this pull request as ready for review April 2, 2025 09:02
@jjyao
Contributor

jjyao commented Apr 2, 2025

How to add a test? Currently, it requires to setup iptables.

Check out rpc_chaos.cc

Contributor


Why do we still keep this case instead of always updating seq_no?

Member Author


I think both should be fine. If an actor died, the new actor should not reject the tasks with the same seq_no.

Collaborator


In the case of ACTOR_DIED, it would maintain the original order of execution in some cases (if multiple tasks need to get resubmitted)

Very very minor edge case though -- I would be fine with dropping this as well to unify the failure handling.

@kevin85421
Member Author

Check out rpc_chaos.cc

It doesn't seem to support dynamic configuration.

@jjyao
Contributor

jjyao commented Apr 2, 2025

It doesn't seem to support dynamic configuration.

What dynamic configuration do you need? This is a simple chaos framework we wrote, so we can enhance it if it's missing features.

@kevin85421
Member Author

There’s another issue related to task retrying and actor restarts. I may create a separate PR to fix it, since the current PR description already contains too much information.

@kevin85421 kevin85421 changed the title [core] Avoid resubmitted actor tasks from hanging indefinitely [WIP][core] Avoid resubmitted actor tasks from hanging indefinitely Apr 3, 2025
@kevin85421 kevin85421 marked this pull request as draft April 3, 2025 08:28
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 force-pushed the 20250329-devbox1-tmux4-ray branch from 6e77fa6 to 55607b9 Compare April 4, 2025 04:46
@jjyao jjyao enabled auto-merge (squash) April 8, 2025 22:06
@dragongu
Contributor

dragongu commented Apr 21, 2025

@kevin85421 @jjyao "Driver resubmits task1 again with the same sequence number (i.e. seq_no=0)"

Could the driver's resubmission conflict with an actor task's failure retry (because the actor restarted)?

After I used this commit and ran a very simple Ray Data job (where the worker pod would frequently crash due to being preempted), an error was thrown:
task_manager.cc:1412: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT , task ID = 6ff6fe559f63b1b8b015cbfb3a695db0935ce25820000000, status = 1

@jjyao
Contributor

jjyao commented Apr 22, 2025

@dragongu

Could you provide a repro? Which exact commit were you using?

@kevin85421
Member Author

Could the resubmission of the driver conflict with actor task failed retry(because the actor restarted) ?

What does "conflict" refer to?

A reproduction would be helpful.

@dragongu
Contributor

dragongu commented Apr 22, 2025

@dragongu

Could you provide a repro? Which exact commit you were using?

@jjyao Use the master branch

@dragongu
Contributor

dragongu commented Apr 22, 2025

Could the resubmission of the driver conflict with actor task failed retry(because the actor restarted) ?

what does conflict refer to?

A reproduction would be helpful.

@kevin85421
The scenario is like this:

It is a simple Ray Data job, read_parquet -> map_batch -> map_batch -> write_parquet, with workers continuously terminating (pods are forcibly preempted) and new workers joining. The stack trace is:

```
[2025-04-20 16:41:18,448 C 818365 821947] task_manager.cc:1416: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT , task ID = eccd8f51d8b690fddce51415fb4280436e4d986741000000, status = 11
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xaf0568) [0x7f9aa0f94568] ray::core::ActorTaskSubmitter::PushActorTask()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskWaitingForExecutionERKNS_6TaskIDERKNS_6NodeIDERKNS_8WorkerIDE+0x3b5) [0x7f9aa0faaf25] ray::core::TaskManager::MarkTaskWaitingForExecution()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x111) [0x7f9aa1a6fd81] ray::RayLog::~RayLog()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x15c7417) [0x7f9aa1a6b417] ray::operator<<()
*** StackTrace Information ***
```

@jjyao
Contributor

jjyao commented Apr 22, 2025

Could you tell me the exact commit? It seems the log line number doesn't match master.

@dragongu
Contributor

Could you tell me the exact commit? Seems the log line number doesn't match the master.

Some irrelevant debug logs

@dragongu
Contributor

dragongu commented Apr 22, 2025

@jjyao @kevin85421
Hi, I suspect that the failed check might occur in the following scenario:

  1. The actor restarts.
  2. The actor task fails because the actor is unavailable.
  3. The CoreWorker retries the task in a scheduled thread (InternalHeartbeat); the task runs successfully and updates its status to 11 (FINISHED).
  4. The actor becomes alive again, and TaskManager::MarkTaskWaitingForExecution() is executed, which then triggers the check failure.
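
The suspected race above can be sketched as a toy state machine (hypothetical Python; the status constants mirror the numbers quoted in the logs, and this is not Ray's real code):

```python
PENDING_NODE_ASSIGNMENT = 1   # illustrative value
FINISHED = 11                 # value quoted in the crash log


class TaskManager:
    def __init__(self):
        self.status = PENDING_NODE_ASSIGNMENT

    def retry_from_internal_heartbeat(self):
        # Step 3: the scheduled retry runs the task to completion first.
        self.status = FINISHED

    def mark_task_waiting_for_execution(self):
        # Step 4: models the CHECK that crashes in task_manager.cc.
        if self.status != PENDING_NODE_ASSIGNMENT:
            raise AssertionError(f"Check failed: status = {self.status}")


tm = TaskManager()
tm.retry_from_internal_heartbeat()        # retry finishes the task...
try:
    tm.mark_task_waiting_for_execution()  # ...then the actor comes alive
    crashed = False
except AssertionError:
    crashed = True
assert crashed                            # the stale transition is fatal
```

If this reading is right, the fix would need to make the transition tolerant of a task that has already finished, rather than asserting on its status.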
