Introducing StaleTaskError #46705

Closed
rynewang wants to merge 9 commits into ray-project:master from rynewang:always-update-seqno

Conversation

@rynewang
Contributor

@rynewang rynewang commented Jul 19, 2024

To invoke methods on an actor, the caller maintains a seqno-indexed task queue to guarantee that the task invocation order matches the task submission order. When the actor restarts, pending tasks are resent to the new actor with the same seqnos to preserve the order.

However, for an unavailable actor, after it reconnects the caller may discover that the tasks were in fact delivered to the actor and executed; the actor just did not get a chance to reply before the connection broke. The actor then rejects the resent tasks as "stale tasks", meaning their seqnos are below the current low watermark the actor is waiting for.

The caller should not always update the seqno either: the actor may never have received the previous attempt's seqno, and would then wait for it forever, hanging.

The crux of the issue is that when a connection breaks and recovers, the caller has no way of knowing whether the actor received the previous task attempts or not. If it did, the caller should update the seqno; otherwise, it should not.

So the only way forward is to ask the actor for an answer. Hence the new protocol: on connection break and recovery, the caller always retries the task without updating the seqno.

  • If the actor replies "Stale Task" -> the actor received the previous attempt -> the caller updates the seqno and retries. [1]
  • If the actor replies otherwise (OK or another error) -> the actor never received the previous attempt and treats this task as a fresh one -> it just runs.

One note on [1]: this makes the caller retry one more time. That extra attempt should not consume a retry, because the first attempt never reached user code.
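The protocol above can be sketched in a few lines of Python. This is a hypothetical simulation, not Ray's internals: `FakeActor` and `retry_after_reconnect` are illustrative names, and only `StaleTaskError` mirrors the status this PR introduces.

```python
class StaleTaskError(Exception):
    """Actor-side rejection: the seqno is below the actor's low watermark."""

class FakeActor:
    def __init__(self):
        self.low_watermark = 0   # next seqno the actor is waiting for
        self.executed = []

    def submit(self, seqno, payload):
        if seqno < self.low_watermark:
            raise StaleTaskError(seqno)   # already executed before the break
        self.low_watermark = seqno + 1
        self.executed.append(payload)
        return "ok"

def retry_after_reconnect(actor, seqno, payload, fresh_seqno):
    """Caller-side retry after a connection break and recovery."""
    try:
        # New protocol: the first retry always reuses the old seqno ("no update").
        return actor.submit(seqno, payload)
    except StaleTaskError:
        # The actor saw the previous attempt: retry with a fresh seqno. Per the
        # note above, this extra attempt would not consume a user-visible retry,
        # since the first attempt never reached user code.
        return actor.submit(fresh_seqno, payload)

actor = FakeActor()
actor.submit(0, "task-0")             # executed, but the reply is lost
# Reconnect: retry with the old seqno 0, which the actor now calls stale.
result = retry_after_reconnect(actor, 0, "task-0 (retry)", fresh_seqno=1)
print(result, actor.executed)
```

If instead the actor had never received seqno 0, the first `submit` in `retry_after_reconnect` would simply succeed, which is exactly the "fresh task" branch of the protocol.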

Changes:

  • Defined the behavior of actor method retry vs. ordering, and updated the docs.
  • Introduced a new internal Status, StaleTaskError, which the actor sends to the caller when it finds the seqno has already been executed.
  • The caller retries the stale task with an updated seqno, without consuming a retry count.
  • Refactoring: employ RAY_LOG().WithField in many places.
  • Refactoring: added CoreWorker::RetryTask to consolidate code from two places (the wait queue path vs. immediate invocation).
  • Test improvement: removed "log_to_driver": False in test_unavailable_actors.py.

Fixes #46538.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Jul 19, 2024
rynewang added 4 commits July 20, 2024 15:32
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang changed the title from "WIP always update seqno" to "Introducing StaleTaskError" Jul 22, 2024
rynewang added 3 commits July 22, 2024 15:38
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang marked this pull request as ready for review July 23, 2024 18:02
@rynewang rynewang requested review from a team, pcmoritz and raulchen as code owners July 23, 2024 18:02
@jjyao
Contributor

jjyao commented Jul 26, 2024

Isn't there already client_processed_up_to that does the job?

@rynewang
Contributor Author

This depends on a clarification of the policy on method retry ordering.

| Retry case | Policy | Current behavior (no seqno update on unavailable) | In this PR (stale tasks retried anew) | Proposal: always update seqno on unavailable |
|---|---|---|---|---|
| User exception | No keep order; retry at end of queue | No keep order; retry at end of queue | No keep order; retry at end of queue | No keep order; retry at end of queue |
| Actor died | Keep order | Keep order | Keep order | Keep order |
| Actor unavailable (died) | ? | Keep order | Keep order | No keep order; retry at end of queue |
| Actor unavailable (recovered, never received the task) | ? | Keep order | Keep order | No keep order; retry at end of queue |
| Actor unavailable (recovered, received the task) | Can't keep order: actor may have executed later tasks | hang | No keep order; retry at end of queue | No keep order; retry at end of queue |

@rynewang
Contributor Author

friendly ping on this @rkooo567

@rkooo567
Contributor

Sorry, ETA today morning. I should also have much more bandwidth for code review from next week, when I focus on one team...

Contributor

@rkooo567 rkooo567 left a comment

Is it impossible to just always preserve the task order? From the user's perspective, the submission order they defined is the only thing that matters, and I think it should always be preserved (regardless of actor crashes or connection failures). See some of my comments below (and let's discuss in person if I'm missing something).

["actor", "task", "driver"],
)
@pytest.mark.skipif(sys.platform == "win32", reason="does not work on windows")
@pytest.mark.parametrize("ray_start_regular", [{"log_to_driver": False}], indirect=True)
Contributor

why is it removed?

Contributor Author

I don't think this is useful: it disables logs to the driver, but those logs are useful for debugging.

num_oom_retries_left = it->second.num_oom_retries_left;
if (task_failed_due_to_oom) {
if (task_failed_due_to_stale_task) {
// Task failed due to stale task. This can only happen during actor unavailable
Contributor

assert task is an actor task?

// and when you reconnect and retry, the actor already executed the last attempt
// which we consider as "ActorUnavailable" (attempt #1), and then the caller
// retries the task with the same seqno and got "StaleTask" (attempt #2). This
// attempt #2 does not consume a retry, though it does occupy an attempt number.
Contributor

It should not occupy an attempt number, right? From the end user's perspective, this retry is internal. What's the reason behind this?

return result_runtime_env;
}

void CoreWorker::RetryTask(TaskToRetry &task_to_retry) {
Contributor

Suggested change
void CoreWorker::RetryTask(TaskToRetry &task_to_retry) {
void CoreWorker::RetryTask(const TaskToRetry &task_to_retry) {

} else if (status.IsStaleTaskError()) {
// The task is considered stale by the actor. This can happen when the actor receives
// a task, and the connection broke, and the caller resubmits the task with the
// same seqno. This task may be retried out of order as if it's retryable user
Contributor

I am a little confused about this behavior. Since we reorder the execution based on seqno on the receiver side, it should not be possible to retry this out of order, right?

  1. If the task has been accepted before connection break, it is ignored, so it is not out-of-order retried (it is simply ignored)
  2. if the task has not been accepted before connection break, the task with next seqno shouldn't have been executed yet, so the order is preserved

am I missing something here?

Contributor Author

For case (1), it's currently ignored, which makes the process hang. So we need to retry it out of order with a new seqno.
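As a toy illustration of that point (hypothetical names, not Ray code): if the actor silently drops the retried-but-already-executed task without sending any reply, the caller's wait for that reply never completes.

```python
import queue

reply_q = queue.Queue()

def actor_submit(seqno, low_watermark):
    # "Just ignore" semantics: a stale seqno gets no reply at all.
    if seqno < low_watermark:
        return
    reply_q.put(("ok", seqno))

actor_submit(2, low_watermark=4)    # retry of an already-executed task
try:
    reply_q.get(timeout=0.2)        # caller blocks waiting for the reply...
    hung = False
except queue.Empty:
    hung = True                     # ...which never comes: the caller hangs

print("caller hung:", hung)         # caller hung: True
```

An explicit StaleTaskError reply, as in this PR, turns that silent hang into a signal the caller can act on.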

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 17, 2024
@rynewang
Contributor Author

So the whole point of this PR is: when there's a connection break and the client retries with the same seqno, the server may find the task was already executed (and the reply was lost). Meanwhile, some newer seqnos may already have executed, so it's not possible to retry while keeping the order. Example scenario:

(in time order)

  1. Client: Task1(seqno=1), Task2(seqno=2), Task3(seqno=3) submitted
  2. Server: Executed Task1, replied.
  3. Client: bookkeep Task1 finished.
  4. Server: Executing Task2...
  5. conn break
  6. Server: Task2 executed, failed to reply, but effects are visible to users
  7. Server: Executing Task3...
  8. Server: Task3 executed, failed to reply, but effects are visible to users
  9. Client: reconnect, resend pending tasks: Task2(seqno=2), Task3(seqno=3)
  10. Server: ???

Here, the client has to resend Task2 and Task3, and it's not possible to do so "transparently", because the server already saw the previous requests for Task2 and Task3 and executed them; it just did not get a chance to reply. Since the effect is user visible, I'd say the resend must consume a retry, and it has to be out of order (a retry of Task2 has to happen after Task3, since Task3 has already executed).
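A toy replay of the timeline above (illustrative names, not Ray internals) shows why the retries end up out of submission order:

```python
class StaleTaskError(Exception):
    pass

class Actor:
    def __init__(self):
        self.low_watermark = 1
        self.log = []               # execution order visible to the user

    def submit(self, seqno, name):
        if seqno < self.low_watermark:
            raise StaleTaskError(name)
        self.low_watermark = seqno + 1
        self.log.append(name)

actor = Actor()
actor.submit(1, "Task1")            # reply reaches the client
actor.submit(2, "Task2")            # executed, reply lost in the conn break
actor.submit(3, "Task3")            # executed, reply lost in the conn break

# Client reconnects and resends the pending tasks with their old seqnos.
stale = []
for seqno, name in [(2, "Task2-retry"), (3, "Task3-retry")]:
    try:
        actor.submit(seqno, name)
    except StaleTaskError:
        stale.append(name)          # must be re-run with a fresh seqno

# Re-running Task2 now necessarily happens after Task3 has already run,
# i.e. the retry is out of submission order.
for fresh_seqno, name in enumerate(stale, start=4):
    actor.submit(fresh_seqno, name)

print(actor.log)
# ['Task1', 'Task2', 'Task3', 'Task2-retry', 'Task3-retry']
```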

Let's chat offline about this

@rkooo567
Contributor

Server: Task2 executed, failed to reply, but effects are visible to users

I think I mainly don't understand this part. Why is the effect visible to users? (does it raise an exception?)

Client: reconnect, resend pending tasks: Task2(seqno=2), Task3(seqno=3)

Also, since the server already executed tasks 2 and 3, isn't it possible to just make the resend a no-op? From the user's perspective, isn't the ordering already guaranteed (because tasks 2 and 3 were already executed in the right order)?

@rkooo567
Contributor

But yeah, let's talk in person tomorrow. I think it'd be easier to resolve the discussion!

@stale

stale bot commented Feb 25, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 25, 2025
@jjyao
Contributor

jjyao commented Apr 8, 2025

Decided to go with approach #51904

@jjyao jjyao closed this Apr 8, 2025

Labels

  • @author-action-required: The PR author is responsible for the next step. Remove the tag to send back to the reviewer.
  • go: add ONLY when ready to merge, run all tests
  • stale: The issue is stale. It will be closed within 7 days unless there is further conversation.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] The actor task hangs when it is re-submitted

3 participants