Introducing StaleTaskError #46705

Closed
rynewang wants to merge 9 commits into ray-project:master from rynewang:always-update-seqno

Conversation

@rynewang
Contributor

@rynewang rynewang commented Jul 19, 2024

To invoke methods on an actor, the caller maintains a seqno-indexed task queue to guarantee that the task invocation order matches the task submission order. When the actor restarts, pending tasks are resent to the new actor with the same seqnos to preserve the order.

However, for an unavailable actor, after it reconnects the caller may discover that the tasks were in fact delivered to the actor and executed; the actor just did not get a chance to reply before the connection broke. The actor then rejects the resent tasks as "stale tasks", meaning their seqnos are below the current low watermark the actor is waiting for.

The caller should not always update the seqno either: the actor may never have received the previous attempt's seqno, and would then wait for it forever, hanging.

The crux of the issue is that when a connection breaks and recovers, the caller has no way of knowing whether the actor received the previous task attempts or not. If it did, the caller should update the seqno; otherwise, it should not.

So the only way forward is to ask the actor for an answer. Hence the new protocol: on connection break and recovery, the caller always retries the task without updating the seqno.

  • If the actor replies "Stale Task" -> the actor received the previous attempt -> the caller updates the seqno and retries. [1]
  • If the actor replies otherwise (OK or another error) -> the actor never received the previous attempt and treats this task as a fresh one -> it just runs.

One note on [1]: this makes the caller retry one more time. That extra attempt should not consume a retry, because the first attempt never reached user code.
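The protocol above can be sketched in a few lines of Python. This is a hypothetical simulation, not Ray's internals: `FakeActor` and `retry_after_reconnect` are illustrative names, and only `StaleTaskError` mirrors the status this PR introduces.

```python
class StaleTaskError(Exception):
    """Actor-side rejection: the seqno is below the actor's low watermark."""

class FakeActor:
    def __init__(self):
        self.low_watermark = 0   # next seqno the actor is waiting for
        self.executed = []

    def submit(self, seqno, payload):
        if seqno < self.low_watermark:
            raise StaleTaskError(seqno)   # already executed before the break
        self.low_watermark = seqno + 1
        self.executed.append(payload)
        return "ok"

def retry_after_reconnect(actor, seqno, payload, fresh_seqno):
    """Caller-side retry after a connection break and recovery."""
    try:
        # New protocol: the first retry always reuses the old seqno ("no update").
        return actor.submit(seqno, payload)
    except StaleTaskError:
        # The actor saw the previous attempt: retry with a fresh seqno. Per the
        # note above, this extra attempt would not consume a user-visible retry,
        # since the first attempt never reached user code.
        return actor.submit(fresh_seqno, payload)

actor = FakeActor()
actor.submit(0, "task-0")             # executed, but the reply is lost
# Reconnect: retry with the old seqno 0, which the actor now calls stale.
result = retry_after_reconnect(actor, 0, "task-0 (retry)", fresh_seqno=1)
print(result, actor.executed)
```

If instead the actor had never received seqno 0, the first `submit` in `retry_after_reconnect` would simply succeed, which is exactly the "fresh task" branch of the protocol.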

Changes:

  • Defined the behavior of actor method retry vs. ordering, and updated the docs.
  • Introduced a new internal Status, StaleTaskError, which the actor sends to the caller when it finds the seqno has already been executed.
  • The caller retries the stale task with an updated seqno, without consuming a retry count.
  • Refactoring: employ RAY_LOG().WithField in many places.
  • Refactoring: added CoreWorker::RetryTask to consolidate code from two places (the wait queue path vs. immediate invocation).
  • Test improvement: removed "log_to_driver": False in test_unavailable_actors.py.

Fixes #46538.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Jul 19, 2024
rynewang added 4 commits July 20, 2024 15:32
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang changed the title from "WIP always update seqno" to "Introducing StaleTaskError" Jul 22, 2024
rynewang added 3 commits July 22, 2024 15:38
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang marked this pull request as ready for review July 23, 2024 18:02
@rynewang rynewang requested review from a team, pcmoritz and raulchen as code owners July 23, 2024 18:02
@jjyao
Contributor

jjyao commented Jul 26, 2024

Isn't there already client_processed_up_to that does the job?

@rynewang
Contributor Author

This depends on a clarification of the policy on method retry ordering.

| Retry case | Policy | Current behavior (no seqno update on unavailable) | In this PR (stale tasks retried anew) | Proposal: always update seqno on unavailable |
|---|---|---|---|---|
| User exception | No keep order; retry at end of queue | No keep order; retry at end of queue | No keep order; retry at end of queue | No keep order; retry at end of queue |
| Actor died | Keep order | Keep order | Keep order | Keep order |
| Actor unavailable (died) | ? | Keep order | Keep order | No keep order; retry at end of queue |
| Actor unavailable (recovered, never received the task) | ? | Keep order | Keep order | No keep order; retry at end of queue |
| Actor unavailable (recovered, received the task) | Can't keep order: actor may have executed later tasks | hang | No keep order; retry at end of queue | No keep order; retry at end of queue |

@rynewang
Contributor Author

friendly ping on this @rkooo567

@rkooo567
Contributor

Sorry, ETA today morning. I should also have much more bandwidth for code review from next week, when I focus on one team...

Contributor

@rkooo567 rkooo567 left a comment

Is it impossible to just always preserve the task order? From the user's perspective, the submission order they defined is the only thing that matters, and I think it should always be preserved (regardless of actor crashes or connection failures). See some of my comments below (and let's discuss in person if I'm missing something).

["actor", "task", "driver"],
)
@pytest.mark.skipif(sys.platform == "win32", reason="does not work on windows")
@pytest.mark.parametrize("ray_start_regular", [{"log_to_driver": False}], indirect=True)
Contributor

why is it removed?

Contributor Author

I don't think this is useful: it disables logs to the driver, but those logs are useful for debugging.

num_oom_retries_left = it->second.num_oom_retries_left;
if (task_failed_due_to_oom) {
if (task_failed_due_to_stale_task) {
// Task failed due to stale task. This can only happen during actor unavailable
Contributor

assert task is an actor task?

// and when you reconnect and retry, the actor already executed the last attempt
// which we consider as "ActorUnavailable" (attempt #1), and then the caller
// retries the task with the same seqno and got "StaleTask" (attempt #2). This
// attempt #2 does not consume a retry, though it does occupy an attempt number.
Contributor

It should not occupy an attempt number, right? From the end user's perspective, this retry is internal. What's the reason behind this?

return result_runtime_env;
}

void CoreWorker::RetryTask(TaskToRetry &task_to_retry) {
Contributor

Suggested change
void CoreWorker::RetryTask(TaskToRetry &task_to_retry) {
void CoreWorker::RetryTask(const TaskToRetry &task_to_retry) {

} else if (status.IsStaleTaskError()) {
// The task is considered stale by the actor. This can happen when the actor receives
// a task, and the connection broke, and the caller resubmits the task with the
// same seqno. This task may be retried out of order as if it's retryable user
Contributor

I am a little confused about this behavior. Since we reorder the execution based on seqno on the receiver side, it should not be possible to retry this out of order, right?

  1. If the task has been accepted before connection break, it is ignored, so it is not out-of-order retried (it is simply ignored)
  2. if the task has not been accepted before connection break, the task with next seqno shouldn't have been executed yet, so the order is preserved

am I missing something here?

Contributor Author

For case (1), it's currently ignored, which makes the process hang. So we need to retry it out of order with a new seqno.
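As a toy illustration of that point (hypothetical names, not Ray code): if the actor silently drops the retried-but-already-executed task without sending any reply, the caller's wait for that reply never completes.

```python
import queue

reply_q = queue.Queue()

def actor_submit(seqno, low_watermark):
    # "Just ignore" semantics: a stale seqno gets no reply at all.
    if seqno < low_watermark:
        return
    reply_q.put(("ok", seqno))

actor_submit(2, low_watermark=4)    # retry of an already-executed task
try:
    reply_q.get(timeout=0.2)        # caller blocks waiting for the reply...
    hung = False
except queue.Empty:
    hung = True                     # ...which never comes: the caller hangs

print("caller hung:", hung)         # caller hung: True
```

An explicit StaleTaskError reply, as in this PR, turns that silent hang into a signal the caller can act on.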

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 17, 2024
@rynewang
Contributor Author

So the whole point of this PR is: when there's a connection break and the client retries with the same seqno, the server may find the task was already executed (and the reply was lost). Meanwhile, some newer seqnos may already have executed, so it's not possible to retry while keeping the order. Example scenario:

(in time order)

  1. Client: Task1(seqno=1), Task2(seqno=2), Task3(seqno=3) submitted
  2. Server: Executed Task1, replied.
  3. Client: bookkeep Task1 finished.
  4. Server: Executing Task2...
  5. conn break
  6. Server: Task2 executed, failed to reply, but effects are visible to users
  7. Server: Executing Task3...
  8. Server: Task3 executed, failed to reply, but effects are visible to users
  9. Client: reconnect, resend pending tasks: Task2(seqno=2), Task3(seqno=3)
  10. Server: ???

Here, the client has to resend Task2 and Task3, and it's not possible to do so "transparently", because the server already saw the previous requests for Task2 and Task3 and executed them; it just did not get a chance to reply. Since the effect is user visible, I'd say the resend must consume a retry, and it has to be out of order (a retry of Task2 has to happen after Task3, since Task3 has already executed).
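A toy replay of the timeline above (illustrative names, not Ray internals) shows why the retries end up out of submission order:

```python
class StaleTaskError(Exception):
    pass

class Actor:
    def __init__(self):
        self.low_watermark = 1
        self.log = []               # execution order visible to the user

    def submit(self, seqno, name):
        if seqno < self.low_watermark:
            raise StaleTaskError(name)
        self.low_watermark = seqno + 1
        self.log.append(name)

actor = Actor()
actor.submit(1, "Task1")            # reply reaches the client
actor.submit(2, "Task2")            # executed, reply lost in the conn break
actor.submit(3, "Task3")            # executed, reply lost in the conn break

# Client reconnects and resends the pending tasks with their old seqnos.
stale = []
for seqno, name in [(2, "Task2-retry"), (3, "Task3-retry")]:
    try:
        actor.submit(seqno, name)
    except StaleTaskError:
        stale.append(name)          # must be re-run with a fresh seqno

# Re-running Task2 now necessarily happens after Task3 has already run,
# i.e. the retry is out of submission order.
for fresh_seqno, name in enumerate(stale, start=4):
    actor.submit(fresh_seqno, name)

print(actor.log)
# ['Task1', 'Task2', 'Task3', 'Task2-retry', 'Task3-retry']
```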

Let's chat offline about this

@rkooo567
Contributor

Server: Task2 executed, failed to reply, but effects are visible to users

I think I mainly don't understand this part. Why is the effect visible to users? (does it raise an exception?)

Client: reconnect, resend pending tasks: Task2(seqno=2), Task3(seqno=3)

Also, since the server already executed tasks 2 and 3, isn't it possible to just make the resend a no-op? From the user's perspective, isn't the ordering already guaranteed (because tasks 2 and 3 were already executed in the right order)?

@rkooo567
Contributor

But yeah, let's talk in person tomorrow. I think it'd be easier to resolve the discussion!

@stale

stale bot commented Feb 25, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 25, 2025
@jjyao
Contributor

jjyao commented Apr 8, 2025

Decided to go with approach #51904

@jjyao jjyao closed this Apr 8, 2025

Labels

  • @author-action-required: The PR author is responsible for the next step. Remove the tag to send back to the reviewer.
  • go: add ONLY when ready to merge, run all tests
  • stale: The issue is stale. It will be closed within 7 days unless there is further conversation.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] The actor task hangs when it is re-submitted

3 participants