Conversation
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
This depends on a clarification on the policy on method retry ordering.
friendly ping on this @rkooo567
Sorry, ETA is today morning. I should also have much more bandwidth for code review from next week, when I focus on one team...
rkooo567 left a comment:
Is it impossible to just always preserve the task order? From the user's perspective, the submission order they defined is the only thing that matters, and I think it should always be preserved (regardless of actor crashes or connection failures). See some of my comments below (and let's discuss in person if I am missing something).
```python
    ["actor", "task", "driver"],
)
@pytest.mark.skipif(sys.platform == "win32", reason="does not work on windows")
@pytest.mark.parametrize("ray_start_regular", [{"log_to_driver": False}], indirect=True)
```

I don't think this is useful. It disables logs to the driver, but those logs are useful for debugging.
```cpp
num_oom_retries_left = it->second.num_oom_retries_left;
if (task_failed_due_to_oom) {
if (task_failed_due_to_stale_task) {
  // Task failed due to stale task. This can only happen during actor unavailable
```

assert task is an actor task?
```cpp
// and when you reconnect and retry, the actor already executed the last attempt
// which we consider as "ActorUnavailable" (attempt #1), and then the caller
// retries the task with the same seqno and got "StaleTask" (attempt #2). This
// attempt #2 does not consume a retry, though it does occupy an attempt number.
```

It should not occupy an attempt number, right? From the end user's perspective, this retry is internal? What's the reason behind this?
```cpp
  return result_runtime_env;
}

void CoreWorker::RetryTask(TaskToRetry &task_to_retry) {
```

Suggested change:

```cpp
void CoreWorker::RetryTask(const TaskToRetry &task_to_retry) {
```
```cpp
} else if (status.IsStaleTaskError()) {
  // The task is considered stale by the actor. This can happen when the actor
  // receives a task, and the connection broke, and the caller resubmits the task
  // with the same seqno. This task may be retried out of order as if it's a retryable user
```
I am a little confused about this behavior. Since we reorder the execution based on seqno on the receiver side, it should not be possible to retry this out of order, right?
- If the task was accepted before the connection break, it is ignored, so it is not retried out of order (it is simply ignored).
- If the task was not accepted before the connection break, the task with the next seqno shouldn't have been executed yet, so the order is preserved.

Am I missing something here?
For case (1) it's now ignored and makes the process hang. So we need to use a new seqno to retry it out of order.
So the whole point of this PR is: when there's a connection break and the client retries with the same seqno, the server may find it already executed the task (and failed to reply). Meanwhile, some newer seqnos may already have executed, so it's not possible to retry while keeping the order. Example scenario (in time order):

Here, the client has to resend Task2 and Task3, and it's not possible to do it "transparently" because the server already saw the previous requests for Task2 and Task3 and executed them, just did not have a chance to reply. Since the effect is user visible, I'd say it must consume a retry, and it has to be out of order (a retry of Task2 has to happen after Task3, since Task3 has already executed). Let's chat offline about this.
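The stale-task rejection in the scenario above can be sketched with a toy receiver (hypothetical names, not Ray's actual classes): tasks execute in seqno order and advance a low watermark, and a resent attempt whose seqno has fallen below the watermark is answered with "StaleTask" instead of being re-executed.

```python
class ToyActorReceiver:
    """Toy actor-side receiver: executes tasks in seqno order, tracks a low watermark."""

    def __init__(self):
        self.low_watermark = 0   # smallest seqno still acceptable
        self.executed = []

    def receive(self, seqno, task):
        if seqno < self.low_watermark:
            return "StaleTask"   # an attempt at this seqno already executed
        self.executed.append(task)
        self.low_watermark = seqno + 1
        return "OK"

actor = ToyActorReceiver()
actor.receive(0, "Task1")        # reply delivered
actor.receive(1, "Task2")        # executed, but suppose the reply was lost
actor.receive(2, "Task3")        # executed, reply also lost
# Connection recovers; the caller resends Task2 with its original seqno:
assert actor.receive(1, "Task2") == "StaleTask"
```

The toy shows why a same-seqno retry cannot be accepted once the watermark has passed it: the only options left are to ignore it (hanging the caller) or to answer "StaleTask" so the caller can retry under a new seqno.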
I think I mainly don't understand this part. Why is the effect visible to users? (Does it raise an exception?)
Also, since the server already executed tasks 2 and 3, isn't it possible to just make the retry a no-op? From the user's perspective, isn't the ordering already guaranteed (because tasks 2 and 3 were already executed in the right order)?
But yeah, let's talk in person tomorrow. I think it'd be easier to resolve the discussion!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Decided to go with approach #51904
To invoke methods on an actor, the caller maintains a seqno-indexed task queue to guarantee that the task invocation order equals the task submission order. When the actor is restarted, the tasks are sent to the new actor with the same seqnos to keep the order.
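As a rough illustration (hypothetical names, not Ray's implementation), the caller-side bookkeeping could look like this: each submission takes the next seqno, in-flight tasks are kept until acknowledged, and after an actor restart the unacknowledged tasks are resent in ascending seqno order.

```python
import itertools

class CallerQueue:
    """Hypothetical caller-side bookkeeping: a seqno-indexed in-flight task map."""

    def __init__(self):
        self._next_seqno = itertools.count()
        self._inflight = {}  # seqno -> task payload, kept until acknowledged

    def submit(self, task):
        seqno = next(self._next_seqno)
        self._inflight[seqno] = task
        return seqno  # sent alongside the task RPC

    def ack(self, seqno):
        # The actor replied; this task will never need to be resent.
        self._inflight.pop(seqno, None)

    def resend_after_restart(self):
        # Same seqnos, ascending order: invocation order == submission order.
        return sorted(self._inflight.items())

q = CallerQueue()
q.submit("t0"); q.submit("t1"); q.submit("t2")
q.ack(0)  # only t0's reply arrived before the actor died
assert q.resend_after_restart() == [(1, "t1"), (2, "t2")]
```

Keeping the original seqnos on resend is what lets a freshly restarted actor replay the tasks in submission order; the problem described next arises only when the old actor process actually executed some of the unacknowledged tasks.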
However, for an unavailable actor, after it reconnects, the caller may discover that the tasks were in fact sent to the actor and executed; they just did not get a chance to reply before the connection break. The actor then rejects those tasks as "stale tasks", meaning their seqnos are below the current low watermark the actor is waiting for.
The caller should not always bump the seqno either, because the actor may never have received the previous attempt's seqno and would wait for it forever, hanging.
The crux of the issue is that when a connection break happens and recovers, the caller has no way of knowing whether the actor received the previous task attempts or not. If it did, we should update the seqno; otherwise, we should not.
So the only solution moving forward is to ask the actor for an answer. Hence the new protocol: on connection break and recovery, the caller always retries the task without updating the seqno, letting the actor answer; if the actor rejects the attempt as stale, the caller retries once more with a new seqno [1].
One note on [1]: this makes the caller retry one more time. That extra attempt should not consume a retry because the task never reached user code.
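The caller-side decision described above can be sketched as follows (illustrative names, not Ray's API): after a reconnect the caller first resends with the original seqno, and only a "StaleTask" answer makes it take a fresh seqno, without touching the user-configured retry budget.

```python
from dataclasses import dataclass
import itertools

_fresh_seqnos = itertools.count(100)  # pretend new seqnos continue from here

@dataclass
class Task:
    payload: str
    seqno: int

def next_action(status, task):
    """Decide what the caller does after `status` comes back for `task`."""
    if status == "OK":
        return ("done", task.seqno)
    if status == "ActorUnavailable":
        # Attempt after reconnect: keep the seqno and let the actor decide.
        return ("resend", task.seqno)
    if status == "StaleTask":
        # The actor already executed the earlier attempt, so the old seqno can
        # never be accepted again; take a fresh one. This internal attempt does
        # not consume a user-configured retry.
        task.seqno = next(_fresh_seqnos)
        return ("resend", task.seqno)
    return ("fail", task.seqno)

t = Task("Task2", seqno=1)
assert next_action("ActorUnavailable", t) == ("resend", 1)   # attempt #1
assert next_action("StaleTask", t) == ("resend", 100)        # attempt #2, new seqno
```

The two resends correspond to attempts #1 and #2 in the code comment above: the first "asks" the actor whether the old attempt landed, and the second carries the new seqno when the answer is "StaleTask".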
Changes:
- Use `RAY_LOG().WithField` in many places.
- Add `CoreWorker::RetryTask` to consolidate code in 2 places (in a wait queue vs. invoke right away).
- Add `"log_to_driver": False` in `test_unavailable_actors.py`.

Fixes #46538.