Retry failed peer recovery due to transient errors#55353
Retry failed peer recovery due to transient errors#55353Tim-Brooks merged 32 commits intoelastic:masterfrom
Conversation
…ry_failures_due_to_overload
|
Pinging @elastic/es-distributed (:Distributed/Recovery) |
dnhatn
left a comment
There was a problem hiding this comment.
This change will be really useful. I have left some comments. Thanks Tim!
| * see how many translog ops we accumulate while copying files across the network. A future optimization | ||
| * would be in to restart file copy again (new deltas) if we have too many translog ops are piling up. | ||
| */ | ||
| final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata, |
There was a problem hiding this comment.
We need to use a separate buffer for each chunk request; otherwise, we will resend a chunk request with data from the other chunk. Maybe use a pool so that we do not have to allocate the buffers all the time.
I ran the new test 100 iterations, but all of them passed. I think we should flush shards before starting recovery and aggressively change the recovery settings chunk_size_setting and max_concurrent_file_chunks in the test so that we can catch the error.
There was a problem hiding this comment.
I was able to get consistent failures by increasing the documents and flushing in the middle.
There was a problem hiding this comment.
I explored removing the shared ~512K buffer. But I think that is so deeply embedded in the multi file transfer thing, that it is probably a PR of its own. I just copied to a pooled big array.
There was a problem hiding this comment.
I am fine with this implementation in this PR, and defer the optimization in a follow-up. I hope it won't slow down the recovery.
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/support/RetryableAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/support/RetryableAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/support/RetryableAction.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/support/RetryableAction.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Outdated
Show resolved
Hide resolved
| * see how many translog ops we accumulate while copying files across the network. A future optimization | ||
| * would be in to restart file copy again (new deltas) if we have too many translog ops are piling up. | ||
| */ | ||
| final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata, |
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/support/RetryableAction.java
Show resolved
Hide resolved
…ry_failures_due_to_overload
original-brownbear
left a comment
There was a problem hiding this comment.
LGTM, thanks!
…ry_failures_due_to_overload
…ry_failures_due_to_overload
Currently a failed peer recovery action will fail an recovery. This includes when the recovery fails due to potentially short lived transient issues such as rejected exceptions or circuit breaking errors. This commit adds the concept of a retryable action. A retryable action will be retryed in face of certain errors. The action will be retried after an exponentially increasing backoff period. After defined time, the action will timeout. This commit only implements retries for responses that indicate the target node has NOT executed the action.
A follow-up of elastic#55353 to avoid copying file chunks before sending them to the network layer. Relates elastic#55353
… settings (elastic#83354) The setting indices.recovery.internal_action_retry_timeout was added in elastic#55353 as a dynamic setting but the necessary plumbing to make it dynamically updateable is not here. Relates elastic#55353
… settings (elastic#83354) The setting indices.recovery.internal_action_retry_timeout was added in dynamically updateable is not here. Relates elastic#55353
…list of settings (elastic#83354) The setting indices.recovery.internal_action_retry_timeout was added in elastic#55353 as a dynamic setting but the necessary plumbing to make it dynamically updateable is not here. Relates elastic#55353 Backport of elastic#83354
Currently a failed peer recovery action will fail an recovery. This
includes when the recovery fails due to potentially short lived
transient issues such as rejected exceptions or circuit breaking
errors.
This commit adds the concept of a retryable action. A retryable action
will be retryed in face of certain errors. The action will be retried
after an exponentially increasing backoff period. After defined time,
the action will timeout.
This commit only implements retries for responses that indicate the
target node has NOT executed the action.