
Retry failed peer recovery due to transient errors#55353

Merged
Tim-Brooks merged 32 commits intoelastic:masterfrom
Tim-Brooks:retry_peer_recovery_failures_due_to_overload
Apr 21, 2020

Conversation

@Tim-Brooks
Contributor

Currently a failed peer recovery action will fail the whole recovery. This
includes cases where the recovery fails due to potentially short-lived
transient issues such as rejected executions or circuit breaker
errors.

This commit adds the concept of a retryable action. A retryable action
will be retried in the face of certain errors. The action is retried
after an exponentially increasing backoff period, and after a defined
time it will time out.

This commit only implements retries for responses that indicate the
target node has NOT executed the action.
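The retry scheme described above (retry on transient errors with an exponentially increasing backoff, give up after a timeout) can be sketched as follows. This is an illustrative sketch, not the actual Elasticsearch `RetryableAction` API: the names are hypothetical, and a real implementation would schedule each retry on a thread pool after the backoff delay rather than loop synchronously (here the delays are only recorded, to keep the sketch deterministic).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of a retryable action: retry on retryable errors with
// exponentially increasing backoff, fail once the retry budget
// (the timeout) is exhausted.
public class RetryableActionSketch {

    interface Action<T> {
        T run() throws Exception;
    }

    static <T> T runWithRetries(Action<T> action,
                                Predicate<Exception> isRetryable,
                                long initialDelayMillis,
                                long timeoutMillis,
                                List<Long> observedDelays) throws Exception {
        long delay = initialDelayMillis;
        long elapsed = 0;
        while (true) {
            try {
                return action.run();
            } catch (Exception e) {
                if (!isRetryable.test(e) || elapsed + delay > timeoutMillis) {
                    throw e; // non-retryable error, or timeout exceeded
                }
                // A real implementation would wait `delay` ms here
                // (e.g. via a scheduler); we only record it.
                observedDelays.add(delay);
                elapsed += delay;
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<Long> delays = new ArrayList<>();
        int[] attempts = {0};
        // Fails twice with a (hypothetical) transient rejection, then succeeds.
        String result = runWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("rejected: queue full");
            }
            return "recovered";
        }, e -> e.getMessage().startsWith("rejected"), 50, 10_000, delays);
        System.out.println(result + " after " + attempts[0]
                + " attempts, delays=" + delays);
    }
}
```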

@Tim-Brooks Tim-Brooks added >non-issue :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.8.0 labels Apr 16, 2020
@Tim-Brooks Tim-Brooks requested review from dnhatn and ywelsch April 16, 2020 21:22
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

Member

@dnhatn dnhatn left a comment


This change will be really useful. I have left some comments. Thanks Tim!

* see how many translog ops we accumulate while copying files across the network. A future optimization
* would be in to restart file copy again (new deltas) if we have too many translog ops are piling up.
*/
final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata,
Member


We need to use a separate buffer for each chunk request; otherwise, we will resend a chunk request with data from another chunk. Maybe use a pool so that we do not have to allocate the buffers all the time.

I ran the new test for 100 iterations, but all of them passed. I think we should flush shards before starting the recovery and aggressively change the recovery settings chunk_size_setting and max_concurrent_file_chunks in the test so that we can catch the error.
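The bug described in this comment can be illustrated with a small sketch: if every queued chunk request aliases one shared buffer, a later chunk overwrites the bytes of an earlier, still-unsent one; copying each chunk into its own buffer, drawn from a trivial pool, avoids both the corruption and repeated allocation. All names here are hypothetical, not the Elasticsearch recovery code, and the "send" is modelled as a second loop to stand in for async network sending.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Sketch of the shared-buffer bug and the pooled-copy fix.
// Chunks are queued first (as in async sending) and only "sent" later.
public class ChunkBufferSketch {

    // BUG: every queued request aliases the same buffer, so the last
    // write wins and earlier chunks are resent with the wrong bytes.
    static List<String> sendSharedBuffer(String[] chunks) {
        byte[] shared = new byte[8];
        List<byte[]> queued = new ArrayList<>();
        for (String c : chunks) {
            byte[] src = c.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, shared, 0, src.length);
            queued.add(shared);
        }
        List<String> sent = new ArrayList<>();
        for (byte[] b : queued) {
            // all chunks have equal length in this demo
            sent.add(new String(b, 0, chunks[0].length(), StandardCharsets.UTF_8));
        }
        return sent;
    }

    // FIX: copy each chunk into its own buffer from a trivial pool;
    // the buffer is returned to the pool only after the send completes.
    static List<String> sendPooledCopies(String[] chunks) {
        ArrayDeque<byte[]> pool = new ArrayDeque<>();
        List<byte[]> queued = new ArrayList<>();
        for (String c : chunks) {
            byte[] buf = pool.isEmpty() ? new byte[8] : pool.poll();
            byte[] src = c.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, buf, 0, src.length);
            queued.add(buf); // each request owns its copy until sent
        }
        List<String> sent = new ArrayList<>();
        for (byte[] b : queued) {
            sent.add(new String(b, 0, chunks[0].length(), StandardCharsets.UTF_8));
            pool.add(b); // recycle after the send
        }
        return sent;
    }

    public static void main(String[] args) {
        String[] chunks = {"chunk-A", "chunk-B"};
        System.out.println("shared: " + sendSharedBuffer(chunks));
        System.out.println("pooled: " + sendPooledCopies(chunks));
    }
}
```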

Contributor


good catch

Contributor Author


I was able to get consistent failures by increasing the documents and flushing in the middle.

Contributor Author


I explored removing the shared ~512K buffer, but I think it is so deeply embedded in the multi-file transfer code that it is probably a PR of its own. For now I just copied each chunk into a pooled big array.

Member


I am fine with this implementation in this PR, with the optimization deferred to a follow-up. I hope it won't slow down the recovery.

* see how many translog ops we accumulate while copying files across the network. A future optimization
* would be in to restart file copy again (new deltas) if we have too many translog ops are piling up.
*/
final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata,
Contributor


good catch

Contributor

@original-brownbear original-brownbear left a comment


LGTM, thanks!

@Tim-Brooks Tim-Brooks merged commit 4ed0dc8 into elastic:master Apr 21, 2020
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Apr 28, 2020
Currently a failed peer recovery action will fail the whole recovery. This
includes cases where the recovery fails due to potentially short-lived
transient issues such as rejected executions or circuit breaker
errors.

This commit adds the concept of a retryable action. A retryable action
will be retried in the face of certain errors. The action is retried
after an exponentially increasing backoff period, and after a defined
time it will time out.

This commit only implements retries for responses that indicate the
target node has NOT executed the action.
dnhatn added a commit that referenced this pull request May 5, 2020
A follow-up of #55353 to avoid copying file chunks before sending 
them to the network layer.

Relates #55353
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 5, 2020
A follow-up of elastic#55353 to avoid copying file chunks before sending
them to the network layer.

Relates elastic#55353
dnhatn added a commit that referenced this pull request May 5, 2020
A follow-up of #55353 to avoid copying file chunks before sending
them to the network layer.

Relates #55353
@mfussenegger mfussenegger mentioned this pull request May 13, 2020
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
tlrx added a commit that referenced this pull request Feb 1, 2022
… settings (#83354)

The setting indices.recovery.internal_action_retry_timeout was added in 
#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates #55353
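The follow-up commits above hinge on a subtle point: marking a setting as dynamic only lets the cluster accept updates for it; some component must also register an update consumer, or the new value never reaches the code that uses it. The sketch below illustrates that missing plumbing with hypothetical classes, not the actual Elasticsearch `ClusterSettings` infrastructure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal sketch: a settings registry accepts runtime updates, but a
// component only sees them if it registered an update consumer.
public class DynamicSettingSketch {

    static class SettingsRegistry {
        private final Map<String, String> values = new HashMap<>();
        private final Map<String, Consumer<String>> consumers = new HashMap<>();

        void register(String key, String defaultValue) {
            values.put(key, defaultValue);
        }

        // The "plumbing": without this call, applyUpdate changes the
        // stored value but the component keeps using its old copy.
        void addUpdateConsumer(String key, Consumer<String> consumer) {
            consumers.put(key, consumer);
        }

        void applyUpdate(String key, String newValue) {
            values.put(key, newValue);
            Consumer<String> c = consumers.get(key);
            if (c != null) {
                c.accept(newValue);
            }
        }
    }

    public static void main(String[] args) {
        SettingsRegistry registry = new SettingsRegistry();
        registry.register("indices.recovery.internal_action_retry_timeout", "1m");

        String[] effectiveTimeout = {"1m"}; // what the component actually uses
        // Remove the next call to reproduce the bug: the update is accepted
        // by the registry but never reaches the component.
        registry.addUpdateConsumer("indices.recovery.internal_action_retry_timeout",
                v -> effectiveTimeout[0] = v);

        registry.applyUpdate("indices.recovery.internal_action_retry_timeout", "5m");
        System.out.println("effective timeout: " + effectiveTimeout[0]);
    }
}
```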
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
… settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in 
elastic#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates elastic#55353
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
… settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in
elastic#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates elastic#55353
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
…list of settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in
elastic#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates elastic#55353
Backport of elastic#83354
elasticsearchmachine pushed a commit that referenced this pull request Feb 1, 2022
… settings (#83354) (#83367)

The setting indices.recovery.internal_action_retry_timeout was added in 
#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates #55353
elasticsearchmachine pushed a commit that referenced this pull request Feb 1, 2022
…list of settings (#83354) (#83368)

The setting indices.recovery.internal_action_retry_timeout was added in
#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates #55353
Backport of #83354

Labels

:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v7.8.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants