
Retry failed peer recovery due to transient errors#55353

Merged
Tim-Brooks merged 32 commits intoelastic:masterfrom
Tim-Brooks:retry_peer_recovery_failures_due_to_overload
Apr 21, 2020

Conversation

@Tim-Brooks
Contributor

Currently a failed peer recovery action will fail the whole recovery. This
includes cases where the recovery fails due to potentially short-lived
transient issues such as rejected executions or circuit breaker
errors.

This commit adds the concept of a retryable action. A retryable action
will be retried in the face of certain errors. The action is retried
after an exponentially increasing backoff period, and after a defined
time it will time out.

This commit only implements retries for responses that indicate the
target node has NOT executed the action.
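The retry scheme described above (retry on transient errors with an exponentially increasing backoff, give up after a timeout) can be sketched as follows. This is an illustrative sketch, not the actual Elasticsearch `RetryableAction` API: the names are hypothetical, and a real implementation would schedule each retry on a thread pool after the backoff delay rather than loop synchronously (here the delays are only recorded, to keep the sketch deterministic).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of a retryable action: retry on retryable errors with
// exponentially increasing backoff, fail once the retry budget
// (the timeout) is exhausted.
public class RetryableActionSketch {

    interface Action<T> {
        T run() throws Exception;
    }

    static <T> T runWithRetries(Action<T> action,
                                Predicate<Exception> isRetryable,
                                long initialDelayMillis,
                                long timeoutMillis,
                                List<Long> observedDelays) throws Exception {
        long delay = initialDelayMillis;
        long elapsed = 0;
        while (true) {
            try {
                return action.run();
            } catch (Exception e) {
                if (!isRetryable.test(e) || elapsed + delay > timeoutMillis) {
                    throw e; // non-retryable error, or timeout exceeded
                }
                // A real implementation would wait `delay` ms here
                // (e.g. via a scheduler); we only record it.
                observedDelays.add(delay);
                elapsed += delay;
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<Long> delays = new ArrayList<>();
        int[] attempts = {0};
        // Fails twice with a (hypothetical) transient rejection, then succeeds.
        String result = runWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("rejected: queue full");
            }
            return "recovered";
        }, e -> e.getMessage().startsWith("rejected"), 50, 10_000, delays);
        System.out.println(result + " after " + attempts[0]
                + " attempts, delays=" + delays);
    }
}
```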

@Tim-Brooks Tim-Brooks added >non-issue :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.8.0 labels Apr 16, 2020
@Tim-Brooks Tim-Brooks requested review from dnhatn and ywelsch April 16, 2020 21:22
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

Member

@dnhatn dnhatn left a comment


This change will be really useful. I have left some comments. Thanks Tim!

* see how many translog ops we accumulate while copying files across the network. A future optimization
* would be in to restart file copy again (new deltas) if we have too many translog ops are piling up.
*/
final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata,
Member


We need to use a separate buffer for each chunk request; otherwise, we will resend a chunk request with data from another chunk. Maybe use a pool so that we do not have to allocate the buffers all the time.

I ran the new test for 100 iterations, but all of them passed. I think we should flush shards before starting the recovery and aggressively change the recovery settings chunk_size_setting and max_concurrent_file_chunks in the test so that we can catch the error.
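The bug described in this comment can be illustrated with a small sketch: if every queued chunk request aliases one shared buffer, a later chunk overwrites the bytes of an earlier, still-unsent one; copying each chunk into its own buffer, drawn from a trivial pool, avoids both the corruption and repeated allocation. All names here are hypothetical, not the Elasticsearch recovery code, and the "send" is modelled as a second loop to stand in for async network sending.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Sketch of the shared-buffer bug and the pooled-copy fix.
// Chunks are queued first (as in async sending) and only "sent" later.
public class ChunkBufferSketch {

    // BUG: every queued request aliases the same buffer, so the last
    // write wins and earlier chunks are resent with the wrong bytes.
    static List<String> sendSharedBuffer(String[] chunks) {
        byte[] shared = new byte[8];
        List<byte[]> queued = new ArrayList<>();
        for (String c : chunks) {
            byte[] src = c.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, shared, 0, src.length);
            queued.add(shared);
        }
        List<String> sent = new ArrayList<>();
        for (byte[] b : queued) {
            // all chunks have equal length in this demo
            sent.add(new String(b, 0, chunks[0].length(), StandardCharsets.UTF_8));
        }
        return sent;
    }

    // FIX: copy each chunk into its own buffer from a trivial pool;
    // the buffer is returned to the pool only after the send completes.
    static List<String> sendPooledCopies(String[] chunks) {
        ArrayDeque<byte[]> pool = new ArrayDeque<>();
        List<byte[]> queued = new ArrayList<>();
        for (String c : chunks) {
            byte[] buf = pool.isEmpty() ? new byte[8] : pool.poll();
            byte[] src = c.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, buf, 0, src.length);
            queued.add(buf); // each request owns its copy until sent
        }
        List<String> sent = new ArrayList<>();
        for (byte[] b : queued) {
            sent.add(new String(b, 0, chunks[0].length(), StandardCharsets.UTF_8));
            pool.add(b); // recycle after the send
        }
        return sent;
    }

    public static void main(String[] args) {
        String[] chunks = {"chunk-A", "chunk-B"};
        System.out.println("shared: " + sendSharedBuffer(chunks));
        System.out.println("pooled: " + sendPooledCopies(chunks));
    }
}
```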

Contributor


good catch

Contributor Author


I was able to get consistent failures by increasing the documents and flushing in the middle.

Contributor Author


I explored removing the shared ~512K buffer, but I think it is so deeply embedded in the multi-file transfer code that it is probably a PR of its own. For now I just copied each chunk into a pooled big array.

Member


I am fine with this implementation in this PR, with the optimization deferred to a follow-up. I hope it won't slow down the recovery.

* see how many translog ops we accumulate while copying files across the network. A future optimization
* would be in to restart file copy again (new deltas) if we have too many translog ops are piling up.
*/
final RecoveryFileChunkRequest request = new RecoveryFileChunkRequest(recoveryId, shardId, fileMetadata,
Contributor


good catch

Contributor

@original-brownbear original-brownbear left a comment


LGTM, thanks!

@Tim-Brooks Tim-Brooks merged commit 4ed0dc8 into elastic:master Apr 21, 2020
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Apr 28, 2020
Currently a failed peer recovery action will fail the whole recovery. This
includes cases where the recovery fails due to potentially short-lived
transient issues such as rejected executions or circuit breaker
errors.

This commit adds the concept of a retryable action. A retryable action
will be retried in the face of certain errors. The action is retried
after an exponentially increasing backoff period, and after a defined
time it will time out.

This commit only implements retries for responses that indicate the
target node has NOT executed the action.
dnhatn added a commit that referenced this pull request May 5, 2020
A follow-up of #55353 to avoid copying file chunks before sending 
them to the network layer.

Relates #55353
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 5, 2020
A follow-up of elastic#55353 to avoid copying file chunks before sending
them to the network layer.

Relates elastic#55353
dnhatn added a commit that referenced this pull request May 5, 2020
A follow-up of #55353 to avoid copying file chunks before sending
them to the network layer.

Relates #55353
@mfussenegger mfussenegger mentioned this pull request May 13, 2020
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
tlrx added a commit that referenced this pull request Feb 1, 2022
… settings (#83354)

The setting indices.recovery.internal_action_retry_timeout was added in 
#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates #55353
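The follow-up commits above hinge on a subtle point: marking a setting as dynamic only lets the cluster accept updates for it; some component must also register an update consumer, or the new value never reaches the code that uses it. The sketch below illustrates that missing plumbing with hypothetical classes, not the actual Elasticsearch `ClusterSettings` infrastructure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal sketch: a settings registry accepts runtime updates, but a
// component only sees them if it registered an update consumer.
public class DynamicSettingSketch {

    static class SettingsRegistry {
        private final Map<String, String> values = new HashMap<>();
        private final Map<String, Consumer<String>> consumers = new HashMap<>();

        void register(String key, String defaultValue) {
            values.put(key, defaultValue);
        }

        // The "plumbing": without this call, applyUpdate changes the
        // stored value but the component keeps using its old copy.
        void addUpdateConsumer(String key, Consumer<String> consumer) {
            consumers.put(key, consumer);
        }

        void applyUpdate(String key, String newValue) {
            values.put(key, newValue);
            Consumer<String> c = consumers.get(key);
            if (c != null) {
                c.accept(newValue);
            }
        }
    }

    public static void main(String[] args) {
        SettingsRegistry registry = new SettingsRegistry();
        registry.register("indices.recovery.internal_action_retry_timeout", "1m");

        String[] effectiveTimeout = {"1m"}; // what the component actually uses
        // Remove the next call to reproduce the bug: the update is accepted
        // by the registry but never reaches the component.
        registry.addUpdateConsumer("indices.recovery.internal_action_retry_timeout",
                v -> effectiveTimeout[0] = v);

        registry.applyUpdate("indices.recovery.internal_action_retry_timeout", "5m");
        System.out.println("effective timeout: " + effectiveTimeout[0]);
    }
}
```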
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
… settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in 
elastic#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates elastic#55353
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
… settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in
elastic#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates elastic#55353
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 1, 2022
…list of settings (elastic#83354)

The setting indices.recovery.internal_action_retry_timeout was added in
elastic#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates elastic#55353
Backport of elastic#83354
elasticsearchmachine pushed a commit that referenced this pull request Feb 1, 2022
… settings (#83354) (#83367)

The setting indices.recovery.internal_action_retry_timeout was added in 
#55353 as a dynamic setting but the necessary plumbing to make it 
dynamically updateable is not here.

Relates #55353
elasticsearchmachine pushed a commit that referenced this pull request Feb 1, 2022
…list of settings (#83354) (#83368)

The setting indices.recovery.internal_action_retry_timeout was added in
#55353 as a dynamic setting but the necessary plumbing to make it
dynamically updateable is not here.

Relates #55353
Backport of #83354

Labels

:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v7.8.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants