[CCR] Improve shard follow task's retryable error handling (#33371)
Conversation
Pinging @elastic/es-distributed
Improve failure handling of retryable errors by retrying remote calls in an exponential-backoff-like manner. The delay between retries will not be longer than the configured max retry delay, and retryable errors will be retried indefinitely. Relates to elastic#30086
jasontedor left a comment:
I left a comment about the delay computation.
```java
private boolean shouldRetry(Exception e) {

static long computeDelay(int currentRetry, long maxRetryDelayInMillis) {
    long expectedBackOff = Math.round(10 * Math.exp(0.8d * currentRetry) - 1);
```
This is not quite what I think of as exponential backoff. The problem that I see with an implementation like this is that we can have a thundering herd problem. If there is some failure that causes a bunch of tasks to simultaneously fail (e.g., say that we have a bunch of outstanding fetch tasks waiting for a response, and the network connection breaks, failing all of them), all of the retries will keep waking up at the same time, itself potentially causing issues due to a herd.
Typically there is a random component in exponential backoff to avoid this herding. As a first approach, what I suggest here is: choose a random value k between 0 and 2^(number of retries) - 1, then retry after k * delay seconds. We can cap this at the max retry delay.
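That suggestion could be sketched roughly as follows. This is a minimal, self-contained sketch, not the PR's actual code: the base delay constant, the clamp value, and the use of `ThreadLocalRandom` are all illustrative assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {

    // Illustrative base delay; not a constant taken from the PR.
    static final long BASE_DELAY_MILLIS = 50;

    // Pick a random k in [0, 2^exponent - 1] and wait k * base, capped at
    // maxRetryDelayInMillis, so tasks that all fail at the same moment
    // spread their retries out instead of waking up again as a herd.
    static long computeDelay(int currentRetry, long maxRetryDelayInMillis) {
        int exponent = Math.min(currentRetry, 24); // clamp so the shift cannot overflow
        long bound = 1L << exponent;               // 2^exponent
        long k = ThreadLocalRandom.current().nextLong(bound); // uniform in [0, bound - 1]
        return Math.min(k * BASE_DELAY_MILLIS, maxRetryDelayInMillis);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 10; retry++) {
            long delay = computeDelay(retry, 5_000);
            System.out.println("retry " + retry + " -> " + delay + " ms");
        }
    }
}
```

Because k can be 0, an individual retry may fire immediately; on average, though, later retries wait longer, and no retry ever waits past the cap.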
Right, I did think about this, but when looking at ExponentialBackoffIterator class to see how it was implemented elsewhere I decided to use that. I will change this.
I've changed how the delay is computed. With this randomness the retrying mechanism is much more solid.
```java
private boolean shouldRetry(Exception e) {

static long computeDelay(int currentRetry, long maxRetryDelayInMillis) {
    long n = Math.round(Math.pow(2, currentRetry - 1));
```
I think that we need to guard against overflow here!
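One way to guard against that, sketched here under the assumption that the retry counter keeps growing without bound: clamp the exponent before computing 2^(currentRetry - 1). Beyond roughly 2^30 the factor multiplied by any base delay already exceeds every sensible max retry delay, so the cap applied afterwards makes the clamp lossless. The class and the clamp value are illustrative, not the PR's code.

```java
public class OverflowSafeDelay {

    // Compute 2^(currentRetry - 1) without ever overflowing a long by
    // clamping the exponent; the caller caps the final delay anyway.
    static long backoffFactor(int currentRetry) {
        int exponent = Math.min(Math.max(currentRetry - 1, 0), 30);
        return 1L << exponent;
    }

    public static void main(String[] args) {
        System.out.println(backoffFactor(1));    // prints 1: small retry counts behave as before
        System.out.println(backoffFactor(5));    // prints 16
        System.out.println(backoffFactor(1000)); // prints 1073741824: huge retry counts stay clamped
    }
}
```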
```java
 */
public abstract class ShardFollowNodeTask extends AllocatedPersistentTask {

    public static final SecureRandom RANDOM_INSTANCE = new SecureRandom();
```
We can use Randomness#get here.
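For context on why: backoff jitter only needs statistical randomness, not cryptographic strength, and a plain seeded `java.util.Random` (the kind of source Elasticsearch's `Randomness#get` is designed to hand back) keeps delays reproducible under test seeds, which a `SecureRandom` deliberately does not. A small hedged sketch of that idea, with names of my own choosing:

```java
import java.util.Random;

public class JitterSource {

    // Derive a bounded jitter value from a plain Random. java.util.Random
    // has no nextLong(bound), so scale nextDouble() instead.
    static long jitter(Random random, long bound) {
        return (long) (random.nextDouble() * bound);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed => reproducible delays in tests
        for (int i = 0; i < 3; i++) {
            System.out.println(jitter(random, 1000));
        }
    }
}
```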
```java
        "] times, aborting...", e));
    }
    if (shouldRetry(e) && isStopped() == false) {
        LOGGER.debug(new ParameterizedMessage("{} error during follow shard task, retrying...", params.getFollowShardId()), e);
```
We should debug log here the number of times that we have retried up to here.
…w_task_retryable_failure_handling
@jasontedor I've updated the PR.
* master: (91 commits)
  Preserve cluster settings on full restart tests (elastic#33590)
  Use IndexWriter.getFlushingBytes() rather than tracking it ourselves (elastic#33582)
  Fix upgrading of list settings (elastic#33589)
  Add read-only Engine (elastic#33563)
  HLRC: Add ML get categories API (elastic#33465)
  SQL: Adds MONTHNAME, DAYNAME and QUARTER functions (elastic#33411)
  Add predicate_token_filter (elastic#33431)
  Fix Replace function. Adds more tests to all string functions. (elastic#33478)
  [ML] Rename input_fields to column_names in file structure (elastic#33568)
  Add full cluster restart base class (elastic#33577)
  Validate list values for settings (elastic#33503)
  Copy and validatie soft-deletes setting on resize (elastic#33517)
  Test: Fix package name
  SQL: Fix result column names for arithmetic functions (elastic#33500)
  Upgrade to latest Lucene snapshot (elastic#33505)
  Enable not wiping cluster settings after REST test (elastic#33575)
  MINOR: Remove Dead Code in SearchScript (elastic#33569)
  [Test] Remove duplicate method in TestShardRouting (elastic#32815)
  mute test on windows
  Update beats template to include apm-server metrics (elastic#33286)
  ...
Separately I think we want some validation on the max retry delay, for example:

```diff
diff --git a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java b/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
index 26a9d11eb32..4b235b55991 100644
--- a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
+++ b/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
@@ -62,6 +62,8 @@ import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.atomic.AtomicReferenceArray;
 import java.util.stream.Collectors;
 
+import static org.elasticsearch.action.ValidateActions.addValidationError;
+
 public class FollowIndexAction extends Action<AcknowledgedResponse> {
 
     public static final FollowIndexAction INSTANCE = new FollowIndexAction();
@@ -210,9 +212,19 @@ public class FollowIndexAction extends Action<AcknowledgedResponse> {
             return maxBatchOperationCount;
         }
 
+        private static final TimeValue MAX_MAX_RETRY_DELAY = TimeValue.timeValueMinutes(5);
+
         @Override
         public ActionRequestValidationException validate() {
-            return null;
+            ActionRequestValidationException e = null;
+            if (maxRetryDelay.millis() <= 0) {
+                e = addValidationError("max_retry_delay must be positive but was [" +
+                    maxRetryDelay.getStringRep() + "]", e);
+            }
+            if (maxRetryDelay.millis() > MAX_MAX_RETRY_DELAY.millis()) {
+                e = addValidationError("max_retry_delay must be less than [" + MAX_MAX_RETRY_DELAY +
+                    "] but was [" + maxRetryDelay.getStringRep() + "]", e);
+            }
+            return e;
         }
 
         @Override
```

What do you think?
…w_task_retryable_failure_handling
Improve failure handling of retryable errors by retrying remote calls in an exponential-backoff-like manner. The delay between retries will not be longer than the configured max retry delay, and retryable errors will be retried indefinitely. Relates to #30086
* master: (43 commits)
  [HLRC][ML] Add ML put datafeed API to HLRC (elastic#33603)
  Update AWS SDK to 1.11.406 in repository-s3 (elastic#30723)
  Expose CCR stats to monitoring (elastic#33617)
  [Docs] Update match-query.asciidoc (elastic#33610)
  TEST: Adjust rollback condition when shard is empty
  [CCR] Improve shard follow task's retryable error handling (elastic#33371)
  Forbid negative `weight` in Function Score Query (elastic#33390)
  Clarify context suggestions filtering and boosting (elastic#33601)
  Disable CCR REST endpoints if CCR disabled (elastic#33619)
  Lower version on full cluster restart settings test
  Upgrade remote cluster settings (elastic#33537)
  NETWORKING: http.publish_host Should Contain CNAME (elastic#32806)
  Add test coverage for global checkpoint listeners
  Reset replica engine to global checkpoint on promotion (elastic#33473)
  HLRC: ML Delete Forecast API (elastic#33526)
  Remove debug logging in full cluster restart tests (elastic#33612)
  Expose CCR to the transport client (elastic#33608)
  Mute testIndexDeletionWhenNodeRejoins
  SQL: Make Literal a NamedExpression (elastic#33583)
  [DOCS] Adds missing built-in user information (elastic#33585)
  ...
Improve the failure handling of retryable errors by retrying remote calls in
an exponential-backoff-like manner. The delay between retries will not be
longer than the configured max retry delay, and retryable errors will be
retried indefinitely.
Relates to #30086