Do not mark bulk indexing requests as retried after primary relocations by pxsalehi · Pull Request #142157 · elastic/elasticsearch

pxsalehi · 2026-02-09T16:49:56Z

I've added a flag to RetryOnPrimaryException that indicates whether the request to be retried might already have run. If true (the default value, and the behaviour before this PR), the request is marked as a retry. If false, it is safe to run the request as if it were not a retry, so we skip marking it. This should affect only one specific retry path in TransportReplicationAction.
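A minimal sketch of the idea in the description, written in plain Java with no Elasticsearch dependencies. The class name `RetryOnPrimaryException` and the `possiblyExecuted` name come from this PR's discussion; `SketchRequest`, `prepareRetry`, and the constructor shapes are assumptions made for illustration and may not match the merged code.

```java
// Simplified, hypothetical stand-ins for the real Elasticsearch classes.
final class RetryFlagSketch {

    /** Stand-in for ReplicationOperation.RetryOnPrimaryException carrying the new flag. */
    static final class RetryOnPrimaryException extends RuntimeException {
        private final boolean possiblyExecuted;

        RetryOnPrimaryException(String message) {
            this(message, true); // default preserves the behaviour before the PR
        }

        RetryOnPrimaryException(String message, boolean possiblyExecuted) {
            super(message);
            this.possiblyExecuted = possiblyExecuted;
        }

        boolean possiblyExecuted() {
            return possiblyExecuted;
        }
    }

    /** Stand-in for a replication request that remembers whether it was marked as a retry. */
    static final class SketchRequest {
        private boolean retry;

        void onRetry() {
            retry = true;
        }

        boolean isRetry() {
            return retry;
        }
    }

    /** Marks the request as a retry unless the failure says the first attempt cannot have run. */
    static void prepareRetry(SketchRequest request, Exception failure) {
        boolean markAsRetry = true;
        if (failure instanceof RetryOnPrimaryException) {
            markAsRetry = ((RetryOnPrimaryException) failure).possiblyExecuted();
        }
        if (markAsRetry) {
            request.onRetry();
        }
    }

    public static void main(String[] args) {
        SketchRequest relocated = new SketchRequest();
        prepareRetry(relocated, new RetryOnPrimaryException("primary relocated before executing", false));
        System.out.println("relocated.isRetry() = " + relocated.isRetry());

        SketchRequest unknown = new SketchRequest();
        prepareRetry(unknown, new RetryOnPrimaryException("outcome unknown"));
        System.out.println("unknown.isRetry() = " + unknown.isRetry());
    }
}
```

Any failure other than the flagged exception still marks the request, which is the conservative default the description calls out.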

Assisted by Cursor

Closes #141586.
Relates ES-14121

pxsalehi · 2026-02-09T16:51:28Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

+                        if (isPrimaryAction
+                            && cause.getClass() == ReplicationOperation.RetryOnPrimaryException.class
+                            && request instanceof BulkShardRequest) {


I've included that last check (that the request is a bulk shard request) just to make this very specific. Otherwise, I think it is implied.

I'd like to omit that - or if we need it, make it a protected method that is overridden for bulk shard action.

elasticsearchmachine · 2026-02-09T17:28:34Z

Hi @pxsalehi, I've created a changelog YAML for you.


elasticsearchmachine · 2026-02-10T09:23:37Z

Pinging @elastic/es-distributed (Team:Distributed)

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java


pxsalehi · 2026-02-10T17:10:12Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

Or would it be more obviously correct if we changed the exception we throw here and checked for that exception down in this file (instead of RetryOnPrimaryException, and to distinguish between the cases in org.elasticsearch.action.support.replication.ReplicationOperation#onNoLongerPrimary)? This case, for example, doesn't necessarily mean that the operation didn't make any changes, but then again it also means the shard copy is not authoritative on what the latest state is, AFAIU.

I would find that easier to reason about. We may then incrementally move other exceptions over to be the new "do not mark retry" exception (better name needed).

@henningandersen do you have an objection to addressing only this specific case of RetryOnPrimaryException here and not the UnavailableShardsException? We won't change the reroute phase in this case. The latter is, I believe, a separate issue. That limits this PR to this one change, and we don't have to weed out the UnavailableShardsException cases, which I think are not the main issue. I can track that in a separate issue. At the very minimum, I think they can be two separate PRs and go in independently.

Yeah, I am good with not doing anything for unavailability, seems like a very different case. We are primarily after improving the happy path.

henningandersen

Left a few comments. I did not try to deeply investigate whether all RetryOnPrimaryException should disregard the retry, let me know if you think it is the right route to go.

henningandersen · 2026-02-10T18:30:00Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

+                        if (isPrimaryAction
+                            && cause.getClass() == ReplicationOperation.RetryOnPrimaryException.class
+                            && request instanceof BulkShardRequest) {


I'd like to omit that - or if we need it, make it a protected method that is overridden for bulk shard action.

henningandersen · 2026-02-10T18:31:38Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

I would find that easier to reason about. We may then incrementally move other exceptions over to be the new "do not mark retry" exception (better name needed).

henningandersen · 2026-02-10T18:33:37Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

            setPhase(task, "waiting_for_retry");
-            request.onRetry();
+            if (markRequestAsRetry) {
+                request.onRetry();


Maybe we need to clarify the behavior on ReplicationRequest.onRetry, i.e., that it may not be called if the action did not do any real work yet.

We could also consider adding the markRequestAsRetry (under another name) boolean to the method instead, mainly to make the contract clearer (same functionality).
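A minimal sketch of the second suggestion above, where the boolean is passed into the retry hook so that the contract is visible on the method itself. The parameter name `possiblyExecuted` follows the naming requested later in this review; `RequestSketch` and `BulkRequestSketch` are hypothetical simplifications, not the real `ReplicationRequest` hierarchy.

```java
// Hypothetical, simplified types; not the Elasticsearch classes.
final class OnRetryContractSketch {

    /** Base request: the hook is invoked before every retry and does nothing by default. */
    static class RequestSketch {
        void onRetry(boolean possiblyExecuted) {
            // no-op by default
        }
    }

    /** Bulk-like request that flags itself only when the earlier attempt may have executed. */
    static final class BulkRequestSketch extends RequestSketch {
        private boolean retry;

        @Override
        void onRetry(boolean possiblyExecuted) {
            if (possiblyExecuted) {
                retry = true;
            }
        }

        boolean isRetry() {
            return retry;
        }
    }

    public static void main(String[] args) {
        BulkRequestSketch handedOff = new BulkRequestSketch();
        handedOff.onRetry(false);
        System.out.println("after onRetry(false): isRetry = " + handedOff.isRetry());

        BulkRequestSketch maybeRan = new BulkRequestSketch();
        maybeRan.onRetry(true);
        System.out.println("after onRetry(true): isRetry = " + maybeRan.isRetry());
    }
}
```

With this shape the caller always invokes the hook, and each request type decides what a possibly executed earlier attempt means for it.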

henningandersen · 2026-02-10T18:47:37Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+        return List.of(MockTransportService.TestPlugin.class);
+    }
+
+    public void testPrimaryRelocationShouldNotMarkIndexRequestAsRetry() throws Exception {


Do we have the opposite test, i.e., that we still mark as retry when necessary?

There are also previous unit tests. I've also added and updated some in TransportReplicationActionTests. The change set touches fewer paths than the last one though (see here).

henningandersen · 2026-02-10T18:51:11Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+                bulk.add(prepareIndex("index1").setSource("text", randomAlphaOfLength(10)));
+            }
+            final var listener = bulk.execute();
+            continueRelocationLatch.countDown();


Should we wait for the bulk to arrive and wait for the permit - to ensure we are in the handoff case and not the reroute phase?

We should probably also have tests for the reroute phase though.

These are addressed now, I think.

henningandersen · 2026-02-10T18:54:06Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+            }
+            final var listener = bulk.execute();
+            continueRelocationLatch.countDown();
+            safeAwait(node1ForwardedTheBulk);


I am not sure waiting here does anything beyond waiting for the listener? We could assert that it got forwarded after the listener invocation instead, but we already check that it was received by node 2, which I think is enough.

henningandersen · 2026-02-10T19:02:25Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+            final var bulkResponse = listener.get();
+            assertThat(node2ReceivedTheBulk.get(), equalTo(true));
+            assertThat(bulkResponse.hasFailures(), equalTo(false));
+            assertThat(bulkResponse.getItems().length, equalTo(indexRequestsWithId + indexRequestsWithoutId));


Should we verify that the ids returned can be looked up? I imagine that if anything went wrong with the retry mechanism, i.e., we indeed did process the same auto-id insertion twice, the lookup might fail.

henningandersen · 2026-02-10T19:19:41Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

                    try {
                        // if we got disconnected from the node, or the node / shard is not in the right state (being closed)
                        final Throwable cause = exp.unwrapCause();
+                        boolean markAsRetry = true;


Can we make the calculation here a method instead? Together with a dedicated exception that seems like it could operate purely on the exception. It can have a default return true in TransportReplicationAction and be overridden in TransportShardBulkAction to ensure we only impact that.

It can also be called inline below instead then.
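A minimal sketch of the structure suggested in this comment: a decision method that looks only at the exception, returns true by default in the base action, and is overridden in the bulk action. All names here (`shouldMarkAsRetry`, `NotExecutedOnPrimaryException`, and the two action classes) are placeholders invented for illustration; the thread itself notes that the dedicated exception still needs a better name.

```java
// Hypothetical, simplified types; not the Elasticsearch classes.
final class RetryDecisionSketch {

    /** Placeholder for the dedicated "do not mark retry" exception; the name is made up. */
    static final class NotExecutedOnPrimaryException extends RuntimeException {
        NotExecutedOnPrimaryException(String message) {
            super(message);
        }
    }

    /** Base action: every retry is marked, matching the behaviour before the PR. */
    static class ReplicationActionSketch {
        protected boolean shouldMarkAsRetry(Throwable cause) {
            return true;
        }
    }

    /** Bulk action: skips the mark only for the dedicated exception. */
    static final class ShardBulkActionSketch extends ReplicationActionSketch {
        @Override
        protected boolean shouldMarkAsRetry(Throwable cause) {
            return (cause instanceof NotExecutedOnPrimaryException) == false;
        }
    }

    public static void main(String[] args) {
        Throwable handoff = new NotExecutedOnPrimaryException("primary relocated");
        System.out.println("base action marks retry: " + new ReplicationActionSketch().shouldMarkAsRetry(handoff));
        System.out.println("bulk action marks retry: " + new ShardBulkActionSketch().shouldMarkAsRetry(handoff));
    }
}
```

Keeping the default in the base class limits the behaviour change to the action that overrides it, which is the containment this comment asks for.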


pxsalehi

Thanks for the review @henningandersen. This is ready for another round.

pxsalehi · 2026-02-23T09:35:39Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+        return List.of(MockTransportService.TestPlugin.class);
+    }
+
+    public void testPrimaryRelocationShouldNotMarkIndexRequestAsRetry() throws Exception {


There are also previous unit tests. I've also added and updated some in TransportReplicationActionTests. The change set touches fewer paths than the last one though (see here).

pxsalehi · 2026-02-23T09:36:02Z

...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

+                bulk.add(prepareIndex("index1").setSource("text", randomAlphaOfLength(10)));
+            }
+            final var listener = bulk.execute();
+            continueRelocationLatch.countDown();


These are addressed now, I think.

pxsalehi · 2026-02-23T13:02:46Z

CI failures are in :x-pack:plugin:esql-datasource-ndjson:qa:javaRestTest and are not related.

henningandersen

LGTM.

henningandersen · 2026-02-23T13:40:09Z

server/src/main/java/org/elasticsearch/action/support/replication/ReplicationRequest.java

-     * the first time.
+     * Called before this replication request is retried.
+     * <p>
+     * {@code markAsRetry} controls whether request should be marked as retry or not. For some retry paths (for example


Can we name this similarly to the exception boolean, i.e., possiblyExecuted? I find that slightly better, since onRetry with a mark boolean seems slightly confusing.

henningandersen · 2026-02-23T13:40:27Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

                                exp
                            );
-                            retry(exp);
+                            boolean markAsRetry = true;


Can we keep the possiblyExecuted naming?

henningandersen · 2026-02-23T13:40:43Z

...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

+            retry(failure, true);
+        }
+
+        void retry(Exception failure, boolean markRequestAsRetry) {


Also possiblyExecuted naming here.


…ns (elastic#142157) I've added a flag to `RetryOnPrimaryException` which indicates whether the request which is to be retried might have ran or not. If true (default value and default behaviour before this PR) the request is marked as retry. If false, it is safe to run the request as if it is not a retry, and we skip marking it as a retry. This should impact only [one specific path](https://github.com/elastic/elasticsearch/blob/f8908057ea3e95374b720c2a5f70d2598220569e/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L493) of the retries in the TransportReplicationAction. Assisted by Cursor Closes elastic#141586. Relates ES-14121



Do not mark bulk indexing requests as retried after primary relocations

aa328cd

pxsalehi added the >enhancement and :Distributed/CRUD labels Feb 9, 2026

elasticsearchmachine added the v9.4.0 label Feb 9, 2026

pxsalehi commented Feb 9, 2026


Update docs/changelog/142157.yaml

855f196

pxsalehi and others added 3 commits February 9, 2026 18:29

index setting

49066bd

oops

dccebc9

Merge branch 'main' into ps260204-relocation-caused-forwarded-index-r…

cf2c39c

…eq-is-not-retry

pxsalehi marked this pull request as ready for review February 10, 2026 09:23

pxsalehi requested review from fcofdez and henningandersen February 10, 2026 09:23

elasticsearchmachine added the Team:Distributed label Feb 10, 2026

pxsalehi commented Feb 10, 2026


...rTest/java/org/elasticsearch/action/support/replication/RelocationCausedIndexingRetryIT.java

pxsalehi and others added 2 commits February 10, 2026 12:49

Do not mark as retry on UnavailableShardsException

57b6a2c

Merge branch 'main' into ps260204-relocation-caused-forwarded-index-r…

99aaaf7

…eq-is-not-retry

pxsalehi commented Feb 10, 2026


henningandersen reviewed Feb 10, 2026


pxsalehi and others added 10 commits February 11, 2026 12:00

Merge remote-tracking branch 'upstream/main' into ps260204-relocation…

47597d8

…-caused-forwarded-index-req-is-not-retry

Merge remote-tracking branch 'upstream/main' into ps260204-relocation…

7eb3cc6

…-caused-forwarded-index-req-is-not-retry

..wip

423ca09

oops

ee11898

[CI] Auto commit changes from spotless

233d63e

clean up

6617d0d

Merge remote-tracking branch 'refs/remotes/origin/ps260204-relocation…

ec573a9

…-caused-forwarded-index-req-is-not-retry' into ps260204-relocation-caused-forwarded-index-req-is-not-retry

more

5279c41

Merge remote-tracking branch 'upstream/main' into ps260204-relocation…

12f8ab6

…-caused-forwarded-index-req-is-not-retry

update tv

efb606a

pxsalehi commented Feb 23, 2026


pxsalehi requested a review from henningandersen February 23, 2026 09:41

henningandersen approved these changes Feb 23, 2026


pxsalehi added 2 commits February 23, 2026 17:35

rename

0da7d8c

Merge remote-tracking branch 'upstream/main' into ps260204-relocation…

06c32fb

…-caused-forwarded-index-req-is-not-retry

pxsalehi enabled auto-merge (squash) February 23, 2026 16:36

Merge branch 'main' into ps260204-relocation-caused-forwarded-index-r…

c2b3e00

…eq-is-not-retry

pxsalehi merged commit e0323eb into elastic:main Feb 23, 2026
35 checks passed


3 participants
