Prevent TransportReplicationAction from routing requests based on a stale local routing table #16274
Conversation
@bleskes instead of using the cluster state version, we could just as well use the index metadata version, which is updated whenever a new shard is started (thanks to active allocation ids). wdyt? On a related note, we could also use this field to wait for dynamic mapping updates to be applied (for that, the update mappings API would have to return the current index metadata version).
can we assert that the version is not set?
Got confused. It is actually likely that the previous routing node has an older cluster state, in which case we should override the existing value. Never mind.
@bleskes renamed the field and removed the integration test.
can we add some Javadocs here?
LGTM. Thanks @ywelsch - left some minor comments, no need for another cycle.
…cal routing table (#19296) Backport of #16274 to 2.4.0.
Relates to #12573
When relocating a primary shard, there is a cluster state update at the end of relocation in which the active primary is switched from the relocation source to the relocation target. If the relocation source receives and processes this cluster state before the relocation target does, there is a time span during which the relocation source believes the active primary to be on the relocation target, while the relocation target believes it to be on the relocation source. This results in index/delete/flush requests being bounced back and forth between the two nodes and can end in an OOM on both.
This PR adds a field to the index/delete/flush request that helps detect when the local routing information is stale. When such staleness is detected, the request is not rerouted immediately; instead, we wait until an up-to-date cluster state has been received before rerouting.
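The idea above can be sketched in a few lines. This is a hypothetical, simplified model, not the actual Elasticsearch classes: the request carries the highest cluster-state version it was routed on, and a node that receives a rerouted request with a higher version than its own local state knows its routing table is stale and should wait rather than bounce the request back. The field name `routedBasedOnClusterVersion` and both helper methods are illustrative assumptions.

```java
// Hypothetical sketch of the staleness check; not the real
// TransportReplicationAction implementation.
public class StaleRoutingDemo {

    public static class ReplicationRequest {
        // Highest cluster-state version used to route this request so far
        // (assumed field name, for illustration only).
        public long routedBasedOnClusterVersion = 0;
    }

    // Record the cluster-state version seen while routing, so the next hop
    // can detect that its own state is older.
    public static void markRouted(ReplicationRequest req, long localClusterStateVersion) {
        req.routedBasedOnClusterVersion =
            Math.max(req.routedBasedOnClusterVersion, localClusterStateVersion);
    }

    // True when the request was routed using a newer cluster state than the
    // local node has: wait for a cluster-state update instead of rerouting,
    // which would otherwise ping-pong the request between the two nodes.
    public static boolean shouldWaitForNewerState(ReplicationRequest req,
                                                  long localClusterStateVersion) {
        return req.routedBasedOnClusterVersion > localClusterStateVersion;
    }

    public static void main(String[] args) {
        ReplicationRequest req = new ReplicationRequest();
        markRouted(req, 7); // routed on a node whose cluster state is version 7
        System.out.println(shouldWaitForNewerState(req, 5)); // stale node: true
        System.out.println(shouldWaitForNewerState(req, 7)); // up to date: false
    }
}
```

With this check in place, a node holding an older cluster state blocks on a state update instead of sending the request straight back to the node it came from.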
I have included the test from #12574 in this PR to demonstrate the fix in an integration test; that integration test will not be part of the final commit, however.