
Retrying replication requests on replica doesn't call onRetry #21189

Merged: bleskes merged 3 commits into elastic:master from bleskes:retry_on_replica on Oct 31, 2016

Conversation

@bleskes (Contributor) commented Oct 30, 2016

A replication request may arrive at a replica before the replica's node has processed a required mapping update. In these cases the TransportReplicationAction will retry the request once a new cluster state arrives. Sadly, that retry logic failed to call ReplicationRequest#onRetry, causing duplicates in the append-only use case.

This PR fixes this and also the test which missed the check. I also added an assertion which would have helped find the source of the duplicates.

This was discovered by https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/174/

The test also surfaces an issue with mapping updates on the master (they are potentially performed on a live index :( ) but this will be fixed in another PR.

Relates #20211
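As a rough illustration, the retry path can be sketched like this in plain Java; `ReplicaAction` and its fields are hypothetical stand-ins for TransportReplicationAction's actual machinery, not the real code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the bug: a replica that cannot yet apply a request
// (stale cluster state) re-dispatches it, and must call onRetry() first so
// the engine can disable its append-only optimization for this request.
class ReplicationRequest {
    final AtomicBoolean retrySet = new AtomicBoolean(false);

    void onRetry() {
        retrySet.set(true);
    }
}

class ReplicaAction {
    boolean clusterStateReady = false;
    int applied = 0;

    void execute(ReplicationRequest request) {
        if (!clusterStateReady) {
            request.onRetry();          // the missing call this PR adds
            clusterStateReady = true;   // pretend the awaited cluster state arrived
            execute(request);           // re-dispatch the request
            return;
        }
        applied++;                      // apply the operation to the shard
    }
}
```

With the `onRetry()` call in place, the engine sees the retry flag set and can fall back to a deduplicating path instead of blindly appending.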

@s1monw (Contributor) commented Oct 30, 2016

bummer that we missed that one... glad we have it fixed now.

throw new AssertionError("doc [" + index.type() + "][" + index.id() + "] exists in version map (version " + versionValue + ")");
}
try (final Searcher searcher = acquireSearcher("assert doc doesn't exist")) {
final long docsWithId = searcher.reader().totalTermFreq(index.uid());
s1monw (Contributor) commented on the snippet above:
this is tricky: TTF will not respect deletes. I think you need to fetch the doc though; it's not much overhead compared to this.
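For background, here is a toy model in plain Java (not Lucene) of why the two counts can disagree: deletes in Lucene are tombstones until segments merge, so per-term statistics such as totalTermFreq still see deleted documents, while a searcher-level count filters them out:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy index: documents carry an id term; deletes only mark a tombstone,
// mirroring how Lucene keeps term statistics until segments merge.
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private final Set<Integer> deleted = new HashSet<>();

    void add(String id) {
        docs.add(id);
    }

    void delete(String id) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).equals(id)) deleted.add(i);
        }
    }

    // Analogue of a raw term statistic (totalTermFreq): ignores tombstones.
    long totalTermFreq(String id) {
        return docs.stream().filter(id::equals).count();
    }

    // Analogue of a searcher-level count of a term query: respects deletes.
    long liveCount(String id) {
        long n = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).equals(id) && !deleted.contains(i)) n++;
        }
        return n;
    }
}
```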

(TransportReplicationAction.ConcreteShardRequest<Request>) capturedRequest.request;
assertThat(concreteShardRequest.getRequest(), equalTo(request));
assertThat(concreteShardRequest.getRequest().isRetrySet.get(), equalTo(true));
assertThat(concreteShardRequest.getTargetAllocationID(),
s1monw (Contributor) commented on the snippet above:
👍

@s1monw (Contributor) commented Oct 30, 2016

LGTM except for the assert

throw new AssertionError("doc [" + index.type() + "][" + index.id() + "] exists in version map (version " + versionValue + ")");
}
try (final Searcher searcher = acquireSearcher("assert doc doesn't exist")) {
final long docsWithId = searcher.searcher().count(new TermQuery(index.uid()));
bleskes (Contributor, author) commented on the snippet above:
@s1monw like this?

s1monw (Contributor) replied:

yeah this looks better but I think we can still have a doc in the index when we haven't refreshed AND the doc is deleted? I think we should not check the searcher if versionValue.delete() == true?
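The shape of the hardened check can be sketched as follows; `VersionValue` here is a minimal hypothetical stand-in for the engine's version map entry, and `searcherCount` abstracts the `searcher().count(...)` call. The point is to trust a delete tombstone in the version map rather than the searcher, since an unrefreshed index may still contain the deleted doc:

```java
// Hypothetical stand-in for the engine's per-doc version map entry.
class VersionValue {
    private final boolean delete;

    VersionValue(boolean delete) {
        this.delete = delete;
    }

    boolean delete() {
        return delete;
    }
}

class AssertDocDoesNotExist {
    // Returns true when the "doc does not exist" assertion should pass.
    static boolean check(VersionValue versionValue, long searcherCount) {
        if (versionValue != null) {
            // Doc is tracked in the version map: a delete entry is fine
            // (the doc may still be visible until a refresh), while a live
            // entry means the doc exists and the assertion must fail.
            return versionValue.delete();
        }
        // No version map entry: fall back to the searcher, which respects
        // deletes, so a count of 0 means the doc truly does not exist.
        return searcherCount == 0;
    }
}
```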

bleskes (Contributor, author) replied:

@s1monw good one. I hardened the check

@s1monw (Contributor) commented Oct 31, 2016

LGTM

@s1monw (Contributor) commented Oct 31, 2016

test this please

@bleskes bleskes merged commit e7cfe10 into elastic:master Oct 31, 2016
@bleskes bleskes deleted the retry_on_replica branch October 31, 2016 12:43
@bleskes (Contributor, author) commented Oct 31, 2016

thx @s1monw . I'll give it a few hours on CI before back porting.

@s1monw (Contributor) commented Oct 31, 2016

++ thanks @bleskes

bleskes added a commit that referenced this pull request Nov 1, 2016
A replication request may arrive at a replica before the replica's node has processed a required mapping update. In these cases the TransportReplicationAction will retry the request once a new cluster state arrives. Sadly, that retry logic failed to call `ReplicationRequest#onRetry`, causing duplicates in the append-only use case.

This commit fixes this and also the test which missed the check. I also added an assertion which would have helped find the source of the duplicates.

This was discovered by https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/174/

Relates #20211
bleskes added a commit that referenced this pull request Nov 1, 2016
@bleskes (Contributor, author) commented Nov 1, 2016

This is now pushed to 5.0.1 & 5.1.0 as well

bleskes added a commit that referenced this pull request Nov 15, 2016
When processing a mapping update, the master currently creates an `IndexService` and uses its mapper service to do the hard work. However, if the master is also a data node and already has an instance of `IndexService`, we currently reuse the `MapperService` of that instance. Sadly, since mapping updates change the in-memory objects, a mapping change that is rejected later on during cluster state publishing still leaves a side effect on the index in question, bypassing the cluster state safety mechanism.

This commit removes this optimization and replaces the `IndexService` creation with a direct creation of a `MapperService`.

Also, this fixes an issue where multiple mapping updates from multiple shards for the same field caused unneeded cluster state publishing, as the current code always created a new cluster state.

These issues were discovered while researching #21189
bleskes added a commit that referenced this pull request Nov 16, 2016
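The side-effect problem described in that commit is a general one: merging a proposed change into a live, mutable object mutates it even when the change is later rejected. A minimal sketch of the fix's shape, using plain Java stand-ins rather than the actual MapperService code: validate against a throwaway copy and publish only on success:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index's field mappings.
class Mappings {
    final Map<String, String> fields = new HashMap<>();

    Mappings copy() {
        Mappings c = new Mappings();
        c.fields.putAll(fields);
        return c;
    }

    // Merging mutates this instance; an invalid update throws midway.
    void merge(Map<String, String> update) {
        for (Map.Entry<String, String> e : update.entrySet()) {
            String existing = fields.get(e.getKey());
            if (existing != null && !existing.equals(e.getValue())) {
                throw new IllegalArgumentException("conflicting type for " + e.getKey());
            }
            fields.put(e.getKey(), e.getValue());
        }
    }
}

class MappingUpdater {
    // Validate on a scratch copy, mirroring how the fix builds a fresh
    // MapperService instead of reusing the live index's instance: a
    // rejected update leaves the live mappings untouched.
    static void apply(Mappings live, Map<String, String> update) {
        Mappings scratch = live.copy();
        scratch.merge(update);               // may throw; live is untouched
        live.fields.putAll(scratch.fields);  // publish only after success
    }
}
```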
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Engine labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018

Labels

>bug critical :Distributed/Engine v5.0.1 v5.1.1 v6.0.0-alpha1
