Fail engine if hit document failure on replicas #43523
dnhatn merged 14 commits into elastic:master from
Conversation
We should not generate Noops for failed indexing operations on replicas or followers.
Pinging @elastic/es-distributed
ywelsch left a comment
This is a tricky PR. We want to make sure we're not recording an operation as failed in the translog when we fail to add it to Lucene on a replica. Instead, we let the failure bubble up to the primary so that it can fail the replica. We could also consider this as a fatal failure, and directly fail the shard once indexing into Lucene fails.
The case we also need to consider is when we replay from the translog to Lucene on recovery from store. Should we then also fail the primary if we fail to replay the operation? This could mean that the primary is unrecoverable, e.g. because of some incompatibility introduced during an upgrade. If we're lenient there, however, it brings the risk of primary and replica going out of sync (if we let the replica locally recover up to global checkpoint). Perhaps we could allow a way for the shard to be recovered with a force command, which changes the history uuid. I think we need a more comprehensive plan here.
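The decision being debated above can be captured as a small predicate. The following is a minimal, self-contained sketch; the names, the `Origin` values, and the choice to flag only replica-origin operations (matching the author's later "operations on replicas only" update in this thread) are assumptions modeled on the discussion, not the actual Elasticsearch code:

```java
// Toy sketch (not the real Elasticsearch implementation): decide whether a
// document-level indexing failure may be absorbed as a NoOp on this shard,
// or must instead be treated as tragic and fail the engine.
enum Origin { PRIMARY, REPLICA, PEER_RECOVERY, LOCAL_TRANSLOG_RECOVERY }

final class DocumentFailurePolicy {
    // On a replica, recording the failure as a NoOp would let primary and
    // replica histories diverge, so the failure is treated as tragic.
    // Whether translog-replay origins should also be included was left
    // open in the discussion, so they are out of scope here.
    static boolean treatDocumentFailureAsTragicError(Origin origin) {
        return origin == Origin.REPLICA;
    }
}
```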
@ywelsch I've updated this PR to proceed with operations on replicas only. Can you please take a look? Thank you!
```diff
     return new IndexResult(plan.versionForIndexing, index.primaryTerm(), index.seqNo(), plan.currentNotFoundOrDeleted);
 } catch (Exception ex) {
-    if (indexWriter.getTragicException() == null) {
+    if (treatDocumentFailureAsTragicError(index) == false && indexWriter.getTragicException() == null) {
```
Should we treat AlreadyClosedException specially here as well (same as when we index a deletion or noop tombstone)?
We should not have special treatment for AlreadyClosedException here. If the engine was failed and closed by another thread, it's perfectly fine to bubble up the AlreadyClosedException. In fact, we should bubble it up so we can detect situations where the engine is in a buggy state.
However, we probably should call maybeFailEngine instead of failEngine if the exception is an AlreadyClosedException, to avoid an unnecessary warning log if the engine was failed already.
I think we should not try to wrap an AlreadyClosedException into an IndexResult, as we might then write it to the translog during closing.
```diff
 try {
-    maybeFailEngine("index", e);
+    if (treatDocumentFailureAsTragicError(index)) {
+        failEngine("index", e);
```
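Putting the two diff hunks and the review feedback together, the intended behaviour can be modeled by a self-contained toy class. The names mirror the snippets above but everything here is an illustrative assumption, not the actual Elasticsearch code; note the failure reason includes the document id, as requested in this review:

```java
// Toy model of the catch-block behaviour under review: a document failure
// on a replica fails the whole engine (with the document id in the reason
// string), while on a primary it is absorbed and reported per-document.
// Not the real Elasticsearch implementation.
final class EngineSketch {
    enum Origin { PRIMARY, REPLICA }

    boolean failed = false;
    String failureReason = null;

    boolean treatDocumentFailureAsTragicError(Origin origin) {
        return origin == Origin.REPLICA;
    }

    void failEngine(String reason, Exception cause) {
        failed = true;
        failureReason = reason;
    }

    /** Returns true if the failure was absorbed as a per-document result. */
    boolean onDocumentFailure(Origin origin, String docId, Exception e) {
        if (treatDocumentFailureAsTragicError(origin)) {
            // Include the document id so the failure is easier to diagnose.
            failEngine("index id[" + docId + "] origin[" + origin + "]", e);
            return false;
        }
        return true; // primary: record the failure in the IndexResult instead
    }
}
```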
Can we add more info about the document to the "reason" string?
I meant some info about the document itself, e.g. the id of the document (this could help in figuring out why the given failure happened).
Thanks @ywelsch.
Backport of elastic/elasticsearch#43523 (cherry picked from commit 9929cb2)
Conflicts:
- blackbox/docs/appendices/release-notes/unreleased.rst
- es/es-server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
Backport of elastic/elasticsearch#43523 (cherry picked from commit 9929cb2)
An indexing operation on a replica should never fail after it was successfully indexed on the primary. Hence, we should fail the engine if we hit any failure (document-level or tragic) while processing an indexing operation on a replica.
Relates #43228
Closes #40435 (see #40435 (comment)).