Close translog view after primary-replica resync by ywelsch · Pull Request #25862 · elastic/elasticsearch

ywelsch · 2017-07-24T13:13:48Z

The translog view was being closed too early, possibly causing a failed resync. Note: The bug only affects unreleased code.

Relates to #24841.


bleskes

Thanks @ywelsch

bleskes · 2017-07-27T12:03:02Z

core/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java

+                    if (state == IndexShardState.CLOSED) {
+                        throw new IndexShardClosedException(shardId);
+                    } else {
+                        assert state == IndexShardState.STARTED : "resync should only happen on a started shard";


++. Nit: please add the state to the message.
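A minimal sketch of what the nit asks for, assuming the only change is appending the observed state to the assertion message. `ShardState`, `ResyncStateCheck`, and `ensureCanResync` are hypothetical stand-ins, not the Elasticsearch classes, and `IllegalStateException` stands in for `IndexShardClosedException`.

```java
// Simplified stand-in for the shard state enum named in the diff (not the real class).
enum ShardState { RECOVERING, STARTED, CLOSED }

final class ResyncStateCheck {
    // Builds the assertion message so that it names the unexpected state.
    static String unexpectedStateMessage(ShardState state) {
        return "resync should only happen on a started shard, but state was: " + state;
    }

    // Mirrors the shape of the check in the diff: closed shards fail with an
    // exception, any other non-started state trips the assertion (with -ea).
    static void ensureCanResync(ShardState state) {
        if (state == ShardState.CLOSED) {
            // Stand-in for IndexShardClosedException in the real code.
            throw new IllegalStateException("shard is closed");
        }
        assert state == ShardState.STARTED : unexpectedStateMessage(state);
    }

    public static void main(String[] args) {
        ensureCanResync(ShardState.STARTED);
        System.out.println(unexpectedStateMessage(ShardState.RECOVERING));
    }
}
```

With the state in the message, a tripped assertion in CI shows which state the shard was actually in, without needing to reproduce the failure.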

bleskes · 2017-07-27T12:08:35Z

core/src/test/java/org/elasticsearch/index/shard/PrimaryReplicaSyncerTests.java

+            }
+        });
+        if (randomBoolean()) {
+            assertBusy(() -> assertTrue("Sync action was not called", syncActionCalled.get()));


nit: this pains me :) can we use a latch?
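For illustration, a sketch of the latch approach using only `java.util.concurrent`. The class and method names are hypothetical, and the executor task stands in for the sync action callback; this is not the code from the test in this PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LatchInsteadOfPolling {
    // Returns true if the (stand-in) sync action ran before the timeout.
    static boolean awaitSyncAction(ExecutorService executor, long timeoutMillis) {
        CountDownLatch syncActionCalled = new CountDownLatch(1);
        // Stand-in for the asynchronous sync action: it signals the latch once.
        executor.execute(syncActionCalled::countDown);
        try {
            // Blocks until countDown() is called or the timeout elapses,
            // instead of repeatedly polling a boolean flag.
            return syncActionCalled.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            System.out.println("sync action called: " + awaitSyncAction(executor, 5_000));
        } finally {
            executor.shutdown();
        }
    }
}
```

Compared with polling a flag through `assertBusy`, the waiting thread wakes up as soon as the callback fires, and a timeout clearly means the callback never ran.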

ywelsch · 2017-07-27T12:37:01Z

Thanks @bleskes


During peer recoveries, we need to copy over Lucene files and replay the operations they are missing from the source translog. Guaranteeing that translog files are not cleaned up in the meantime has seen many iterations over time. Back in the old 1.0 days, recoveries went through the Engine and actively prevented both translog cleaning and Lucene commits. We then moved to a notion called Translog Views, which allowed the recovery code to "acquire" a view into the translog, which is then guaranteed to be kept around until the view is closed. The Engine code was free to commit Lucene and do whatever it wanted without coordinating with recoveries. Translog file deletion logic was based on reference counting at the file level. Those counters were incremented when a view was acquired, but also when the view was used to create a `Snapshot` that allowed you to read operations from the files.

At some point we removed the file-based counting complexity in favor of constructs at the Translog level that just keep track of "open" views and the minimum translog generation they refer to. To do so, Views had to be kept around until the last snapshot that was made from them was consumed. This was fine in recovery code but led to [a subtle bug](#25862) in the [Primary Replica Resyncer](#25862). Concurrently, we have developed the notion of a `TranslogDeletionPolicy`, which is responsible for the liveness aspect of translog files. This class makes it very simple to take translog Snapshots into account when keeping translog files around, so code that just needs a snapshot can take one without worrying about views and such. Recovery code that actually does need a view can now prevent trimming by acquiring a simple retention lock (a `Closeable`). This removes the need for the notion of a View.
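To make the retention-lock idea concrete, here is a rough sketch of a deletion policy that counts locks per translog generation and reports the minimum generation that must be kept. All names (`SketchDeletionPolicy`, `RetentionLock`, `acquireRetentionLock`, `setCommittedGeneration`, `minGenerationRequired`) are assumptions for illustration; this is not the actual `TranslogDeletionPolicy` API.

```java
import java.io.Closeable;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical lock type: a Closeable whose close() throws no checked exception.
interface RetentionLock extends Closeable {
    @Override
    void close();
}

final class SketchDeletionPolicy {
    // Number of outstanding locks per translog generation.
    private final TreeMap<Long, Integer> locksByGeneration = new TreeMap<>();
    // Assumed input: the oldest generation the last commit still needs.
    private long committedGeneration = 1;

    synchronized void setCommittedGeneration(long generation) {
        committedGeneration = generation;
    }

    // Pins the given generation (and everything newer) until the lock is closed.
    synchronized RetentionLock acquireRetentionLock(long generation) {
        locksByGeneration.merge(generation, 1, Integer::sum);
        AtomicBoolean closed = new AtomicBoolean();
        return () -> {
            if (closed.compareAndSet(false, true)) { // closing twice is a no-op
                release(generation);
            }
        };
    }

    private synchronized void release(long generation) {
        // Drop the entry when the last lock on this generation goes away.
        locksByGeneration.computeIfPresent(generation, (g, n) -> n == 1 ? null : n - 1);
    }

    // Files below this generation may be trimmed.
    synchronized long minGenerationRequired() {
        long lockedMin = locksByGeneration.isEmpty() ? Long.MAX_VALUE : locksByGeneration.firstKey();
        return Math.min(lockedMin, committedGeneration);
    }

    public static void main(String[] args) {
        SketchDeletionPolicy policy = new SketchDeletionPolicy();
        policy.setCommittedGeneration(5);
        try (RetentionLock lock = policy.acquireRetentionLock(2)) {
            System.out.println("while locked: " + policy.minGenerationRequired());
        }
        System.out.println("after release: " + policy.minGenerationRequired());
    }
}
```

Because the lock is just a closeable handle, its lifetime can be tied to exactly the scope that reads from the translog, which is the property the early view close in this PR violated.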

Close translog view after primary-replica resync

1bf1cf7

The translog view was being closed too early, possibly causing a failed resync

ywelsch added the :Distributed/Recovery, >bug, and v6.0.0 labels Jul 24, 2017

ywelsch requested a review from bleskes July 24, 2017 13:13

ywelsch added 2 commits July 27, 2017 11:51

Merge remote-tracking branch 'elastic/master' into fix/close-translog…

92913c8

…-view-after-resync

test and exception handling

5461a0b

bleskes approved these changes Jul 27, 2017


assertBusy -> latch

0585c56

ywelsch added v6.1.0 v7.0.0 labels Jul 27, 2017

ywelsch merged commit 020ba41 into elastic:master Jul 27, 2017

ywelsch added a commit that referenced this pull request Jul 27, 2017

Close translog view after primary-replica resync (#25862)

d6f5337


ywelsch added a commit that referenced this pull request Jul 27, 2017

Close translog view after primary-replica resync (#25862)

8b3cf33


bleskes mentioned this pull request Jul 30, 2017

Goodbye, Translog Views #25962

Merged

colings86 added v6.0.0-beta1 and removed v6.0.0 labels Jul 31, 2017

lcawl removed the v6.1.0 label Dec 12, 2017

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

ywelsch merged 4 commits into elastic:master from ywelsch:fix/close-translog-view-after-resync
