Only retain reasonable history for peer recoveries #45208
Conversation
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable.

Relates elastic#41536
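The 10% rule described above can be modelled as a small sketch. This is a hypothetical illustration, not the PR's actual API: the class and method names are invented, but the decision rule matches the description (an operations-based recovery is reasonable iff it would replay at most the threshold proportion of the documents in the safe commit).

```java
// Hypothetical model of the "reasonable recovery" rule described above;
// names do not match the actual Elasticsearch classes.
public class ReasonableRecovery {

    // Default threshold: replaying at most 10% of the documents in the index.
    public static final double FILE_BASED_THRESHOLD = 0.1;

    /**
     * An operations-based recovery is reasonable iff the number of operations
     * it would replay is at most the threshold proportion of the documents
     * (including deleted documents) in the safe commit.
     */
    public static boolean isOpsBasedRecoveryReasonable(long operationsToReplay, long docCountOfSafeCommit) {
        return operationsToReplay <= FILE_BASED_THRESHOLD * docCountOfSafeCommit;
    }

    public static void main(String[] args) {
        // 50 ops against a 1000-doc index: replaying 5% is reasonable.
        System.out.println(isOpsBasedRecoveryReasonable(50, 1000));  // true
        // 500 ops against a 1000-doc index: prefer a file-based recovery.
        System.out.println(isOpsBasedRecoveryReasonable(500, 1000)); // false
    }
}
```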
Pinging @elastic/es-distributed
dnhatn
left a comment
Looks good. I left some comments.
    * Defaults to retaining history for up to 10% of the documents in the shard. This can only be changed in tests, since this setting is
    * intentionally unregistered.
    */
    public static final Setting<Double> REASONABLE_OPERATIONS_BASED_RECOVERY_PROPORTION_SETTING
Should we put this setting in IndexSettings.java instead? Also, if this setting is dynamic, we should invalidate the cached value when it is updated.
The Property.Dynamic property was left over from an earlier iteration, removed in f806892, thanks. Yes, you're right: if we were consuming updates then we should also be invalidating the cache.
I'm not sure where we should declare this setting. It's basically a constant (except for tests) and this is where I'd naturally declare a constant.
Moved into IndexSettings in c220530, since that's about the most sensible place for it in light of that change.
    public synchronized long getAsLong() {
        try (Engine.IndexCommitRef safeCommitRef = getEngine().acquireSafeIndexCommit()) {
            final IndexCommit safeCommit = safeCommitRef.getIndexCommit();
            final long generation = safeCommit.getGeneration();
Perhaps use SegmentInfos#getId instead of the generation of the commit as the cache key. We can have the same generation for different index commits if we perform a file-based recovery.
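The caching hazard dnhatn points out — a commit generation can repeat after a file-based recovery, so it is not a safe cache key — can be sketched with a dependency-free model. `CommitIdSource` is a hypothetical stand-in for reading `SegmentInfos#getId` from the safe commit; it is not an Elasticsearch or Lucene type.

```java
import java.util.Objects;

// Sketch of caching an expensive per-commit value keyed by the commit's
// unique id rather than its generation. CommitIdSource is a hypothetical
// stand-in for reading SegmentInfos#getId from the safe commit.
public class DocCountCache {

    public interface CommitIdSource {
        String currentCommitId(); // stand-in for SegmentInfos#getId
        long computeDocCount();   // the expensive computation to cache
    }

    private String cachedCommitId;
    private long cachedDocCount;

    public synchronized long getDocCount(CommitIdSource source) {
        final String commitId = source.currentCommitId();
        if (!Objects.equals(commitId, cachedCommitId)) {
            cachedDocCount = source.computeDocCount(); // recompute only for a new commit
            cachedCommitId = commitId;
        }
        return cachedDocCount;
    }
}
```

Keying by the unique id means two distinct commits that happen to share a generation can never alias each other in the cache.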
    @Override
    public synchronized long getAsLong() {
        try (Engine.IndexCommitRef safeCommitRef = getEngine().acquireSafeIndexCommit()) {
Releasing a commit is quite expensive. Maybe expose a new method that returns the safe commit without acquiring. I think it's fine to use the last commit (we already exposed it) instead of the safe commit here.
Please ignore this as it is not correct. We won't revisit the deletion policy on releasing a commit unless the safe commit advances.
I too am a little uneasy with the costs of this. With this implementation each call to getAsLong() is synchronized and obtains a new commit ref, which might occasionally be expensive to release. It only does this on replication groups that aren't fully-assigned, of course. An alternative would be to recalculate this while advancing the safe commit, then this could just become a volatile read. I couldn't determine a neat way to hook this calculation in the right place. Any ideas?
I tried moving the calculation to CombinedDeletionPolicy#updateRetentionPolicy in c220530. I think it makes sense there. WDYT?
    final long localCheckpoint = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
    final long totalDocs = StreamSupport.stream(
Maybe use Lucene.readSegmentInfos(safeCommit).totalMaxDoc()?
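Lucene's `SegmentInfos#totalMaxDoc` sums `maxDoc` (live plus deleted documents) across all segments of a commit, which is exactly the "number of documents including deleted documents" the PR description talks about. A dependency-free model of that sum, with a hypothetical `Segment` record standing in for Lucene's per-segment metadata:

```java
import java.util.List;

// Dependency-free model of what Lucene's SegmentInfos#totalMaxDoc computes:
// the sum of maxDoc (live plus deleted documents) over all segments of a commit.
public class TotalMaxDoc {

    // Hypothetical stand-in for a segment's metadata.
    public record Segment(String name, int maxDoc) {}

    public static long totalMaxDoc(List<Segment> segments) {
        return segments.stream().mapToLong(Segment::maxDoc).sum();
    }
}
```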
    final long localCheckpointOfSafeCommit = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
    softDeletesPolicy.setLocalCheckpointOfSafeCommit(localCheckpointOfSafeCommit);
    final long docCountOfSafeCommit = getDocCountOfSafeCommit();
Drive-by comment: I think this will do IO while holding the monitor, which would prevent concurrent acquireIndexCommit and releaseIndexCommit calls. A quick scrape of usages did not find any where this would hurt, but I would still suggest doing the calculation of minimumReasonableRetainedSeqNo outside the monitor.
Yes, it will. We're already in a method that throws IOException, but you're right, it doesn't look like it really does much IO as it currently stands. Moved outside the mutex in 5860c41.
@elasticmachine please test this
ywelsch
left a comment
I've left mainly comments on naming and code structure, looking good o.w.
server/src/main/java/org/elasticsearch/index/IndexSettings.java
    * intentionally unregistered.
    */
    public static final Setting<Double> REASONABLE_OPERATIONS_BASED_RECOVERY_PROPORTION_SETTING
        = Setting.doubleSetting("index.recovery.reasonable_operations_based_recovery_proportion", 0.1, 0.0, Setting.Property.IndexScope);
I wonder if this should be called index.recovery.full_recovery_threshold or index.recovery.file_based_threshold.
It's essentially a threshold beyond which full file-based recoveries are preferred.
Same goes for the variable names throughout the rest of the PR
Ok, changed to index.recovery.file_based_threshold in 13a4c38.
server/src/main/java/org/elasticsearch/index/engine/ReadOnlyEngine.java
    final IndexCommit safeCommit = updateCommitsAndRetentionPolicy(commits);
    final long localCheckpointOfSafeCommit = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
    final long docCountOfSafeCommit = getDocCountOfSafeCommit();
    minimumReasonableRetainedSeqNo = localCheckpointOfSafeCommit + 1
what was the reason to compute this value here and not in ReplicationTracker? This class could e.g. return a pre-computed safeCommitInfo that contains local checkpoint + number of docs.
The actual computation could then be done in replication tracker, where I think that logic fits more naturally (and avoids the slightly odd minimumReasonableRetainedSeqNo naming that we have here).
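The split ywelsch proposes — the deletion policy publishes a pre-computed carrier with the safe commit's local checkpoint and doc count, and the replication tracker applies the threshold — can be sketched as follows. The names and the exact method shape here are illustrative, not the PR's final code.

```java
// Illustrative sketch of the suggested split: a pre-computed SafeCommitInfo
// carrier, with the threshold logic living in the replication tracker.
// Names are not the actual Elasticsearch classes' final shape.
public class SafeCommitInfoSketch {

    public record SafeCommitInfo(long localCheckpoint, long docCount) {
        public static final SafeCommitInfo EMPTY = new SafeCommitInfo(-1, 0);
    }

    /**
     * Decide whether a lease retaining history from retainedSeqNo is worth
     * keeping: the operations it retains below the safe commit's local
     * checkpoint must not exceed the threshold proportion of the documents.
     */
    public static boolean isLeaseWorthKeeping(SafeCommitInfo info, long retainedSeqNo, double fileBasedThreshold) {
        final long operationsRetained = info.localCheckpoint() + 1 - retainedSeqNo;
        return operationsRetained <= fileBasedThreshold * info.docCount();
    }
}
```

With this shape the "slightly odd" minimumReasonableRetainedSeqNo name disappears: the tracker simply compares retained operations against the threshold.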
It didn't really seem worth the extra plumbing to move this single expression, particularly since CombinedDeletionPolicy also has some bearing on history retention. It didn't seem any more natural to have it in ReplicationTracker. But I've done this in 13a4c38.
@elasticmachine please run elasticsearch-ci/packaging-sample
    private volatile IndexCommit safeCommit; // the most recent safe commit point: its max_seqno is at most the persisted global checkpoint
    private volatile IndexCommit lastCommit; // the most recent commit point
    private volatile SafeCommitInfo safeCommitInfo = SafeCommitInfo.EMPTY;
    private final Object onCommitMutex = new Object();
Do we really need this separate mutex?
I am not sure, because it depends on whether onCommit is called concurrently or not. Is it? If it is then we could end up with stale data in safeCommitInfo.
Before this change, onCommit was a synchronized method. Is it okay if we leave it as is?
Henning asked to avoid running the IO needed to compute the safeCommitInfo under this mutex here: #45208 (comment).
Ah, ok. Although it's not an issue, I think we should avoid having two lock orderings: this -> onCommitMutex in onInit and onCommitMutex -> this in onCommit. How about implementing onCommit in two steps using a single mutex?

    final IndexCommit safeCommit;
    synchronized (this) {
        // update the policy
        safeCommit = this.safeCommit;
    }
    final SafeCommitInfo safeCommitInfo = ..
    synchronized (this) {
        if (safeCommit == this.safeCommit) {
            this.safeCommitInfo = safeCommitInfo;
        }
    }
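The two-step pattern suggested above — read the current safe commit under the lock, run the expensive computation outside the lock, then publish the result under the lock only if the safe commit is unchanged — can be fleshed out as a self-contained sketch. The `Commit` and `Info` types are stand-ins for the real IndexCommit and SafeCommitInfo.

```java
// Self-contained sketch of the two-step publish pattern: expensive work
// happens outside the lock, and a stale result is silently discarded if a
// newer safe commit arrived meanwhile. Types are stand-ins for the actual
// Elasticsearch classes.
public class TwoStepPublish {

    public record Commit(long generation) {}
    public record Info(long docCount) {}

    private Commit safeCommit;
    private volatile Info safeCommitInfo = new Info(0);

    public void onCommit(Commit newSafeCommit, java.util.function.Function<Commit, Info> expensiveCompute) {
        final Commit commit;
        synchronized (this) {
            this.safeCommit = newSafeCommit; // step 1: update the policy under the lock
            commit = this.safeCommit;
        }
        final Info info = expensiveCompute.apply(commit); // IO happens outside the lock
        synchronized (this) {
            if (commit == this.safeCommit) { // step 2: publish only if still current
                this.safeCommitInfo = info;
            }
        }
    }

    public Info getSafeCommitInfo() {
        return safeCommitInfo; // plain volatile read on the hot path
    }
}
```

Using a single mutex (`this`) in both steps avoids the two lock orderings dnhatn flags, and the identity check makes a late-arriving stale computation harmless.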
dnhatn
left a comment
LGTM. I left a comment around the mutex in the deletion policy for discussion.
Ah yes, you're right, I missed that.
@elasticmachine please run elasticsearch-ci/default-distro
ywelsch
left a comment
Left one more comment, o.w. looking good
server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java
henningandersen
left a comment
Left a couple of comments, looking good otherwise.
server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java
server/src/test/java/org/elasticsearch/indices/recovery/IndexRecoveryIT.java
@elasticmachine please run elasticsearch-ci/1
Merging + back-porting this for @DaveCTurner as discussed with him.