
Only retain reasonable history for peer recoveries #45208

Merged
original-brownbear merged 24 commits into elastic:master from DaveCTurner:2019-08-05-avoid-unreasonable-ops-based-recoveries
Aug 8, 2019

Conversation

@DaveCTurner
Member

Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates #41536
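The 10% rule described above can be sketched as a standalone predicate. This is an illustrative sketch only; the class and method names are hypothetical, not the PR's actual code:

```java
// Illustrative sketch of the "reasonable ops-based recovery" check described
// above. Names are hypothetical, not the PR's actual code.
final class RecoveryReasonableness {

    // Mirrors the PR's default threshold: replaying at most 10% of the
    // documents in the safe commit counts as reasonable.
    static final double FILE_BASED_THRESHOLD = 0.1;

    /**
     * @param docCountOfSafeCommit        doc count (including deleted docs) of the safe commit
     * @param localCheckpointOfSafeCommit local checkpoint recorded in the safe commit
     * @param retainedSeqNo               lowest sequence number the lease retains
     */
    static boolean isOpsBasedRecoveryReasonable(long docCountOfSafeCommit,
                                                long localCheckpointOfSafeCommit,
                                                long retainedSeqNo) {
        // Operations the lease retains at or below the safe commit's local checkpoint.
        final long opsToReplay = Math.max(0, localCheckpointOfSafeCommit + 1 - retainedSeqNo);
        return opsToReplay <= FILE_BASED_THRESHOLD * docCountOfSafeCommit;
    }
}
```

For a safe commit of 1,000,000 documents, a lease retaining 50,000 operations below the local checkpoint passes the check, while one retaining 200,000 does not.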

@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.4.0 labels Aug 5, 2019
@DaveCTurner DaveCTurner requested review from dnhatn and ywelsch August 5, 2019 18:04
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@dnhatn dnhatn left a comment


Looks good. I left some comments.

/**
 * Defaults to retaining history for up to 10% of the documents in the shard. This can only be changed in tests, since this setting is
 * intentionally unregistered.
 */
public static final Setting<Double> REASONABLE_OPERATIONS_BASED_RECOVERY_PROPORTION_SETTING
    = Setting.doubleSetting("index.recovery.reasonable_operations_based_recovery_proportion", 0.1, 0.0, Setting.Property.IndexScope);
Member


Should we put this setting in IndexSettings.java instead? Also, if this setting is dynamic, we should invalidate the cached value when it is updated.

Member Author


The Property.Dynamic property was left over from an earlier iteration, removed in f806892, thanks. Yes, you're right: if we were consuming updates then we should also be invalidating the cache.

I'm not sure where we should declare this setting. It's basically a constant (except for tests) and this is where I'd naturally declare a constant.

Member Author


Moved into IndexSettings in c220530 since that's about the most sensible place for it in the light of that change.

public synchronized long getAsLong() {
try (Engine.IndexCommitRef safeCommitRef = getEngine().acquireSafeIndexCommit()) {
final IndexCommit safeCommit = safeCommitRef.getIndexCommit();
final long generation = safeCommit.getGeneration();
Member


Perhaps use SegmentInfos#getId instead of the generation of the commit as the cache key. We can have the same generation for different index commits if we perform a file-based recovery.

Member Author


Yes, 763a944.


@Override
public synchronized long getAsLong() {
try (Engine.IndexCommitRef safeCommitRef = getEngine().acquireSafeIndexCommit()) {
Member


Releasing a commit is quite expensive. Maybe expose a new method that returns the safe commit without acquiring. I think it's fine to use the last commit (we already exposed it) instead of the safe commit here.

Member


Please ignore this as it is not correct. We won't revisit the deletion policy on releasing a commit unless the safe commit advances.

Member Author


I too am a little uneasy about the costs of this. With this implementation each call to getAsLong() is synchronized and obtains a new commit ref, which might occasionally be expensive to release. It only does this on replication groups that aren't fully assigned, of course. An alternative would be to recalculate this while advancing the safe commit, so this could just become a volatile read, but I couldn't find a neat way to hook this calculation in at the right place. Any ideas?

Member Author


I tried moving the calculation to CombinedDeletionPolicy#updateRetentionPolicy in c220530. I think it makes sense there. WDYT?


final long localCheckpoint = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));

final long totalDocs = StreamSupport.stream(
Member


Maybe use Lucene.readSegmentInfos(safeCommit).totalMaxDoc()?

Member Author


Ah thanks, 9cd3c80.
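For context, SegmentInfos.totalMaxDoc() sums each segment's maxDoc, which counts live and deleted documents alike; that is the "size" measure this PR uses. A toy model of that sum, with hypothetical stand-in types rather than Lucene's:

```java
import java.util.List;

// Toy model of SegmentInfos.totalMaxDoc(): the index "size" used here counts
// all documents, including deleted ones. Segment is a hypothetical stand-in
// for Lucene's per-segment metadata, not the real type.
final class ToySegments {
    record Segment(int maxDoc) {} // maxDoc = live docs + deleted docs

    static long totalMaxDoc(List<Segment> segments) {
        return segments.stream().mapToLong(Segment::maxDoc).sum();
    }
}
```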

@DaveCTurner DaveCTurner requested a review from dnhatn August 6, 2019 08:09
final long localCheckpointOfSafeCommit = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
softDeletesPolicy.setLocalCheckpointOfSafeCommit(localCheckpointOfSafeCommit);

final long docCountOfSafeCommit = getDocCountOfSafeCommit();
Contributor


Drive-by comment: I think this will do IO while holding the monitor, which would then prevent concurrent acquireIndexCommit and releaseIndexCommit calls. A quick scrape of usages did not find any where this would hurt, but I would suggest doing the calculation of minimumReasonableRetainedSeqNo outside the monitor anyway.

Member Author


Yes, it will. We're already in a method that throws IOException, but you're right, it doesn't look like it really does much IO as it currently stands. Moved outside the mutex in 5860c41.

@DaveCTurner
Member Author

@elasticmachine please test this

Contributor

@ywelsch ywelsch left a comment


I've left mainly comments on naming and code structure, looking good o.w.

/**
 * Defaults to retaining history for up to 10% of the documents in the shard. This can only be changed in tests, since this setting is
 * intentionally unregistered.
 */
public static final Setting<Double> REASONABLE_OPERATIONS_BASED_RECOVERY_PROPORTION_SETTING
    = Setting.doubleSetting("index.recovery.reasonable_operations_based_recovery_proportion", 0.1, 0.0, Setting.Property.IndexScope);
Contributor


I wonder if this should be called index.recovery.full_recovery_threshold or index.recovery.file_based_threshold.
It's essentially a threshold beyond which full file-based recoveries are preferred.
Same goes for the variable names throughout the rest of the PR

Member Author


Ok, changed to index.recovery.file_based_threshold in 13a4c38.

final IndexCommit safeCommit = updateCommitsAndRetentionPolicy(commits);
final long localCheckpointOfSafeCommit = Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
final long docCountOfSafeCommit = getDocCountOfSafeCommit();
minimumReasonableRetainedSeqNo = localCheckpointOfSafeCommit + 1
Contributor


what was the reason to compute this value here and not in ReplicationTracker? This class could e.g. return a pre-computed safeCommitInfo that contains local checkpoint + number of docs.
The actual computation could then be done in replication tracker, where I think that logic fits more naturally (and avoids the slightly odd minimumReasonableRetainedSeqNo naming that we have here).

Member Author


It didn't really seem worth the extra plumbing to move this single expression, particularly since CombinedDeletionPolicy also has some bearing on history retention. It didn't seem any more natural to have it in ReplicationTracker. But I've done this in 13a4c38.

@DaveCTurner DaveCTurner requested a review from ywelsch August 6, 2019 14:08
@DaveCTurner
Member Author

@elasticmachine please run elasticsearch-ci/packaging-sample

private volatile IndexCommit safeCommit; // the most recent safe commit point - its max_seqno is at most the persisted global checkpoint.
private volatile IndexCommit lastCommit; // the most recent commit point
private volatile SafeCommitInfo safeCommitInfo = SafeCommitInfo.EMPTY;
private final Object onCommitMutex = new Object();
Member


Do we really need this separate mutex?

Member Author


I am not sure, because it depends on whether onCommit is called concurrently or not. Is it? If it is then we could end up with stale data in safeCommitInfo.

Member


Before this change, onCommit was a synchronized method. Is it okay if we leave it as is?

Member Author


Henning asked to avoid running the IO needed to compute the safeCommitInfo under this mutex here: #45208 (comment).

Member


Ah, ok. Although it's not an issue, I think we should avoid having two lock orderings:
this -> onCommitMutex in onInit and onCommitMutex -> this in onCommit. How about implementing onCommit in two steps using a single mutex:

final IndexCommit safeCommit;
synchronized (this) {
    // update the policy
    safeCommit = this.safeCommit;
}
final SafeCommitInfo safeCommitInfo = ...;
synchronized (this) {
    if (safeCommit == this.safeCommit) {
        this.safeCommitInfo = safeCommitInfo;
    }
}
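The two-step shape suggested here (snapshot under the lock, compute without it, then publish only if the snapshot is still current) can be sketched in a self-contained form; the names below are illustrative, not the actual CombinedDeletionPolicy:

```java
import java.util.function.Function;

// Minimal sketch of the two-step publish pattern suggested above: do the
// expensive computation outside the lock, then store the result only if the
// key it was derived from is still current. Names are illustrative.
final class CachedInfo<K, V> {
    private K currentKey;
    private V cachedValue;

    synchronized void setKey(K key) {
        this.currentKey = key;
    }

    /** Recompute the cached value without holding the lock during the expensive step. */
    void refresh(Function<K, V> expensiveCompute) {
        final K key;
        synchronized (this) {
            key = currentKey;                        // step 1: snapshot under the lock
        }
        final V value = expensiveCompute.apply(key); // expensive work, lock not held
        synchronized (this) {
            if (key == currentKey) {                 // step 2: publish only if still current
                cachedValue = value;
            }
        }
    }

    synchronized V get() {
        return cachedValue;
    }
}
```

The identity check (==) mirrors the safeCommit == this.safeCommit comparison in the snippet: a stale computation is simply dropped rather than overwriting newer state.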

@DaveCTurner DaveCTurner requested a review from dnhatn August 6, 2019 17:07
Member

@dnhatn dnhatn left a comment


LGTM. I left a comment around the mutex in the deletion policy for discussion.

@DaveCTurner
Member Author

Ah yes, you're right: I missed that onInit is also synchronized. Thanks for the suggestion, I've implemented that.

@DaveCTurner
Member Author

@elasticmachine please run elasticsearch-ci/default-distro

Contributor

@ywelsch ywelsch left a comment


Left one more comment, o.w. looking good

@DaveCTurner DaveCTurner changed the title Only retain reasonable history for peer recoveries Only retain reasonable history for peer recoveries Aug 7, 2019
Contributor

@henningandersen henningandersen left a comment


Left a couple of comments, looking good otherwise.

Contributor

@henningandersen henningandersen left a comment


LGTM.

@DaveCTurner
Member Author

@elasticmachine please run elasticsearch-ci/1

@original-brownbear
Contributor

Merging + back-porting this for @DaveCTurner as discussed with him.

@original-brownbear original-brownbear merged commit fd4acb3 into elastic:master Aug 8, 2019
original-brownbear added a commit that referenced this pull request Aug 8, 2019
@DaveCTurner DaveCTurner deleted the 2019-08-05-avoid-unreasonable-ops-based-recoveries branch August 21, 2019 08:23
@mfussenegger mfussenegger mentioned this pull request Mar 26, 2020

Labels

:Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.) >enhancement v7.4.0 v8.0.0-alpha1

7 participants