Fix race condition in SnapshotBasedIndexRecoveryIT#79404
Fix race condition in SnapshotBasedIndexRecoveryIT#79404fcofdez merged 8 commits intoelastic:masterfrom
Conversation
If we don't cancel the re-location of the index to the same target node, it is possible that the recovery is retried, causing a race condition.
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
|
||
| targetMockTransportService.clearAllRules(); | ||
| channelRef.get().sendResponse(new IOException("unable to clean files")); | ||
| assertAcked( |
There was a problem hiding this comment.
I've been trying to find a way to ensure that the RecoveryTarget reference is released and therefore the snapshot file download permit is released but since it happens asynchronously I couldn't find a reliable way to be sure that the permit has been released. Maybe we should add a Thread.sleep here? 🤔
There was a problem hiding this comment.
Could we set index.allocation.max_retries: 1 rather than adding this filter? That way we can be sure that it's the failure that releases the permits and not the fact that the allocation filter causes allocation to be cancelled.
In terms of waiting for the permits to be released, maybe add a package-private method that exposes the RecoverySettings on the PeerRecoveryTargetService and then after updating the allocation filter you can assertBusy that all the permits can be acquired.
…index-recovery-it
|
@DaveCTurner would you mind taking a look into this when you have the chance? thanks! |
|
@elasticmachine update branch |
|
@elasticmachine run elasticsearch-ci/part-2 |
DaveCTurner
left a comment
There was a problem hiding this comment.
LGTM (with a couple of nits)
| Releasable snapshotDownloadPermit = peerRecoveryTargetService.tryAcquireSnapshotDownloadPermits(); | ||
| assertThat(snapshotDownloadPermit, is(notNullValue())); | ||
| snapshotDownloadPermit.close(); |
There was a problem hiding this comment.
Slight preference for using a try-with-resources here.
| .put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), "1s") | ||
| .put("index.routing.allocation.require._name", dataNodes.get(0)) | ||
| .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0) | ||
| .put("index.allocation.max_retries", 0) |
There was a problem hiding this comment.
Slight preference for using the setting directly rather than its literal name:
| .put("index.allocation.max_retries", 0) | |
| .put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0) |
…index-recovery-it
…index-recovery-it
If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, meaning that it's
possible that the available permit is granted to indexRecoveredFromSnapshot1
instead of to indexRecoveredFromSnapshot2.
Relates #79316
Closes #79420