Shorten time that snapshot finalization blocks repo #89572

original-brownbear merged 1 commit into elastic:main from original-brownbear:faster-snapshot-finalization
Conversation
For large snapshot clusters, the final cleanup step that removes previous shard metadata might have to delete tens of thousands of blobs. This can take many seconds, during which the repository is needlessly blocked from running subsequent finalizations or deletes. Similar to how the final cleanup step for deletes was made async and taken out of the finalization loop in #86514, this PR makes the final cleanup step for finalization async as well.
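The shape of the change can be sketched in plain Java (a sketch only, not the actual Elasticsearch code: the pool, method, and blob names below are hypothetical stand-ins). Offloading the stale-blob deletion to a snapshot thread pool lets finalization return to the caller immediately instead of blocking on the deletes:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of offloading the final cleanup of stale shard-generation blobs
// to a dedicated thread pool, so the caller can start the next repository
// operation right away instead of waiting for the deletes to finish.
public class AsyncCleanupSketch {

    // Stand-in for issuing repository blob deletes in the background;
    // resolves to how many blobs were "deleted" once the task completes.
    static Future<Integer> cleanUpAsync(ExecutorService snapshotPool, List<String> toDelete) {
        if (toDelete.isEmpty()) {
            return CompletableFuture.completedFuture(0);
        }
        return snapshotPool.submit(() -> {
            int deleted = 0;
            for (String blob : toDelete) {
                // a real implementation would delete the blob from the repository here
                deleted++;
            }
            return deleted;
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService snapshotPool = Executors.newFixedThreadPool(2);
        // hypothetical stale blob names left over from the previous shard generation
        Future<Integer> result = cleanUpAsync(snapshotPool, List.of("index-1", "index-2"));
        System.out.println(result.get()); // prints 2
        snapshotPool.shutdown();
    }
}
```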
Pinging @elastic/es-distributed (Team:Distributed)
fcofdez
left a comment
Looks good to me, but I left a question, just to ensure that I understand the change correctly.
```java
} catch (Exception e) {
    logger.warn("Failed to clean up old shard generation blobs", e);
if (toDelete.isEmpty() == false) {
    threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
```
Since we compute the list of files to delete based on the old/new repository data, would it be possible for two delete operations to try and delete an intersection of files? I guess that will be fine, the only minor issue I can see is that we might end up having more queued tasks in the SNAPSHOT thread pool, but the risk should be minimal?
> would it be possible for two delete operations to try and delete an intersection of files?
No, this should not happen, since we compute the files to delete from the repo data diff. So if a file A goes out of scope when moving from generation 1 to 2, it will never show up in generation 2 and thus won't be in the diff from 2 to 3.
> I guess that will be fine, the only minor issue I can see is that we might end up having more queued tasks in the SNAPSHOT thread pool, but the risk should be minimal?
This is actually something that happens in the many-shards snapshot benchmark. Assuming you have back-to-back snapshot finalizations, it's really a function of how fast cluster state updates are vs. how fast those deletes are. But it's not really an issue IMO; we're still only talking about O(10s) of tasks here in the worst cases (GCP + 50k indices).
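The disjointness argument can be illustrated with a small, self-contained Java sketch (the generation contents are made-up placeholders, not real shard-generation blob names). Because each cleanup is computed as the set difference between consecutive repository generations, and a blob that drops out of one generation never reappears in a later one, two cleanups can never target the same blob:

```java
import java.util.HashSet;
import java.util.Set;

public class ShardGenDiff {

    // Blobs present in the old generation but absent from the new one
    // are the ones that went out of scope and can be deleted.
    static Set<String> toDelete(Set<String> oldGen, Set<String> newGen) {
        Set<String> stale = new HashSet<>(oldGen);
        stale.removeAll(newGen);
        return stale;
    }

    public static void main(String[] args) {
        // hypothetical shard-generation blob sets across three repo generations
        Set<String> gen1 = Set.of("A", "B");
        Set<String> gen2 = Set.of("B", "C"); // A went out of scope
        Set<String> gen3 = Set.of("C");      // B went out of scope

        Set<String> firstCleanup = toDelete(gen1, gen2);  // {A}
        Set<String> secondCleanup = toDelete(gen2, gen3); // {B}

        // A never reappears in gen2, so it cannot show up in the 2->3 diff:
        // consecutive cleanups are disjoint by construction.
        System.out.println(firstCleanup + " " + secondCleanup); // prints [A] [B]
    }
}
```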
fcofdez
left a comment
LGTM. Thanks for the clarifications!
Thanks Francisco!
```java
}

public void onDone(SnapshotInfo snapshotInfo) {
    onDone.accept(snapshotInfo);
```
Should we assert that onResponse() has been called at this stage? I also wonder if renaming the method to something like afterCleanUp() would be clearer?
> Should we assert that onResponse() has been called at this stage?
I wanted to do that, but it turns out this isn't actually 100% true for the S3 waiting implementation ... which is kind of stupid, I'll open a discussion about removing it :)
```java
} catch (Exception e) {
    logger.warn("Failed to clean up old shard generation blobs", e);
} finally {
    finalizeSnapshotContext.onDone(snapshotInfo);
```
Do we really need to pass the SnapshotInfo instance back to where it originated from? 🤔
```diff
- finalizeSnapshotContext.onDone(snapshotInfo);
+ finalizeSnapshotContext.afterCleanUp();
```
Hmm, that is a good point, I just realized we indeed don't need to pass it back here. We used to need that because we computed SnapshotInfo in the repository, but that's not the case anymore. I'll follow up with a simplification PR, thanks for spotting this!
This fixes a minor bug introduced in #89572: we need to remove the snapshot's listeners from the repository's listener map right away, before trying to run the next repository operation, as we did prior to that change. Otherwise we run the risk of a fatal exception failing all operations for the repository, like the no-longer-master exception in #89625 did, even though the snapshot completed successfully. Closes #89625.
This produces a measurable speedup in the many-shards snapshotting benchmark: a couple of minutes saved end-to-end when creating 100 snapshots in a 25k-indices cluster.
relates #77466