Recursive Delete on BlobContainer #43281
original-brownbear merged 8 commits into elastic:master from original-brownbear:recursive-delete-on-container
Conversation
Pinging @elastic/es-distributed
ywelsch left a comment
I've left 2 comments; looking good otherwise.
    }
    toDelete.addAll(container.listBlobs().keySet());
    container.deleteBlobsIgnoringIfNotExists(toDelete);
    assertChildren(path, Collections.emptyList());
Should this also assert that the current path does not exist as a child of the parent path?
Right ... this was a little awkward with the repo root case, which I guess is why I dropped it, but yeah, let me add that real quick :)
    });
    }

    private void asyncDelete(ActionListener<Void> listener) throws IOException {
Why recurse by level here instead of listing all sublevels at once?
I'm assuming you're talking about flat-listing all the blobs and then deleting them one by one, like I did for GCS and S3?
Mainly I wanted to keep the size of the changes down for now, since that wouldn't gain us much here: the benefit of more efficient listing is fairly small on Azure because we delete blobs one by one, so the time spent on listing requests matters far less overall than it would if we deleted 100 or 1k blobs at a time. I could do a bigger refactoring, though, and move the parallelization logic into the blob store. Let me look into that :)
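To make the trade-off discussed here concrete, the sketch below contrasts the two listing strategies using a `TreeMap` as a stand-in for a blob store that returns keys lexicographically. The class and method names are purely illustrative, not the actual Elasticsearch `BlobContainer` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a flat (deep) prefix listing returns every blob under a
// "directory" in one query, so no per-level recursion is needed before deleting.
public class FlatVsRecursiveListing {

    // Flat listing: one range query over [prefix, prefix + '\uffff') captures
    // all keys under the prefix, including blobs in nested "subdirectories".
    static List<String> flatList(TreeMap<String, byte[]> store, String prefix) {
        return new ArrayList<>(
            store.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, byte[]> store = new TreeMap<>();
        store.put("base/a", new byte[0]);
        store.put("base/sub/b", new byte[0]);
        store.put("base/sub/deeper/c", new byte[0]);
        store.put("other/d", new byte[0]);

        // One flat listing finds all three blobs under base/ in a single pass ...
        List<String> toDelete = flatList(store, "base/");
        // ... and we can delete them one by one as we iterate.
        toDelete.forEach(store::remove);

        System.out.println(store.keySet()); // only other/d remains
    }
}
```

The level-by-level alternative would instead issue one listing call per directory and recurse into each child, which costs more listing RPCs; as noted above, that cost is only dominant when deletes themselves are batched.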
@ywelsch are you ok with optimizing the Azure implementation in a follow-up? I can only move to flat listing and deleting by moving all the logic for the parallelization into the blob store. The comment on the tests I addressed: I made them properly check the deleted directory itself as well :)
thanks @ywelsch
Follow up to #43281:
* Optimizing the Azure directory delete operation:
  * Same as with GCS and S3, we can simply flat-list a prefix and then delete as we iterate, instead of listing the directories recursively. This should require fewer actual list RPC calls, and the logic becomes simpler.
This is a prerequisite of #42189:
* Add a directory delete method to the blob container, specific to each implementation.
* Some notes on the implementations:
  * AWS + GCS: We can simply exploit the fact that both AWS and GCS return blobs lexicographically ordered, which allows us to delete in the same order that we receive the blobs from the listing request. For AWS this simply required listing without the delimiter setting (so we get a deep listing), and for GCS the same behavior is achieved by not using the directory mode on the listing invocation. The nice thing about this is that even for very large numbers of blobs the memory requirements are now capped nicely, since we go page by page when deleting.
  * For Azure I extended the parallelization to the listing calls as well and made it work recursively. I verified that this works with thread count `1`, since we only block once in the initial thread and then fan out to a "graph" of child listeners that never block.
  * HDFS and FS are trivial since we have directory delete methods available for them.
* Enhances third-party tests to ensure the new functionality works (I manually ran them for all cloud providers).
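The page-by-page deletion described for AWS and GCS can be sketched as follows. This is a simplified illustration, assuming a `TreeMap`-backed store and invented names (`PagedDelete`, `deleteByPrefix`, `PAGE_SIZE`), not the real S3/GCS client API; the key point is that a lexicographically ordered listing lets each page be deleted before the next is fetched, so memory stays bounded by one page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of page-by-page deletion over an ordered listing.
public class PagedDelete {

    static final int PAGE_SIZE = 2; // real pages are e.g. up to 1000 keys on S3

    // Fetch at most one page of keys under the prefix, bulk-delete it, repeat.
    // Because listings are lexicographically ordered, deleted keys never
    // reappear on later pages, and at most PAGE_SIZE keys are held in memory.
    static int deleteByPrefix(TreeMap<String, byte[]> store, String prefix) {
        int deleted = 0;
        while (true) {
            List<String> page = new ArrayList<>();
            for (String key : store.tailMap(prefix).keySet()) {
                if (!key.startsWith(prefix) || page.size() == PAGE_SIZE) break;
                page.add(key);
            }
            if (page.isEmpty()) return deleted; // no more blobs under the prefix
            page.forEach(store::remove);        // bulk delete of one page
            deleted += page.size();
        }
    }

    public static void main(String[] args) {
        TreeMap<String, byte[]> store = new TreeMap<>();
        store.put("snap/idx/0", new byte[0]);
        store.put("snap/idx/1", new byte[0]);
        store.put("snap/meta", new byte[0]);
        store.put("unrelated", new byte[0]);
        System.out.println(deleteByPrefix(store, "snap/")); // 3
        System.out.println(store.size());                   // 1
    }
}
```
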
Backport of elastic/elasticsearch#43281. This fixes snapshot deletion on Azure, where files would not have been deleted.