Skip to content

Recursive Delete on BlobContainer#43281

Merged
original-brownbear merged 8 commits intoelastic:masterfrom
original-brownbear:recursive-delete-on-container
Jun 18, 2019
Merged

Recursive Delete on BlobContainer#43281
original-brownbear merged 8 commits intoelastic:masterfrom
original-brownbear:recursive-delete-on-container

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Jun 17, 2019

This is a prerequisite of #42189:

  • Add directory delete method to blob container specific to each implementation:
    • Some notes on the implementations:
      • AWS + GCS: We can simply exploit the fact that both AWS and GCS return blobs lexicographically ordered which allows us to simply delete in the same order that we receive the blobs from the listing request. For AWS this simply required listing without the delimiter setting (so we get a deep listing) and for GCS the same behavior is achieved by not using the directory mode on the listing invocation. The nice thing about this is, that even for very large numbers of blobs the memory requirements are now capped nicely since we go page by page when deleting.
      • For Azure I extended the parallelization to the listing calls as well and made it work recursively. I verified that this works with thread count 1 since we only block once in the initial thread and then fan out to a "graph" of child listeners that never block.
      • HDFS and FS are trivial since we have directory delete methods available for them
  • Enhances third party tests to ensure the new functionality works (I manually ran them for all cloud providers)

@original-brownbear original-brownbear added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.3.0 labels Jun 17, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left 2 comments, looking good o.w.

}
toDelete.addAll(container.listBlobs().keySet());
container.deleteBlobsIgnoringIfNotExists(toDelete);
assertChildren(path, Collections.emptyList());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also assert that the current path does not exist as child in the parent path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ... this was a little awkward with the repo root case I guess that's why I dropped it, but yea let me add that real quick :)

});
}

private void asyncDelete(ActionListener<Void> listener) throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why recurse by level here and not list all sublevels at once?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming you're talking about flat listing all the blobs and then just deleting one by one like I did for GCS and S3?:
Mainly just to keep the changes size down a little for now as that didn't get us much here: The gain of being more effective when listing here is kinda small on Azure since we delete one by one (so the time that goes into listing requests is a lot less relevant overall than it would be if we deleted 100 or 1k blobs at a time). I can do a bit of a bigger refactoring though and move the parallelization logic into the blob store. Let me look into that :)

@original-brownbear
Copy link
Copy Markdown
Contributor Author

@ywelsch are you ok with optimizing the Azure implementation in a follow up? I can only move to flat listing and deleting by moving all the logic for the parallelization into org.elasticsearch.repositories.azure.AzureStorageService which changes the constructor signatures of a lot of stuff in the Azure plugin. Maybe it's better to isolate that change in a separate PR?

The comment on the tests I addressed and made them properly check the deleted directory itself as well :)

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Copy Markdown
Contributor Author

thanks @ywelsch

@original-brownbear original-brownbear merged commit 9a4ffa3 into elastic:master Jun 18, 2019
@original-brownbear original-brownbear deleted the recursive-delete-on-container branch June 18, 2019 11:28
original-brownbear added a commit that referenced this pull request Jul 1, 2019
* Follow up to #43281:
* Optimizing the Azure directory delete operation:
   * Same as with GCS and S3 we can simply flat list a prefix and then delete as we iterate instead of listing the directories recursively. This should require fewer actual list RPC calls and the logic becomes simpler
original-brownbear added a commit that referenced this pull request Jul 3, 2019
This is a prerequisite of #42189:

* Add directory delete method to blob container specific to each implementation:
  * Some notes on the implementations:
       * AWS + GCS: We can simply exploit the fact that both AWS and GCS return blobs lexicographically ordered which allows us to simply delete in the same order that we receive the blobs from the listing request. For AWS this simply required listing without the delimiter setting (so we get a deep listing) and for GCS the same behavior is achieved by not using the directory mode on the listing invocation. The nice thing about this is, that even for very large numbers of blobs the memory requirements are now capped nicely since we go page by page when deleting.
       * For Azure I extended the parallelization to the listing calls as well and made it work recursively. I verified that this works with thread count `1` since we only block once in the initial thread and then fan out to a "graph" of child listeners that never block.
       * HDFS and FS are trivial since we have directory delete methods available for them
* Enhances third party tests to ensure the new functionality works (I manually ran them for all cloud providers)
mkleen added a commit to crate/crate that referenced this pull request Jun 18, 2020
mkleen added a commit to crate/crate that referenced this pull request Jun 24, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 6, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
mkleen added a commit to crate/crate that referenced this pull request Jul 9, 2020
Backport of elastic/elasticsearch#43281
This fixes the snapshot deletion on azure where files would not
have been deleted.
mergify bot pushed a commit to crate/crate that referenced this pull request Jul 9, 2020
Backport of elastic/elasticsearch#43281
This fixes the snapshot deletion on azure where files would not
have been deleted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement v7.3.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants