-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-38701: [C++][FS][Azure] Implement DeleteDirContents()
#38888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
@Tom-Newton @felipecrv You may want to review this. |
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a PathNotFound error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be me being overly cautious, but maybe we shouldn't stop when Blobs.empty(), but instead run continue; and evaluate MoveToNextPage() again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing: instead of while, you can use for (; list_response.HasPage(); list_response.MoveToNextPage()) { here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of HasPage() and !Blobs.empty() checks are needed to avoid empty SubmitBatch() request.
If there are no blobs under the given path, ListBlobs() returns HasPage() == true and Blobs.empty() == true response.
I'll use for as you suggested. I also found an azure-sdk-for-cpp document that uses for: https://github.com/Azure/azure-sdk-for-cpp/blob/main/sdk/storage/MigrationGuide.md#listing-blobs-in-a-container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ListBlobs() returns HasPage() == true and Blobs.empty() == true response.
Is there a chance of that not being the last page? I see that you now only continue; instead of breaking the entire loop which I think is the more robust thing to do. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Hmm. I think that the case (empty blob but it's not the last response) isn't happen but I'm not sure because Azure Blob Storage is a proprietary product...)
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip when the blob_item.Name matches the options.Prefix because that could be the empty-dir marker blob (an empty blob ending with a /).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/*missing_dir_ok=*/true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll add the comment.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be interpreted as deleting the contents of the container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a special API for this case: DeleteRootDirContents()
arrow/cpp/src/arrow/filesystem/filesystem.h
Lines 233 to 249 in 63353ba
| /// Delete a directory's contents, recursively. | |
| /// | |
| /// Like DeleteDir, but doesn't delete the directory itself. | |
| /// Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents. | |
| virtual Status DeleteDirContents(const std::string& path, | |
| bool missing_dir_ok = false) = 0; | |
| /// Async version of DeleteDirContents. | |
| virtual Future<> DeleteDirContentsAsync(const std::string& path, | |
| bool missing_dir_ok = false); | |
| /// EXPERIMENTAL: Delete the root directory's contents, recursively. | |
| /// | |
| /// Implementations may decide to raise an error if this operation is | |
| /// too dangerous. | |
| // NOTE: may decide to remove this if it's deemed not useful | |
| virtual Status DeleteRootDirContents() = 0; |
So I think that we should return an error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what is the root dir? Doesn't that mean all the containers?
s3fs maps the root-dir concept to all containers:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2762
Refusing to delete contents of a container can be very limiting as CreateDir(container/) will create a container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we clarify this discussion?
What should be "root dir"?
: Empty location (no container and no path)containerandcontainer/: Container onlycontainer/path: Container and path
My opinion and the current implementation:
- Root dir
- Root dir
- Not root dir
FYI: s3fs:
: Empty path (no bucket and no key): Root dirbucketandbucket/: Bucket only: Not root dirbucket/key: Bucket and key: Not root dir
@Tom-Newton What do you think about this as an Azure Blob Storage user? I'm not an Azure Blob Storage user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider the storage account the root dir. So that means only 1 is the root dir. I believe this is how the fsspec implementation and the original arrow azurefs PR did it.
I think when comparing azure to s3, the Azure storage account is analogous to the S3 bucket and the Azure container is an extra level that S3 doesn't have. I think 1 doesn't really reference any storage on S3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also an allow_bucket_deletion [1] in s3fs. An analogous to that, but for Azure (let's say allow_container_deletion) would be relevant in this PR.
[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2718
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably I need to review the PR properly to understand the context, but I don't feel strongly. Deleting whole containers is something that will be done very rarely, and probably even more rare for it to be done from a filesystem abstraction. Personally I've always used terraform to create and delete containers.
I will try to review tomorrow morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allow_bucket_deletion option is for DeleteDir() not DeleteDirContents().
And this PR is for DeleteDirContents().
So it should be discussed separately.
I've opened a new issue for it: #38999
We should discuss it further on the issue not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I'm happy with how this is currently implemented.
For the sake of argument though... I think allow_container_deletion is relevant to DeleteDirContents in the case of location.container.empty(). I would expect the behaviour in this case to be to delete all containers in the storage account. However, I'm happy to stick with just returning an error because it seems quite dangerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If allow_container_deletion is relevant to DeleteDirContents() and location.container.empty() is true, the case should be treated in DeleteRootDirContents() not DeleteDirContents().
So we can defer it to #38999 anyway. :-)
|
Updated:
|
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what is the root dir? Doesn't that mean all the containers?
s3fs maps the root-dir concept to all containers:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2762
Refusing to delete contents of a container can be very limiting as CreateDir(container/) will create a container.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ListBlobs() returns HasPage() == true and Blobs.empty() == true response.
Is there a chance of that not being the last page? I see that you now only continue; instead of breaking the entire loop which I think is the more robust thing to do. 👍
|
Updated:
|
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When location.path is empty, the Prefix should not be set or be "" (without the trailing slash) as the blob names within the container don't ever start with a /.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EnsureTrailingSlash() does nothing for empty string:
arrow/cpp/src/arrow/filesystem/path_util.cc
Lines 151 to 159 in d66780d
| std::string EnsureTrailingSlash(std::string_view v) { | |
| if (v.length() > 0 && v.back() != kSep) { | |
| // XXX How about "C:" on Windows? We probably don't want to turn it into "C:/"... | |
| // Unless the local filesystem always uses absolute paths | |
| return std::string(v) + kSep; | |
| } else { | |
| return std::string(v); | |
| } | |
| } |
Should we add a (needless) special condition such as the following for readability?
if (!location.path.empty()) {
options.Prefix = internal::EnsureTrailingSlash(location.path);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just leave a comment saying that EnsureTrailingSlash doesn't alter empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. From the name, I though it ensured a trailing slash on the output. 🥲
I like this because Prefix is a nullable, so not doing anything to it makes even more sense than setting it to a "" value.
if (!location.path.empty()) {
options.Prefix = internal::EnsureTrailingSlash(location.path);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll use the if (!location.path.empty()).
Tom-Newton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could all of this be extracted to a member function of AzureFileSystemTest? It looks like this has been copy pasted in 3 of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll extract them as a method.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just leave a comment saying that EnsureTrailingSlash doesn't alter empty string.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I'm happy with how this is currently implemented.
For the sake of argument though... I think allow_container_deletion is relevant to DeleteDirContents in the case of location.container.empty(). I would expect the behaviour in this case to be to delete all containers in the storage account. However, I'm happy to stick with just returning an error because it seems quite dangerous.
felipecrv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few nits by @Tom-Newton for you to fix but correctness-wise, I think this is ready for merge.
2e40f6b to
651769a
Compare
|
Updated:
I'll merge this once CI is green. |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit db69b67. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…che#38888) ### Rationale for this change `DeleteDirContents()` deletes the given directory contents recursively like other filesystem implementations. Azure file system treats the following cases as root directory: * Empty container * Empty path ### What changes are included in this PR? List and delete approach is used with/without hierarchical namespace support. Because Azure doesn't provide "DeleteDirContents" API with hierarchical namespace support. We may want to use delete the given directory and create an empty approach instead of list and delete approach for performance but it may not be acceptable with some use cases. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: apache#38701 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
DeleteDirContents()deletes the given directory contents recursively like other filesystem implementations.Azure file system treats the following cases as root directory:
What changes are included in this PR?
List and delete approach is used with/without hierarchical namespace support. Because Azure doesn't provide "DeleteDirContents" API with hierarchical namespace support. We may want to use delete the given directory and create an empty approach instead of list and delete approach for performance but it may not be acceptable with some use cases.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
DeleteDirContents()#38701