-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-38700: [C++][FS][Azure] Implement DeleteDir()
#38793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
@Tom-Newton @felipecrv You may want to review this. |
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no actual directory but there could be blobs that are considered part of this implied directory. I think in this case we should delete those blobs.
I think that will require listing blobs for the prefix (internal::EnsureTrailingSlash(location.path)) then iterating through the result and deleting all those blobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Tom-Newton here. Azure API might have an endpoint to delete all blobs with a certain prefix, so we don't necessarily have to loop from the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it makes sense. I'll do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It seems that Azure Blob Storage doesn't provide an API that deletes blobs by prefix.
I'll implement it with the list and delete approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a arrow::fs::AssertFileInfo(fs_.get(), path, FileType::NotFound); at the end? Personally I would probably also add an assertion that the container does exist before deleting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.
Would be nice to add a non-empty test case for hierarchical namespace too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a test case for a non-empty "virtual directory" where we expect the contents of the directory to be deleted.
OK. I'll add it.
Would be nice to add a non-empty test case for hierarchical namespace too.
AzureHierarchicalNamespaceFileSystemTest.DeleteDirSuccess is for the case.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Tom-Newton here. Azure API might have an endpoint to delete all blobs with a certain prefix, so we don't necessarily have to loop from the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it delete all the blobs that start with path/to/dir/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make sense. I'll implement it and add a test for the case.
7e25d13 to
7212e6a
Compare
|
Updated:
|
cpp/src/arrow/filesystem/azurefs.cc
Outdated
| return Status::IOError("Failed to delete a blob: ", blob_item.Name, | ||
| ": " + container_client.GetUrl()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would save this on a Status variable and let all the loops continue, then check the saved status at the end.
This way we don't waste the effort made to list the blobs and a caller retrying after we return an error at the end is more likely to have less work to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could cancel the outer loop listing blobs if an error was detected, but I think GetResponse should be called on all the deferred responses we have sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to continue the loop to avoid wasting the list effort. Because we already submitted all delete requests by SubmitBatch(). This loop just checks the results.
I think
GetResponseshould be called on all the deferred responses we have sent.
Why? Is it for avoiding a resource leak? I think that not calling GetResponse() doesn't leak any resource. Because I think that GetResponse() just returns a sub response returned by SubmitBatch(). I think that all sub responses are managed by SubmitBatch() response and the response is already read.
See also a SubmitBatch() response example: https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#sample-response
Anyway, I'll check all deferred responses to show all failed blob names in error message. It'll help debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I thought GetResponse would block if necessary, but all the futures are waited for during SubmitBatch.
It could block, but in the specific case of batched requests, they are all at ready state ->
https://github.com/Azure/azure-sdk-for-cpp/blob/4a32d7266cfac8bfc0eb87feb56011361a36f43c/sdk/storage/azure-storage-blobs/src/blob_batch.cpp#L248
This a bit disappointing because it reduces the parallelism that could be exploited, but surely reduces the risk of API misuse.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
| auto list_response = container_client.ListBlobs(options); | ||
| while (list_response.HasPage() && !list_response.Blobs.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the ListBlobs errors handled here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry. I missed it. I'll add try/catch for it.
|
It's an Azurite bug: Azure/Azurite#2302 Can we skip the test only on macOS? |
|
Updated:
I'll merge this once CI is finished with green. |
When running Nothing is wrong with the buffers passed to the function, so it must be a false positive coming from OpenSSL code. |
|
@felipecrv Oh... Could you open a new issue for this with full log? I'll merge this for now because our CI doesn't have ASAN on macOS. (We have it for Linux but it doesn't report this error: https://github.com/apache/arrow/actions/runs/6970987063/job/18970050943?pr=38793 ) We can work on it as a separated task. |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7da7895. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change
`DeleteDir()` deletes the given directory recursively like other filesystem implementations.
### What changes are included in this PR?
* Container can be deleted with/without hierarchical namespace support.
* Directory can be deleted with hierarchical namespace support.
* Directory can't be deleted without hierarchical namespace support. But blobs under the given path can be deleted. So these blobs are deleted and the given virtual directory is also deleted.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* Closes: apache#38700
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
DeleteDir()deletes the given directory recursively like other filesystem implementations.What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
DeleteDir()#38700