Skip to content

Conversation

@kou
Copy link
Member

@kou kou commented Nov 27, 2023

Rationale for this change

DeleteDirContents() deletes the given directory contents recursively like other filesystem implementations.

Azure file system treats the following cases as root directory:

  • Empty container
  • Empty path

What changes are included in this PR?

List and delete approach is used with/without hierarchical namespace support. Because Azure doesn't provide "DeleteDirContents" API with hierarchical namespace support. We may want to use delete the given directory and create an empty approach instead of list and delete approach for performance but it may not be acceptable with some use cases.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@github-actions
Copy link

⚠️ GitHub issue #38701 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented Nov 27, 2023

@Tom-Newton @felipecrv You may want to review this.

Comment on lines 989 to 990
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a PathNotFound error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be me being overly cautious, but maybe we shouldn't stop when Blobs.empty(), but instead run continue; and evaluate MoveToNextPage() again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing: instead of while, you can use for (; list_response.HasPage(); list_response.MoveToNextPage()) { here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of HasPage() and !Blobs.empty() checks are needed to avoid empty SubmitBatch() request.
If there are no blobs under the given path, ListBlobs() returns HasPage() == true and Blobs.empty() == true response.

I'll use for as you suggested. I also found an azure-sdk-for-cpp document that uses for: https://github.com/Azure/azure-sdk-for-cpp/blob/main/sdk/storage/MigrationGuide.md#listing-blobs-in-a-container

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListBlobs() returns HasPage() == true and Blobs.empty() == true response.

Is there a chance of that not being the last page? I see that you now only continue; instead of breaking the entire loop which I think is the more robust thing to do. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Hmm. I think that the case (empty blob but it's not the last response) isn't happen but I'm not sure because Azure Blob Storage is a proprietary product...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip when the blob_item.Name matches the options.Prefix because that could be the empty-dir marker blob (an empty blob ending with a /).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should do it in #38772 .
We have a test for this case. So we can detect the case when we work on #38772 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/*missing_dir_ok=*/true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll add the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be interpreted as deleting the contents of the container?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a special API for this case: DeleteRootDirContents()

/// Delete a directory's contents, recursively.
///
/// Like DeleteDir, but doesn't delete the directory itself.
/// Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents.
virtual Status DeleteDirContents(const std::string& path,
bool missing_dir_ok = false) = 0;
/// Async version of DeleteDirContents.
virtual Future<> DeleteDirContentsAsync(const std::string& path,
bool missing_dir_ok = false);
/// EXPERIMENTAL: Delete the root directory's contents, recursively.
///
/// Implementations may decide to raise an error if this operation is
/// too dangerous.
// NOTE: may decide to remove this if it's deemed not useful
virtual Status DeleteRootDirContents() = 0;

So I think that we should return an error here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what is the root dir? Doesn't that mean all the containers?

s3fs maps the root-dir concept to all containers:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2762

Refusing to delete contents of a container can be very limiting as CreateDir(container/) will create a container.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clarify this discussion?

What should be "root dir"?

  1. : Empty location (no container and no path)
  2. container and container/: Container only
  3. container/path: Container and path

My opinion and the current implementation:

  1. Root dir
  2. Root dir
  3. Not root dir

FYI: s3fs:

  1. : Empty path (no bucket and no key): Root dir
  2. bucket and bucket/: Bucket only: Not root dir
  3. bucket/key: Bucket and key: Not root dir

@Tom-Newton What do you think about this as an Azure Blob Storage user? I'm not an Azure Blob Storage user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider the storage account the root dir. So that means only 1 is the root dir. I believe this is how the fsspec implementation and the original arrow azurefs PR did it.

I think when comparing azure to s3, the Azure storage account is analogous to the S3 bucket and the Azure container is an extra level that S3 doesn't have. I think 1 doesn't really reference any storage on S3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also an allow_bucket_deletion [1] in s3fs. An analogous to that, but for Azure (let's say allow_container_deletion) would be relevant in this PR.

[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2718

Copy link
Contributor

@Tom-Newton Tom-Newton Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I need to review the PR properly to understand the context, but I don't feel strongly. Deleting whole containers is something that will be done very rarely, and probably even more rare for it to be done from a filesystem abstraction. Personally I've always used terraform to create and delete containers.

I will try to review tomorrow morning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allow_bucket_deletion option is for DeleteDir() not DeleteDirContents().
And this PR is for DeleteDirContents().
So it should be discussed separately.
I've opened a new issue for it: #38999
We should discuss it further on the issue not here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm happy with how this is currently implemented.

For the sake of argument though... I think allow_container_deletion is relevant to DeleteDirContents in the case of location.container.empty(). I would expect the behaviour in this case to be to delete all containers in the storage account. However, I'm happy to stick with just returning an error because it seems quite dangerous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If allow_container_deletion is relevant to DeleteDirContents() and location.container.empty() is true, the case should be treated in DeleteRootDirContents() not DeleteDirContents().
So we can defer it to #38999 anyway. :-)

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Nov 28, 2023
@kou
Copy link
Member Author

kou commented Nov 28, 2023

Updated:

  • Use PathNotFound()
  • Use for
  • Add an argument name comment
  • Skip a test on macOS that has a problem with Azurite

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what is the root dir? Doesn't that mean all the containers?

s3fs maps the root-dir concept to all containers:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2762

Refusing to delete contents of a container can be very limiting as CreateDir(container/) will create a container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListBlobs() returns HasPage() == true and Blobs.empty() == true response.

Is there a chance of that not being the last page? I see that you now only continue; instead of breaking the entire loop which I think is the more robust thing to do. 👍

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Nov 29, 2023
@kou
Copy link
Member Author

kou commented Nov 29, 2023

Updated:

  • Accept DeleteDir("container")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When location.path is empty, the Prefix should not be set or be "" (without the trailing slash) as the blob names within the container don't ever start with a /.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EnsureTrailingSlash() does nothing for empty string:

std::string EnsureTrailingSlash(std::string_view v) {
if (v.length() > 0 && v.back() != kSep) {
// XXX How about "C:" on Windows? We probably don't want to turn it into "C:/"...
// Unless the local filesystem always uses absolute paths
return std::string(v) + kSep;
} else {
return std::string(v);
}
}

Should we add a (needless) special condition such as the following for readability?

if (!location.path.empty()) {
  options.Prefix = internal::EnsureTrailingSlash(location.path);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just leave a comment saying that EnsureTrailingSlash doesn't alter empty string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. From the name, I though it ensured a trailing slash on the output. 🥲

I like this because Prefix is a nullable, so not doing anything to it makes even more sense than setting it to a "" value.

if (!location.path.empty()) {
  options.Prefix = internal::EnsureTrailingSlash(location.path);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll use the if (!location.path.empty()).

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 30, 2023
Copy link
Contributor

@Tom-Newton Tom-Newton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy. Thanks

Comment on lines 674 to 686
Copy link
Contributor

@Tom-Newton Tom-Newton Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could all of this be extracted to a member function of AzureFileSystemTest? It looks like this has been copy pasted in 3 of the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll extract them as a method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just leave a comment saying that EnsureTrailingSlash doesn't alter empty string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm happy with how this is currently implemented.

For the sake of argument though... I think allow_container_deletion is relevant to DeleteDirContents in the case of location.container.empty(). I would expect the behaviour in this case to be to delete all containers in the storage account. However, I'm happy to stick with just returning an error because it seems quite dangerous.

Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few nits by @Tom-Newton for you to fix but correctness-wise, I think this is ready for merge.

@kou kou force-pushed the cpp-azurefs-delete-dir-contents branch from 2e40f6b to 651769a Compare December 1, 2023 02:26
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Dec 1, 2023
@kou
Copy link
Member Author

kou commented Dec 1, 2023

Updated:

  • Rebased on main
  • Don't set ListBlobOptions.Prefix when location.path is empty
  • Extract common test data creation codes as a method

I'll merge this once CI is green.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 1, 2023
@kou kou merged commit db69b67 into apache:main Dec 1, 2023
@kou kou deleted the cpp-azurefs-delete-dir-contents branch December 1, 2023 20:43
@kou kou removed the awaiting change review Awaiting change review label Dec 1, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit db69b67.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…che#38888)

### Rationale for this change

`DeleteDirContents()` deletes the given directory contents recursively like other filesystem implementations.

Azure file system treats the following cases as root directory:
* Empty container
* Empty path

### What changes are included in this PR?

List and delete approach is used with/without hierarchical namespace support. Because Azure doesn't provide "DeleteDirContents" API with hierarchical namespace support. We may want to use delete the given directory and create an empty approach instead of list and delete approach for performance but it may not be acceptable with some use cases. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: apache#38701

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Implement DeleteDirContents()

3 participants