-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15121: [C++] Implement max recursion on GcsFileSystem #11969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15121: [C++] Implement max recursion on GcsFileSystem #11969
Conversation
Recursive listing without limit is about as expensive as only listing the top-level directories in GCS. Therefore, it is just *more* efficient to filter the results on the client-side, as this requires fewer request than listing only only 0, 1, or N levels in a directory hierarchy. I also improved the tests to verify no objects with similar prefixes are returned, for example, when listing objects starting with 'aaa' we do not want 'aaab', but we want 'aaa/b'.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Just two minor comments.
| return result; | ||
| } | ||
|
|
||
| std::size_t Depth(arrow::util::string_view path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but as per the coding guidelines (which are copied over from the Google C++ style guide :-)), we generally use explicit-width types such as int32_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| arrow::fs::AssertFileInfo(fs.get(), PreexistingBucketPath(), FileType::Directory); | ||
| arrow::fs::AssertFileInfo(fs.get(), PreexistingObjectPath(), FileType::File); | ||
| for (auto const& info : hierarchy.contents) { | ||
| if (!fs::internal::IsAncestorOf(hierarchy.base_dir, info.path())) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could actually check that this entry still exists instead of ignoring it.
(same comment below too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| arrow::fs::AssertFileInfo(fs.get(), PreexistingBucketPath(), FileType::Directory); | ||
| arrow::fs::AssertFileInfo(fs.get(), PreexistingObjectPath(), FileType::File); | ||
| for (auto const& info : hierarchy.contents) { | ||
| if (!fs::internal::IsAncestorOf(hierarchy.base_dir, info.path())) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do the same change as well here? (check the expected type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, sorry about that.
|
Benchmark runs are scheduled for baseline = e7c2ead and contender = 61745be. 61745be is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Recursive listing without limit is about as expensive as only listing
the top-level directories in GCS. Therefore, it is just more efficient
to filter the results on the client-side, as this requires fewer request
than listing only only 0, 1, or N levels in a directory hierarchy.
I also improved the tests to verify no objects with similar prefixes are
returned, for example, when listing objects starting with 'aaa' we do
not want 'aaab', but we want 'aaa/b'.