-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14893: [C++] Allow creating GCS filesystem from URI #12475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is mostly adapted from s3. One place that it differs is it doesn't allow for username and password. The only thing that I could see supporting in the future is taking username as user to impersonate, but we can see if there is demand for that feature. I also added in two useful features in s3: - Option for setting a default location to create a bucket in. - Default key-value metadata to use for object creation if none is specified. Also fixed a typo in s3fs error message.
|
CC @coryan |
|
|
coryan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for all the fixes.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @emkornfield . A couple minor comments and questions.
| } | ||
| return std::make_shared<LocalFileSystem>(options, io_context); | ||
| } | ||
| if (scheme == "gs") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can/should we also accept "gcs" for compatibility with https://gcsfs.readthedocs.io/en/latest/#integration ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| "https"}; | ||
| "https", | ||
| /*default_bucket_location=*/{}, | ||
| /*default_metadata=*/nullptr}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of spelling all default parameter values explicitly, it would probably more readable and maintainable to set the non-default attributes individually e.g.;
GcsOptions options{};
options.credentials = std::make_shared<GcsCredentials>(google::cloud::MakeInsecureCredentials());
options.schema = "http";
return options;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
cpp/src/arrow/filesystem/gcsfs.h
Outdated
|
|
||
| std::string endpoint_override; | ||
| std::string scheme; | ||
| // \brief Location to use for creating buckets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of value can a location have? Is it a "region" like in S3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are locations that are not regions:
https://cloud.google.com/storage/docs/locations#location-mr
Probably not of much interest for analytics workloads, but who knows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've always thought of multi-regions as magic regions instead of a different concept. Unfortunately, people do use multi-regions for analytics workloads and experience the associated pain points.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
This reverts commit 30bd844.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks for the update @emkornfield
|
Benchmark runs are scheduled for baseline = 53e1296 and contender = 8244568. 8244568 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
Is any additional work required to make this accessible from pyarrow / |
|
This change hadn't been released. I think with this change datsets should work. I'll hopefully have a PR up this week for pyarrow wrappers |
This is mostly adapted from s3. One place that it differs
is it doesn't allow for username and password. The only thing
that I could see supporting in the future is taking username as
user to impersonate, but we can see if there is demand for that feature.
I also added in two useful features in s3:
is specified.
Also fixed a typo in s3fs error message.