-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14346: [C++] Implement GcsFileSystem::OpenOutputStream #11550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14346: [C++] Implement GcsFileSystem::OpenOutputStream #11550
Conversation
|
Rebased to resolve conflicts. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @coryan . Here are some comments.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears tell_ is never updated anywhere. Should you do it in Write perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does last_status also clear the error status or is it sticky? If it's sticky, then a failed Write would also return an error when calling Close?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is sticky. And yes, a failed Write() will make subsequent Close() fail. It is unadvisable to finalize a stream that failed, you don't know what is its state.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to define a macro that would allow you to write:
ARROW_GCS_RETURN_NOT_OK(stream.last_status());There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can write Result<gcs::EncryptionKey>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is absl already an include-time dependency of GCS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Some Abseil types are exposed in the public API for google-cloud-cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason for using this casing? The S3 implementation uses "Cache-Control", "Content-Type", etc., so I'd rather keep the same exact naming rather than reinvent another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the names of the fields in the GCS metadata:
https://cloud.google.com/storage/docs/json_api/v1/objects
It seems odd to make them match the protocol headers over HTTP (and the S3 API at that), 🤷
I have changed them, I can change them back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the concern here is to have a common vocabulary where possible. If the user can spell {"Content-Type": "text/html"} regardless of the filesystem type, it's better than having to special-case the metadata key name.
As for the exact spelling, well, S3 has the precedence here since it was implemented first :-) The fact that the names may also match HTTP header names actually makes them well-known to users, so that's not a problem IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but why not std::unordered_map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant to allow inserting arbitrary metadata strings? Will GCS balk if the user throws some unrecognized metadata keys here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCS accepts (mostly) arbitrary metadata keys:
https://cloud.google.com/storage/docs/metadata#custom-metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps choose those string sizes so that they are not multiples of the underlying buffer size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: buffer should never be null in this context. If it is, it's a bug in Read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
PTAL |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the update @coryan !
|
Benchmark runs are scheduled for baseline = 1420544 and contender = 2eafc60. 2eafc60 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.