Skip to content

Conversation

@coryan
Copy link
Contributor

@coryan coryan commented Oct 26, 2021

No description provided.

@github-actions
Copy link

@coryan coryan marked this pull request as ready for review October 26, 2021 21:54
@kszucs kszucs requested a review from pitrou October 27, 2021 11:31
@coryan
Copy link
Contributor Author

coryan commented Oct 29, 2021

Rebased to resolve conflicts.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @coryan . Here are some comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears tell_ is never updated anywhere. Should you do it in Write perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does last_status also clear the error status or is it sticky? If it's sticky, then a failed Write would also return an error when calling Close?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is sticky. And yes, a failed Write() will make subsequent Close() fail. It is unadvisable to finalize a stream that failed, you don't know what is its state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to define a macro that would allow you to write:

ARROW_GCS_RETURN_NOT_OK(stream.last_status());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can write Result<gcs::EncryptionKey>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is absl already an include-time dependency of GCS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Some Abseil types are exposed in the public API for google-cloud-cpp.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for using this casing? The S3 implementation uses "Cache-Control", "Content-Type", etc., so I'd rather keep the same exact naming rather than reinvent another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the names of the fields in the GCS metadata:

https://cloud.google.com/storage/docs/json_api/v1/objects

It seems odd to make them match the protocol headers over HTTP (and the S3 API at that), 🤷

I have changed them, I can change them back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the concern here is to have a common vocabulary where possible. If the user can spell {"Content-Type": "text/html"} regardless of the filesystem type, it's better than having to special-case the metadata key name.

As for the exact spelling, well, S3 has the precedence here since it was implemented first :-) The fact that the names may also match HTTP header names actually makes them well-known to users, so that's not a problem IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but why not std::unordered_map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to allow inserting arbitrary metadata strings? Will GCS balk if the user throws some unrecognized metadata keys here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCS accepts (mostly) arbitrary metadata keys:

https://cloud.google.com/storage/docs/metadata#custom-metadata

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps choose those string sizes so that they are not multiples of the underlying buffer size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: buffer should never be null in this context. If it is, it's a bug in Read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@coryan
Copy link
Contributor Author

coryan commented Nov 4, 2021

PTAL

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the update @coryan !

@pitrou pitrou closed this in 2eafc60 Nov 8, 2021
@ursabot
Copy link

ursabot commented Nov 8, 2021

Benchmark runs are scheduled for baseline = 1420544 and contender = 2eafc60. 2eafc60 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.67% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@coryan coryan deleted the ARROW-14346-implement-GcsFileSystem-streaming-writes branch November 8, 2021 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants