Skip to content

Conversation

@coryan
Copy link
Contributor

@coryan coryan commented Oct 15, 2021

No description provided.

@github-actions
Copy link

@coryan coryan marked this pull request as ready for review October 15, 2021 18:23
~GcsInputStream() override = default;

Status Close() override {
stream_.Close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to check if stream_.Close() is idempotent (not sure if this is a requirement but I believe all of the other interface implementations allow it). e.g. S3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting with v1.32.0 they are idempotent. Created ARROW-14385 to remind myself of updating any dependencies.

}

Result<int64_t> Tell() const override {
if (!stream_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for this check here, and not in the other methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other methods an invalid stream would result in an error anyway. Happy to add the additional checks if you think they make the code easier to grok.

std::string contents;
for (Result<std::shared_ptr<Buffer>> r = stream->Read(16); r.ok() && (*r)->size() != 0;
r = stream->Read(16)) {
auto buffer = *r;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is ASSERT_OK_AND_ASSIGN in Arrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did what you wanted, let me know if I missed the mark.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just questions about approach. on somethings and checking for consistency with other file system APIs

@coryan
Copy link
Contributor Author

coryan commented Oct 20, 2021

PTAL

@emkornfield
Copy link
Contributor

+1

@ursabot
Copy link

ursabot commented Oct 20, 2021

Benchmark runs are scheduled for baseline = 4ac62d5 and contender = a8e1c81. a8e1c81 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@coryan coryan deleted the ARROW-14345-implement-streaming-read branch October 20, 2021 19:06
@ElenaHenderson
Copy link
Contributor

ElenaHenderson commented Oct 22, 2021

Benchmark runs are scheduled for baseline = 4ac62d5 and contender = a8e1c81. a8e1c81 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

if (!stream_.status().ok()) {
return internal::ToArrowStatus(stream_.status());
}
return arrow::SliceMutableBufferSafe(std::move(buffer), 0, stream_.gcount());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is merely slicing the output buffer, but not resizing it. If the actual size is much smaller than nbytes, then there is a significant memory wastage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. On the other hand, a short read from a large buffer could require a lot of copying. I am not sure which is more likely to be a problem. I am happy to make a change if you have some data or intuition as to which one is worse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does GCS return short reads except for the EOF case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all practical purposes, only the EOF case. The exception would be some kind of transient error that is stubborn enough to exhaust the retry policy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. In any case, other implementations (such as S3) do resize the buffer, and there doesn't seem to be a reason to act differently here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou
Copy link
Member

pitrou commented Nov 1, 2021

I just posted a belated comment on this PR. Sorry for being so late :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants