Skip to content

Conversation

@coryan
Copy link
Contributor

@coryan coryan commented Nov 30, 2021

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@coryan coryan marked this pull request as ready for review November 30, 2021 17:09
@coryan
Copy link
Contributor Author

coryan commented Nov 30, 2021

The C++ / AMD64 Conda C++ failure seems unrelated.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and a couple small comments.

const std::string& path) {
return Status::NotImplemented("The GCS FileSystem is not fully implemented");
ARROW_ASSIGN_OR_RAISE(auto p, GcsPath::FromString(path));
auto metadata = impl_->GetObjectMetadata(p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, does this add a roundtrip to the server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. I am trying to ensure that Read() and ReadAt() and Seek() when going back are using the same generation of an object [*]. We could try to use undocumented (and likely to break) APIs to extract the generation without this roundtrip. If it turns out the roundtrip (and I should add, the additional API charges) are really important, then I would rather add a documented API to the C++ client library and then use that here.

[*]: you probably know this, but objects in GCS are versioned. You can have more than one version of the same object, and/or have the "latest" version replaced while you are reading from it. I would think we want all operations in one io::RandomAccessFile to refer to the same generation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think we want all operations in one io::RandomAccessFile to refer to the same generation.

Oh, definitely. I just wonder if there's a way to extract the metadata from the first stream creation? Perhaps this is something that the GCS C++ library can provide (assuming the information exists at all at the HTTP level)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thus my (maybe too obscure) note about "undocumented (and likely to break) APIs". The C++ client library currently returns the HTTP headers, which include thinks like x-goog-generation (not the full metadata). This is only intended for debugging, and I know of future changes that will make these headers unavailable (sometimes, when the user selects a non-HTTP transport).

I can change the C++ client to return the generation in a future-proof API. I would rather do that in a separate PR, after I fix the C++ client library (googleapis/google-cloud-cpp#7677).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change the C++ client to return the generation in a future-proof API. I would rather do that in a separate PR, after I fix the C++ client library

No problem from me.


std::string RandomLine(int lineno, std::size_t width) {
auto const fillers = std::string("abcdefghijlkmnopqrstuvwxyz0123456789");
std::uniform_int_distribution<std::size_t> d(0, fillers.size() - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably reuse random_string from arrow/testing/util.h instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random_string() produces non-printable characters, which are not the easiest thing to troubleshoot. random_ascii() could do it, but has a weird API using a uint8_t* output parameter. I would rather leave this as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough :-)

auto fs = internal::MakeGcsFileSystemForTest(TestGcsOptions());

auto result = fs->OpenInputFile(NotFoundObjectPath());
EXPECT_EQ(result.status().code(), StatusCode::IOError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use ASSERT_RAISES(IOError, fs->OpenInputFile(...)) here, and below as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

EXPECT_EQ(result.status().code(), StatusCode::IOError);

ASSERT_OK_AND_ASSIGN(info, fs->GetFileInfo(NotFoundObjectPath()));
result = fs->OpenInputFile(NotFoundObjectPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to call fs->OpenInputFile(info) instead here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks. Fixed.

arrow::fs::FileInfo info;
ASSERT_OK_AND_ASSIGN(info, fs->GetFileInfo(kPreexistingBucket));

auto result = fs->OpenInputFile(NotFoundObjectPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to call fs->OpenInputFile(info) instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, fixed too.

@pitrou pitrou closed this in 5227f24 Dec 1, 2021
@ursabot
Copy link

ursabot commented Dec 1, 2021

Benchmark runs are scheduled for baseline = f589610 and contender = 5227f24. 5227f24 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@coryan coryan deleted the ARROW-14347-implement-gcs-random-access-reads branch December 1, 2021 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants