-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14347: [C++] random access files for GcsFileSystem #11812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14347: [C++] random access files for GcsFileSystem #11812
Conversation
|
|
|
The |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and a couple small comments.
| const std::string& path) { | ||
| return Status::NotImplemented("The GCS FileSystem is not fully implemented"); | ||
| ARROW_ASSIGN_OR_RAISE(auto p, GcsPath::FromString(path)); | ||
| auto metadata = impl_->GetObjectMetadata(p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, does this add a roundtrip to the server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does. I am trying to ensure that Read() and ReadAt() and Seek() when going back are using the same generation of an object [*]. We could try to use undocumented (and likely to break) APIs to extract the generation without this roundtrip. If it turns out the roundtrip (and I should add, the additional API charges) are really important, then I would rather add a documented API to the C++ client library and then use that here.
[*]: you probably know this, but objects in GCS are versioned. You can have more than one version of the same object, and/or have the "latest" version replaced while you are reading from it. I would think we want all operations in one io::RandomAccessFile to refer to the same generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think we want all operations in one io::RandomAccessFile to refer to the same generation.
Oh, definitely. I just wonder if there's a way to extract the metadata from the first stream creation? Perhaps this is something that the GCS C++ library can provide (assuming the information exists at all at the HTTP level)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And thus my (maybe too obscure) note about "undocumented (and likely to break) APIs". The C++ client library currently returns the HTTP headers, which include thinks like x-goog-generation (not the full metadata). This is only intended for debugging, and I know of future changes that will make these headers unavailable (sometimes, when the user selects a non-HTTP transport).
I can change the C++ client to return the generation in a future-proof API. I would rather do that in a separate PR, after I fix the C++ client library (googleapis/google-cloud-cpp#7677).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change the C++ client to return the generation in a future-proof API. I would rather do that in a separate PR, after I fix the C++ client library
No problem from me.
|
|
||
| std::string RandomLine(int lineno, std::size_t width) { | ||
| auto const fillers = std::string("abcdefghijlkmnopqrstuvwxyz0123456789"); | ||
| std::uniform_int_distribution<std::size_t> d(0, fillers.size() - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably reuse random_string from arrow/testing/util.h instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random_string() produces non-printable characters, which are not the easiest thing to troubleshoot. random_ascii() could do it, but has a weird API using a uint8_t* output parameter. I would rather leave this as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough :-)
| auto fs = internal::MakeGcsFileSystemForTest(TestGcsOptions()); | ||
|
|
||
| auto result = fs->OpenInputFile(NotFoundObjectPath()); | ||
| EXPECT_EQ(result.status().code(), StatusCode::IOError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use ASSERT_RAISES(IOError, fs->OpenInputFile(...)) here, and below as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| EXPECT_EQ(result.status().code(), StatusCode::IOError); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(info, fs->GetFileInfo(NotFoundObjectPath())); | ||
| result = fs->OpenInputFile(NotFoundObjectPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to call fs->OpenInputFile(info) instead here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks. Fixed.
| arrow::fs::FileInfo info; | ||
| ASSERT_OK_AND_ASSIGN(info, fs->GetFileInfo(kPreexistingBucket)); | ||
|
|
||
| auto result = fs->OpenInputFile(NotFoundObjectPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to call fs->OpenInputFile(info) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, fixed too.
|
Benchmark runs are scheduled for baseline = f589610 and contender = 5227f24. 5227f24 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.