GH-43833: [Integration] validate that arrow IPC files include a valid embedded IPC stream#43834
GH-43833: [Integration] validate that arrow IPC files include a valid embedded IPC stream#43834bkietz wants to merge 3 commits intoapache:mainfrom
Conversation
| ARROW_ASSIGN_OR_RAISE(auto footer_cookie, arrow_file->ReadAt(file_size - 10, 10)); | ||
| auto footer_size = | ||
| bit_util::FromLittleEndian(util::SafeLoadAs<int32_t>(footer_cookie->data())); | ||
| int64_t footer_offset = 8 + file_size - footer_size - 10; |
There was a problem hiding this comment.
The 8 + is a bit cryptic. I realize it accounts for the 8-byte padded magic at the start of file, but a comment would make this easier to understand.
Or perhaps you can change the checks to something like:
// The footer is followed by the 10-byte footer length + magic
const int64_t footer_offset = file_size - footer_size - 10;
// Make sure the stream ends before the footer
ARROW_ASSIGN_OR_RAISE(const int64_t stream_size, stream->Tell());
const int64_t stream_end_offset = stream_size + 8;
if (stream_end_offset > footer_offset) {| } | ||
| ARROW_ASSIGN_OR_RAISE(int64_t stream_size, stream->Tell()); | ||
|
|
||
| if (footer_offset <= stream_size) { |
There was a problem hiding this comment.
Hmm, why not footer_offset < stream_size? It should be ok if the stream ends just before the footer.
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
Rationale for this change
IPC files should contain a valid IPC stream with EOS. Some arrow producers don't guarantee that the EOS is written. However there might be readers depending on simply reading the embedded IPC stream (and depending on the EOS to avoid reading the Footer as a corrupted message) so we should validate stream reading from IPC files as well as file reading.
What changes are included in this PR?
arrow-json-integration-test will now also try to read a file as a stream, which will fail if the file doesn't include an EOS or otherwise isn't a valid stream.
Are these changes tested?
yes
Are there any user-facing changes?
no