feat(arrow/ipc): implement lazy loading/zero-copy for IPC files#216
feat(arrow/ipc): implement lazy loading/zero-copy for IPC files#216zeroshade merged 2 commits intoapache:mainfrom
Conversation
|
@vtk9 can you give this a try and confirm that it addresses your issue? I added unit tests which confirm that the pointers match and that we're avoiding allocations but it would be great to confirm on your end that this reduces the memory usage you were seeing. |
|
@lidavidm @kou @joellubi would one of you be able to look this over and give a review? I wanna merge this and then kick off an RC as requested from #218 (comment) |
|
I'll try to get to it on Monday |
arrow/ipc/file_reader.go
Outdated
| func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
| var err error | ||
|
|
||
| if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Can we add a variable for len(Magic)*2+4 because we use this minimum size multiple times in this file?
arrow/ipc/file_reader.go
Outdated
| func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
| var err error | ||
|
|
||
| if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Can we give 4 a name something like footerSizeLen or something?
arrow/ipc/file_reader.go
Outdated
| return errNotArrowFile | ||
| } | ||
|
|
||
| size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
I think that this should be int32 not uint32: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
<FOOTER SIZE: int32>
There was a problem hiding this comment.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
There was a problem hiding this comment.
@lidavidm is correct, only the unsigned versions are available and you are expected to convert to signed integer yourself if you need.
arrow/ipc/file_reader.go
Outdated
| return errNotArrowFile | ||
| } | ||
|
|
||
| size := int64(binary.LittleEndian.Uint32(buf[:4])) |
| metaBytes := buf[:blk.meta] | ||
|
|
||
| prefix := 0 | ||
| switch binary.LittleEndian.Uint32(metaBytes) { |
There was a problem hiding this comment.
int32 not uint32?
https://github.com/apache/arrow/blob/313d11aa94c2be71142b55e3d8bb166d780c19c7/format/File.fbs#L45
metaDataLength: int;
arrow/ipc/file_reader.go
Outdated
|
|
||
| func (r *mappedReaderImpl) getFooterEnd() (int64, error) { return int64(len(r.data)), nil } | ||
|
|
||
| func (r *mappedReaderImpl) readFooter(f *footerBlock) error { |
There was a problem hiding this comment.
Can we unify more codes in this function with basicReaderImple.readFooter()?
It seems that they have many similar codes.
There was a problem hiding this comment.
It's a bit hard to unify these more since the core part of it (reading the bytes) is where the change is but I'll see what I can do
There was a problem hiding this comment.
updated by adding a getBytes method to the impls and having a single implementation for readFooter that uses it, unifying the implementations.
arrow/ipc/file_reader.go
Outdated
| return errNotArrowFile | ||
| } | ||
|
|
||
| size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
arrow/ipc/file_reader.go
Outdated
|
|
||
| var r io.Reader = sr | ||
| // check for an uncompressed buffer | ||
| if int64(uncompressedSize) != -1 { |
There was a problem hiding this comment.
nit: while this is existing code, maybe it would be safer to have uncompressedSize := int64(...) so you don't have to remember to convert it on use and so it's consistent with above
|
@kou can you take another look and let me know if I covered what you were thinking? Thanks! |
Rationale for this change
closes #207
What changes are included in this PR?
Adding new method
NewMappedFileReaderto ipc package which accepts a byte slice instead of aReaderAtSeeker. UpdatesipcSourceto reference the raw byte slices from the input directly instead of wrapping withbytes.NewReaderwhich forces copies viaRead,ReadFull, etc.Are these changes tested?
Unit tests added to confirm that the pointers match and that we aren't allocating unnecessarily.
Are there any user-facing changes?
Shouldn't be any user-facing changes other than a reduction in memory usage when reading non-compressed IPC data.