feat(arrow/ipc): implement lazy loading/zero-copy for IPC files by zeroshade · Pull Request #216 · apache/arrow-go

zeroshade · 2024-12-10T00:43:18Z

Rationale for this change

closes #207

What changes are included in this PR?

Adds a new function, NewMappedFileReader, to the ipc package; it accepts a byte slice instead of a ReaderAtSeeker. ipcSource now references the raw byte slices from the input directly instead of wrapping them with bytes.NewReader, which forces copies via Read, ReadFull, etc.
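
To make the copy-versus-view distinction concrete, here is a minimal sketch using only the standard library. It is not the code from this PR; the helper names readCopy and readView are hypothetical and exist only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readCopy (hypothetical) reads n bytes at off through a bytes.Reader.
// ReadAt fills a freshly allocated buffer, so the result is a copy.
func readCopy(data []byte, off int64, n int) ([]byte, error) {
	buf := make([]byte, n)
	if _, err := bytes.NewReader(data).ReadAt(buf, off); err != nil {
		return nil, err
	}
	return buf, nil
}

// readView (hypothetical) returns a subslice of data.
// No bytes are copied; the result shares data's backing array.
func readView(data []byte, off int64, n int) ([]byte, error) {
	if off < 0 || n < 0 || off+int64(n) > int64(len(data)) {
		return nil, io.ErrUnexpectedEOF
	}
	return data[off : off+int64(n)], nil
}

func main() {
	data := []byte("ARROW1..payload..ARROW1")
	c, _ := readCopy(data, 8, 7)
	v, _ := readView(data, 8, 7)
	fmt.Println("copy aliases input:", &c[0] == &data[8])
	fmt.Println("view aliases input:", &v[0] == &data[8])
}
```

Because the view aliases the input, the caller must keep the input slice (for example, a memory-mapped region) valid for as long as the returned bytes are in use.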

Are these changes tested?

Unit tests added to confirm that the pointers match and that we aren't allocating unnecessarily.
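
The two checks mentioned here (matching pointers and no extra allocations) can be sketched with the standard library alone. This is an assumption about the general shape of such a test, not the tests added in this PR; view, sharesMemory, and allocsForView are hypothetical names.

```go
package main

import (
	"fmt"
	"testing"
)

// view is a hypothetical stand-in for a zero-copy accessor.
func view(data []byte, off, n int) []byte { return data[off : off+n] }

// sharesMemory reports whether sub begins at data[off] in the same backing array.
func sharesMemory(data, sub []byte, off int) bool {
	return len(sub) > 0 && &sub[0] == &data[off]
}

// sink keeps the result reachable so the call is not optimized away.
var sink []byte

// allocsForView measures the average number of heap allocations per call to view.
func allocsForView(data []byte) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = view(data, 4, 8)
	})
}

func main() {
	data := make([]byte, 16)
	fmt.Println("pointers match:", sharesMemory(data, view(data, 4, 8), 4))
	fmt.Println("allocations per call:", allocsForView(data))
}
```

In a real test file the same two assertions would typically live inside a func TestXxx(t *testing.T) and fail through t.Errorf.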

Are there any user-facing changes?

Shouldn't be any user-facing changes other than a reduction in memory usage when reading non-compressed IPC data.

zeroshade · 2024-12-10T00:44:49Z

@vtk9 can you give this a try and confirm that it addresses your issue? I added unit tests which confirm that the pointers match and that we're avoiding allocations but it would be great to confirm on your end that this reduces the memory usage you were seeing.

zeroshade · 2024-12-13T21:42:47Z

@lidavidm @kou @joellubi would one of you be able to look this over and give a review?

I wanna merge this and then kick off an RC as requested from #218 (comment)

lidavidm · 2024-12-14T01:42:31Z

I'll try to get to it on Monday

kou · 2024-12-14T01:23:34Z

arrow/ipc/file_reader.go

+func (r *basicReaderImpl) readFooter(f *footerBlock) error {
+	var err error
+
+	if f.offset <= int64(len(Magic)*2+4) {


Can we add a variable for len(Magic)*2+4 because we use this minimum size multiple times in this file?

kou · 2024-12-14T01:24:54Z

arrow/ipc/file_reader.go

+func (r *basicReaderImpl) readFooter(f *footerBlock) error {
+	var err error
+
+	if f.offset <= int64(len(Magic)*2+4) {


Can we give 4 a name, something like footerSizeLen?
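
One possible shape for the two suggestions above, as a sketch only: footerSizeLen is the name proposed in the comment, while magic, minFileSize, and tooSmall are hypothetical names declared locally so the example is self-contained.

```go
package main

import "fmt"

// magic mirrors the Arrow file magic string; declared locally for this sketch.
var magic = []byte("ARROW1")

// footerSizeLen is the width in bytes of the little-endian footer size field.
const footerSizeLen = 4

// minFileSize names the expression len(Magic)*2+4 used in the diff:
// two magic strings plus the footer size field.
var minFileSize = int64(len(magic)*2 + footerSizeLen)

// tooSmall mirrors the check in readFooter: an offset at or below
// minFileSize cannot leave room for a footer.
func tooSmall(footerEnd int64) bool { return footerEnd <= minFileSize }

func main() {
	fmt.Println(minFileSize, tooSmall(16), tooSmall(17))
}
```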

kou · 2024-12-14T01:30:31Z

arrow/ipc/file_reader.go

+		return errNotArrowFile
+	}
+
+	size := int64(binary.LittleEndian.Uint32(buf[:4]))


I think that this should be int32 not uint32: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

<FOOTER SIZE: int32>

I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself

https://pkg.go.dev/encoding/binary#ByteOrder

@lidavidm is correct: only the unsigned versions are available, and you are expected to convert to a signed integer yourself if you need one.

kou · 2024-12-14T01:37:17Z

arrow/ipc/file_reader.go

+	metaBytes := buf[:blk.meta]
+
+	prefix := 0
+	switch binary.LittleEndian.Uint32(metaBytes) {


int32 not uint32?

https://github.com/apache/arrow/blob/313d11aa94c2be71142b55e3d8bb166d780c19c7/format/File.fbs#L45

metaDataLength: int;
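
For context on what the first four metadata bytes can hold: in the Arrow IPC encapsulated message format, a message starts with the continuation marker 0xFFFFFFFF followed by an int32 metadata size, while the legacy layout starts with the size directly. The sketch below assumes that layout and is not the switch from the diff; continuationToken and metadataPrefixLen are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// continuationToken is the 4-byte marker that precedes the metadata size
// in the current encapsulated message layout. As an int32 it reads as -1.
const continuationToken uint32 = 0xFFFFFFFF

// metadataPrefixLen (hypothetical) returns how many bytes precede the
// flatbuffer metadata: 8 when a continuation marker is present
// (marker + size), otherwise 4 (size only, legacy layout).
func metadataPrefixLen(meta []byte) (int, error) {
	if len(meta) < 4 {
		return 0, errors.New("metadata too short")
	}
	if binary.LittleEndian.Uint32(meta) == continuationToken {
		return 8, nil
	}
	return 4, nil
}

func main() {
	withMarker := []byte{0xFF, 0xFF, 0xFF, 0xFF, 0x10, 0, 0, 0}
	legacy := []byte{0x10, 0, 0, 0}
	a, _ := metadataPrefixLen(withMarker)
	b, _ := metadataPrefixLen(legacy)
	fmt.Println(a, b)
}
```

Comparing against the marker works equally well with the unsigned value (0xFFFFFFFF) or the signed one (-1); the signedness only matters once the bytes are interpreted as a length.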

kou · 2024-12-14T01:44:23Z

arrow/ipc/file_reader.go

+
+func (r *mappedReaderImpl) getFooterEnd() (int64, error) { return int64(len(r.data)), nil }
+
+func (r *mappedReaderImpl) readFooter(f *footerBlock) error {


Can we unify more of the code in this function with basicReaderImpl.readFooter()?
The two seem to share a lot of similar code.

It's a bit hard to unify these more since the core part of it (reading the bytes) is where the change is but I'll see what I can do

Updated by adding a getBytes method to the impls and having a single readFooter implementation that uses it, which unifies the two.

lidavidm · 2024-12-16T00:22:26Z

arrow/ipc/file_reader.go

+		return errNotArrowFile
+	}
+
+	size := int64(binary.LittleEndian.Uint32(buf[:4]))


I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself

https://pkg.go.dev/encoding/binary#ByteOrder

lidavidm · 2024-12-16T00:23:06Z

arrow/ipc/file_reader.go


-		var r io.Reader = sr
 		// check for an uncompressed buffer
 		if int64(uncompressedSize) != -1 {


nit: while this is existing code, maybe it would be safer to have uncompressedSize := int64(...) so you don't have to remember to convert it on use and so it's consistent with above
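
A sketch of the suggestion, assuming the Arrow IPC convention that a buffer in a compressed record batch is prefixed with an 8-byte little-endian int64 uncompressed length, with -1 meaning the body was left uncompressed. The function names are hypothetical. Converting once at the point of decoding keeps later comparisons free of casts.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// splitLengthPrefix (hypothetical) decodes the 8-byte length prefix as a
// signed int64 up front and returns it with the remaining body bytes.
func splitLengthPrefix(buf []byte) (int64, []byte, error) {
	if len(buf) < 8 {
		return 0, nil, errors.New("buffer too short for length prefix")
	}
	uncompressedSize := int64(binary.LittleEndian.Uint64(buf[:8]))
	return uncompressedSize, buf[8:], nil
}

// isUncompressed checks the sentinel without any conversion at the use site.
func isUncompressed(uncompressedSize int64) bool { return uncompressedSize == -1 }

func main() {
	buf := append(bytes8(0xFF), "abc"...)
	size, body, _ := splitLengthPrefix(buf)
	fmt.Println(size, isUncompressed(size), string(body))
}

// bytes8 returns eight copies of b; a small helper for the demo input.
func bytes8(b byte) []byte {
	out := make([]byte, 8)
	for i := range out {
		out[i] = b
	}
	return out
}
```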

zeroshade · 2024-12-16T19:38:37Z

@kou can you take another look and let me know if I covered what you were thinking? Thanks!

kou

+1

feat(arrow/ipc): implement lazy loading/zero-copy for IPC files

efff0a6

zeroshade requested review from joellubi, kou and lidavidm December 10, 2024 00:43

kou reviewed Dec 14, 2024

lidavidm approved these changes Dec 16, 2024

unify readFooter

0464196

zeroshade requested a review from kou December 16, 2024 19:38

kou approved these changes Dec 17, 2024

zeroshade merged commit 95c4e4d into apache:main Dec 18, 2024

zeroshade deleted the lazy-loading-zero-copy branch December 18, 2024 19:57

3 participants