array.Binary and array.String should use int64 offsets #195

@tosinva-stripe

Description

Describe the bug, including details regarding any error messages, version, and platform.

LargeBinary and LargeString use int64 offsets, but Binary and String use int32 offsets. This makes Binary and String susceptible to "slice bounds out of range" panics once the data in a column/array exceeds about 2 GiB (2^31 bytes), the point at which an int32 offset can no longer represent the position.
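To make the failure mode concrete, here is a minimal standard-library sketch of what happens when a running byte total is stored in an int32 versus an int64. It does not use the Arrow library; the helper names `cumulativeOffsets32` and `cumulativeOffsets64` and the value lengths are hypothetical, chosen only to push the total past 2^31.

```go
package main

import "fmt"

// cumulativeOffsets32 returns the end offset of each value when the running
// total is narrowed to int32. Hypothetical helper, not Arrow's code.
func cumulativeOffsets32(lengths []int64) []int32 {
	offsets := make([]int32, len(lengths))
	var total int64
	for i, n := range lengths {
		total += n
		offsets[i] = int32(total) // wraps negative once total exceeds 2^31-1
	}
	return offsets
}

// cumulativeOffsets64 does the same with int64 offsets, which do not wrap
// at these sizes.
func cumulativeOffsets64(lengths []int64) []int64 {
	offsets := make([]int64, len(lengths))
	var total int64
	for i, n := range lengths {
		total += n
		offsets[i] = total
	}
	return offsets
}

func main() {
	// Two 1 GiB values followed by a 100-byte value (illustrative sizes).
	lengths := []int64{1 << 30, 1 << 30, 100}

	fmt.Println(cumulativeOffsets32(lengths)) // [1073741824 -2147483648 -2147483548]
	fmt.Println(cumulativeOffsets64(lengths)) // [1073741824 2147483648 2147483748]
}
```

Using a negative offset as a slice bound is what produces a "slice bounds out of range" panic in Go.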

To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.

A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary.

The error looks like:

panic: runtime error: slice bounds out of range [:-2147483014]

goroutine 95 [running]:
github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
	/.../data/records.go:78 +0x582
main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
	/.../command.go:428 +0x125
created by main.RunValidateCmd in goroutine 1
	/.../command.go:174 +0xb90
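As a rough check on the trace above, and assuming the negative bound is an int32 offset that wrapped around exactly once (the trace itself does not confirm this), the byte position it stood for can be recovered by reinterpreting it as unsigned. The helper `unwrapInt32` is hypothetical and uses only the standard library.

```go
package main

import "fmt"

// unwrapInt32 reinterprets a wrapped int32 as the non-negative value it would
// have held in a wider type, assuming a single wraparound.
func unwrapInt32(v int32) int64 {
	return int64(uint32(v))
}

func main() {
	const bound = int32(-2147483014) // slice bound reported in the panic message

	actual := unwrapInt32(bound)
	fmt.Println(actual)             // 2147484282
	fmt.Println(actual - (1 << 31)) // 634 bytes past 2^31
}
```

Under that assumption, the offset being read was only a few hundred bytes past the 2^31 boundary, which is consistent with the int32 offset explanation given in the description.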

Version and platform

Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
Platform: Linux 20.04.1-Ubuntu  x86_64 x86_64 x86_64 GNU/Linux

Component(s)

Parquet, Other

Labels: Type: bug (Something isn't working)
