array.Binary and array.String should use int64 offsets #195
Closed
Labels
Type: bug (Something isn't working)
Description
Describe the bug, including details regarding any error messages, version, and platform.
LargeBinary and LargeString use int64 offsets; however, the Binary and String types use int32 offsets, which makes them susceptible to slice-index-out-of-bounds errors when the column/array is larger than ~2 GB (2^31 bytes).
To reproduce, try deserializing a parquet file larger than 2.2 GB.
A workaround is to force the go library to deserialize the field/column as LargeBinary instead of Binary:
- explicitly store the Arrow schema during write (see `store_schema`: https://arrow.apache.org/docs/cpp/parquet.html#roundtripping-arrow-types-and-schema), and
- have the schema explicitly use the large_binary or large_string type when defining the schema that is used to write the parquet files.
Error looks like:
panic: runtime error: slice bounds out of range [:-2147483014]
goroutine 95 [running]:
github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
/.../data/records.go:78 +0x582
main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
/.../command.go:428 +0x125
created by main.RunValidateCmd in goroutine 1
/.../command.go:174 +0xb90
Version and platform
Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
Platform: Linux 20.04.1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
Component(s)
Parquet, Other