[Go] [arrow/ipc/writer] voffsets are not updated when Binary-like arrays get truncated #41993

@notfilippo

Describe the bug, including details regarding any error messages, version, and platform.

I stumbled upon this bug while working with results returned from datafusion. In certain scenarios, multiple arrays from different record batches that represent the same column may share a single value buffer. The problem arises when writing those records with an ipc.Writer: if a column backed by such a shared buffer is of a binary-like type, the written data is invalid.

This happens because each array's value buffer is truncated to only the part referenced by its offsets, but the offsets themselves are never updated, so they can point past the end of the truncated values. Reading the binary-like array back via the ipc.Reader then fails with the error: string offsets out of bounds of data buffer.
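To make the mismatch concrete, here is a minimal standalone sketch of the mechanism using plain slices. It is not the actual ipc.Writer code: rebaseOffsets and inBounds are hypothetical helpers written for illustration, and the assumption is that the truncation keeps exactly values[first offset : last offset].

```go
package main

import "fmt"

// rebaseOffsets is a hypothetical helper (not part of the Arrow Go API).
// It returns a copy of offsets shifted so that the first offset is zero,
// which is what a values buffer truncated to
// values[offsets[0]:offsets[len(offsets)-1]] would require.
func rebaseOffsets(offsets []int32) []int32 {
	if len(offsets) == 0 {
		return nil
	}
	start := offsets[0]
	out := make([]int32, len(offsets))
	for i, off := range offsets {
		out[i] = off - start
	}
	return out
}

// inBounds reports whether every offset falls inside a values buffer
// of the given length (an offset equal to the length is allowed).
func inBounds(offsets []int32, valuesLen int) bool {
	for _, off := range offsets {
		if off < 0 || int(off) > valuesLen {
			return false
		}
	}
	return true
}

func main() {
	values := []byte("applepearbanana")
	offsets := []int32{5, 9, 15} // only "pear" and "banana"

	// Truncate the values buffer to the range the offsets refer to.
	truncated := values[offsets[0]:offsets[len(offsets)-1]]
	fmt.Println(string(truncated)) // pearbanana

	// The original offsets no longer fit: 15 > len("pearbanana") == 10.
	fmt.Println(inBounds(offsets, len(truncated))) // false

	// Shifting by the first offset makes them valid again.
	rebased := rebaseOffsets(offsets)
	fmt.Println(rebased, inBounds(rebased, len(truncated))) // [0 4 10] true
}
```

In this sketch the truncation and the offset shift have to happen together; doing only the first reproduces the out-of-bounds condition described above.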

How to reproduce:

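package main

import (
	"bytes"
	"fmt"
	"log"

	// Assumption: Arrow Go v16 module path; adjust to the version in use.
	"github.com/apache/arrow/go/v16/arrow"
	"github.com/apache/arrow/go/v16/arrow/array"
	"github.com/apache/arrow/go/v16/arrow/bitutil"
	"github.com/apache/arrow/go/v16/arrow/ipc"
	"github.com/apache/arrow/go/v16/arrow/memory"
)

func main() {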
	var buf bytes.Buffer
	buf.WriteString("apple")
	buf.WriteString("pear")
	buf.WriteString("banana")
	values := buf.Bytes()

	offsets := []int32{5, 9, 15} // <-- only "pear" and "banana"
	voffsets := arrow.Int32Traits.CastToBytes(offsets)

	validity := []byte{0}
	bitutil.SetBit(validity, 0)
	bitutil.SetBit(validity, 1)

	data := array.NewData(
		arrow.BinaryTypes.String,
		2, // <-- only "pear" and "banana"
		[]*memory.Buffer{
			memory.NewBufferBytes(validity),
			memory.NewBufferBytes(voffsets),
			memory.NewBufferBytes(values),
		},
		nil,
		0,
		0,
	)

	str := array.NewStringData(data)
	fmt.Println(str) // outputs: ["pear" "banana"]

	schema := arrow.NewSchema([]arrow.Field{
		{
			Name:     "string",
			Type:     arrow.BinaryTypes.String,
			Nullable: true,
		},
	}, nil)
	record := array.NewRecord(schema, []arrow.Array{str}, 2)

	var output bytes.Buffer
	writer := ipc.NewWriter(&output, ipc.WithSchema(schema))

	err := writer.Write(record)
	if err != nil {
		log.Fatal(err)
	}

	err = writer.Close()
	if err != nil {
		log.Fatal(err)
	}

	reader, err := ipc.NewReader(bytes.NewReader(output.Bytes()), ipc.WithSchema(schema))
	if err != nil {
		log.Fatal(err)
	}

	reader.Next()
	if reader.Err() != nil {
		log.Fatal(reader.Err()) // string offsets out of bounds of data buffer
	}
}

Component(s)

Go
