[Go] Partial struct column reads panic due to mismatched childFields/childReaders slices #628

@seongkim0228

Description


Describe the bug, including details regarding any error messages, version, and platform.

When reading a subset of columns from a struct field (partial column projection), the getReader function in pqarrow/file_reader.go fails to filter childFields in sync with childReaders, causing a length mismatch and potential panic.

In file_reader.go lines 594-604, only childReaders is pruned to remove nil entries, but childFields retains zero-valued arrow.Field{} entries for skipped children:

// because we performed getReader concurrently, we need to prune out any empty readers
childReaders = slices.DeleteFunc(childReaders,
	func(r *ColumnReader) bool { return r == nil })
if len(childFields) == 0 {
	return nil, nil
}
filtered := arrow.Field{
	Name: arrowField.Name, Nullable: arrowField.Nullable,
	Metadata: arrowField.Metadata, Type: arrow.StructOf(childFields...),
}
out = newStructReader(&rctx, &filtered, field.LevelInfo, childReaders, fr.Props)

The zero-valued fields then trigger a panic inside arrow.StructOf(childFields...), because their Type is nil:
func StructOf(fs ...Field) *StructType {
	n := len(fs)
	if n == 0 {
		return &StructType{}
	}
	t := &StructType{
		fields: make([]Field, n),
		index:  make(map[string][]int, n),
	}
	for i, f := range fs {
		if f.Type == nil {
			panic("arrow: field with nil DataType")
		}

Reproduction

package main

import (
	"bytes"
	"context"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "nested", Type: arrow.StructOf(
			arrow.Field{Name: "a", Type: arrow.PrimitiveTypes.Float64},
			arrow.Field{Name: "b", Type: arrow.PrimitiveTypes.Float64},
		)},
	}, nil)

	buf := new(bytes.Buffer)
	writer, _ := pqarrow.NewFileWriter(schema, buf, nil, pqarrow.DefaultWriterProps())
	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	sb := b.Field(0).(*array.StructBuilder)
	sb.Append(true)
	sb.FieldBuilder(0).(*array.Float64Builder).Append(1.0)
	sb.FieldBuilder(1).(*array.Float64Builder).Append(2.0)
	writer.Write(b.NewRecord())
	writer.Close()

	pf, _ := file.NewParquetReader(bytes.NewReader(buf.Bytes()))
	fr, _ := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)

	// Only read nested.a (leaf index 0), not nested.b (leaf index 1)
	partialLeaves := map[int]bool{0: true}
	fieldIdx, _ := fr.Manifest.GetFieldIndices([]int{0})

	// This panics due to childFields/childReaders length mismatch
	fr.GetFieldReader(context.Background(), fieldIdx[0], partialLeaves, []int{0})
}

Partial struct column reads are expected to work correctly, returning a reader for only the selected fields. Instead, the call panics because of the mismatched slice lengths.

Suggested Fix

Filter childFields alongside childReaders:

childReaders = slices.DeleteFunc(childReaders,
    func(r *ColumnReader) bool { return r == nil })
childFields = slices.DeleteFunc(childFields,
    func(f arrow.Field) bool { return f.Type == nil })

if len(childFields) == 0 {
    return nil, nil
}

Environment
arrow-go version: v18.5.0
Go version: 1.25.5

Component(s)

Parquet

Labels

Type: bug