fix(table/scanner): Fix nested field scan (#311)
Conversation
```diff
  // For dealing with nested fields (List, Struct, Map) if includeFieldIDs is true, then
  // the child fields will contain a metadata key PARQUET:field_id set to the field id.
- func TypeToArrowType(t iceberg.Type, includeFieldIDs bool) (arrow.DataType, error) {
+ func TypeToArrowType(t iceberg.Type, includeFieldIDs bool, useLargeTypes bool) (arrow.DataType, error) {
```
nit: add `useLargeTypes` to the docstring above
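One possible shape for the updated doc comment (a sketch only; the summary sentence and the exact wording of the `useLargeTypes` description are assumptions, not the author's text):

```go
// TypeToArrowType converts an iceberg type to the corresponding Arrow type.
// For dealing with nested fields (List, Struct, Map), if includeFieldIDs is
// true, then the child fields will contain a metadata key PARQUET:field_id
// set to the field id. If useLargeTypes is true, variable-length types are
// mapped to their Large (64-bit offset) Arrow variants, e.g. LargeString
// and LargeList, instead of the 32-bit-offset versions.
func TypeToArrowType(t iceberg.Type, includeFieldIDs bool, useLargeTypes bool) (arrow.DataType, error)
```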
```diff
  func (c convertToArrow) List(list iceberg.ListType, elemResult arrow.Field) arrow.Field {
  	elemField := c.Field(list.ElementField(), elemResult)
- 	return arrow.Field{Type: arrow.LargeListOfField(elemField)}
+ 	if c.useLargeTypes {
```
I was pretty vocal on this one on the Python side. I strongly believe that we should not expose the options of large types to the user, and that's the direction that we're heading with PyIceberg. In the end, it is up to Arrow to decide if you need large types, or if small types are sufficient.
That's actually pretty difficult to do, and would make schema conversion inconsistent.
Whether to use large types can't be determined until you have the data, at which point, if you're streaming, it's too late to switch: you can't safely change the schema in the middle of a stream. For example:
- We start reading a parquet file with a string column; the total column data in this file is only 100MB, so we use a regular String (small type) and start streaming record batches.
- One of the last files has 3GB of raw data in the string column, so we have to use LargeString for that column from that file.
- We can't cast the LargeString down to String, and we can't change the schema of the stream to switch the column to LargeString, so now we're stuck.
The same problem can occur for List/LargeList, depending on the total number of elements across the lists in a given column.
We also can't determine ahead of time, from the stats in the iceberg metadata alone, whether we should use large types. The only way to know in advance is to read the parquet file metadata for every data file, and then reconcile whether large types are needed, before we start producing record batches.
It looks like in the pyiceberg PR you linked, if I'm reading it correctly, you just automatically promote everything to large types when streaming to avoid the problems I mentioned above? That somewhat defeats the benefit, if the goal was to avoid using large types when they aren't needed (and in most cases they aren't).
kevinjqliu left a comment:
LGTM! I think we can revisit the discussion around large types later since the config is optional
Fixes #309
There was a combination of factors that caused the initial problem:
- The `PARQUET:field_id` metadata was not being set for children of List or Map typed fields.
- `ToRequestedSchema` led to a memory leak for list/map columns that needed to be fixed.

A unit test has been added to ensure we are properly able to read the `test_all_types` table and get the rows without error.