[C++][Parquet] Support row group filtering for nested paths

Currently, filtering row groups based on a predicate only supports non-nested paths. Statistics are only extracted when the schema field is a leaf node:

https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L160-L170

However, we call `ColumnChunkStatisticsAsExpression` with the struct parent, not with the struct field leaf. The `schema_field` passed to the function above is created with `match[0]`, i.e. only the first part of the matching field path:

https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L903
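To make the difference concrete, here is a minimal pure-Python sketch of the two ways of resolving a field path. It is a simplified model, not the Arrow implementation: `SchemaNode`, `resolve_first_only` and `resolve_full` are hypothetical names, and the path `[1, 1]` stands for "second top-level field, then its second child" (`bbox` then `xmax`).

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """Hypothetical stand-in for a schema field (not an Arrow class)."""
    name: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def resolve_first_only(fields, path):
    # Mirrors the behaviour described above: only path[0] is used,
    # so a nested reference resolves to the struct parent.
    return fields[path[0]]


def resolve_full(fields, path):
    # Walks every index of the path down to the referenced child.
    node = fields[path[0]]
    for index in path[1:]:
        node = node.children[index]
    return node


fields = [
    SchemaNode("geom"),
    SchemaNode("bbox", [SchemaNode(n) for n in ("xmin", "xmax", "ymin", "ymax")]),
]
path = [1, 1]  # bbox -> xmax

parent = resolve_first_only(fields, path)
leaf = resolve_full(fields, path)
print(parent.name, parent.is_leaf())  # bbox False
print(leaf.name, leaf.is_leaf())      # xmax True
```

With only the first index, the resolved field is the struct, so a leaf-only check rejects it and no statistics are used.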

---

To illustrate this, create a small test file with a nested struct column, written as two row groups:

```python
import pyarrow as pa
import pyarrow.parquet as pq

struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})

pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)
```

Reading this through the Datasets API with a filter _seems_ to filter this correctly:

```python
import pyarrow.dataset as ds
dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")

dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
#    geom                                          bbox
# 0     1  {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
# 1     2  {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}
```

However, that is only because the second step filters correctly with a nested field ref, i.e. an actual filter operation is applied after reading the data. If we look at an API that only does the row group filtering step, we can see that nothing is currently filtered out at the row group stage:

```python
In [2]: fragment = list(dataset.get_fragments())[0]

In [3]: fragment.split_by_row_group()
Out[3]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
Out[4]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]
```
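The leaf statistics needed to prune here should already be in the file metadata. The sketch below shows one way to read them and what a statistics-based check for `xmax <= 2` would conclude. The helper names `read_leaf_min_max` and `row_groups_for_upper_bound` are hypothetical, and the hard-coded statistics are what I would expect for the test file above (values `[1, 2, 3, 4]` in row groups of 2 rows), not output copied from a run.

```python
def read_leaf_min_max(path, leaf_path):
    """Return one (min, max) tuple, or None, per row group for a leaf column.

    `leaf_path` is the dotted Parquet column path, e.g. "bbox.xmax".
    """
    # Imported lazily so the pruning logic below can be run without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    result = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        entry = None
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            if column.path_in_schema == leaf_path:
                stats = column.statistics
                if stats is not None and stats.has_min_max:
                    entry = (stats.min, stats.max)
                break
        result.append(entry)
    return result


def row_groups_for_upper_bound(min_max_per_row_group, upper_bound):
    """Indices of row groups that may contain a value <= upper_bound."""
    keep = []
    for index, min_max in enumerate(min_max_per_row_group):
        # Without statistics the row group cannot be ruled out.
        if min_max is None or min_max[0] <= upper_bound:
            keep.append(index)
    return keep


# Assumed statistics for "bbox.xmax" in test_bbox_struct.parquet; with pyarrow
# installed they could be read with:
#   read_leaf_min_max("test_bbox_struct.parquet", "bbox.xmax")
expected_stats = [(1, 2), (3, 4)]
kept = row_groups_for_upper_bound(expected_stats, 2)
print(kept)  # [0]
```

Under these assumptions only the first row group can match, so `split_by_row_group` with this filter would ideally return a single fragment.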

For reference, the excerpt behind the first permalink above:

```cpp
std::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
    const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) {
  // For the remaining of this function, failure to extract/parse statistics
  // are ignored by returning nullptr. The goal is two fold. First
  // avoid an optimization which breaks the computation. Second, allow the
  // following columns to maybe succeed in extracting column statistics.

  // For now, only leaf (primitive) types are supported.
  if (!schema_field.is_leaf()) {
    return std::nullopt;
  }
```
