Skip to content

[C++][Parquet] Support row group filtering for nested paths #39064

@jorisvandenbossche

Description

@jorisvandenbossche

Currently the filtering of row groups based on a predicate only supports non-nested paths. When getting the statistics, this only works for a leaf node:

std::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) {
// For the remaining of this function, failure to extract/parse statistics
// are ignored by returning nullptr. The goal is two fold. First
// avoid an optimization which breaks the computation. Second, allow the
// following columns to maybe succeed in extracting column statistics.
// For now, only leaf (primitive) types are supported.
if (!schema_field.is_leaf()) {
return std::nullopt;
}

but we are calling this ColumnChunkStatisticsAsExpression function with the struct parent, and not with the struct field leaf. The schema_field passed to the function above is created with match[0], i.e. only the first part of the matching field path:

const SchemaField& schema_field = manifest_->schema_fields[match[0]];


To illustrate this, creating a small test file with a nested struct column and consisting of two row groups:

import pyarrow as pa
import pyarrow.parquet as pq

struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})

pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)

Reading this through the Datasets API with a filter seems to filter this correctly:

import pyarrow.dataset as ds
dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")

dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
#    geom                                          bbox
# 0     1  {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
# 1     2  {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}

However, that is only because we correctly filter this with a nested field ref in the second step, i.e. doing an actual filter operation after reading the data. But if we look at APIs that just does the row group filtering step, we can see this is currently not being filtered at the row group stage:

In [2]: fragment = list(dataset.get_fragments())[0]

In [3]: fragment.split_by_row_group()
Out[3]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
Out[4]: 
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
 <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions