-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
Currently the filtering of row groups based on a predicate only supports non-nested paths. When getting the statistics, this only works for a leaf node:
arrow/cpp/src/arrow/dataset/file_parquet.cc
Lines 160 to 170 in f7947cc
| std::optional<compute::Expression> ColumnChunkStatisticsAsExpression( | |
| const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) { | |
| // For the remaining of this function, failure to extract/parse statistics | |
| // are ignored by returning nullptr. The goal is two fold. First | |
| // avoid an optimization which breaks the computation. Second, allow the | |
| // following columns to maybe succeed in extracting column statistics. | |
| // For now, only leaf (primitive) types are supported. | |
| if (!schema_field.is_leaf()) { | |
| return std::nullopt; | |
| } |
but we are calling this ColumnChunkStatisticsAsExpression function with the struct parent, and not with the struct field leaf. The schema_field passed to the function above is created with match[0], i.e. only the first part of the matching field path:
arrow/cpp/src/arrow/dataset/file_parquet.cc
Line 903 in f7947cc
| const SchemaField& schema_field = manifest_->schema_fields[match[0]]; |
To illustrate this, creating a small test file with a nested struct column and consisting of two row groups:
import pyarrow as pa
import pyarrow.parquet as pq
struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})
pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)Reading this through the Datasets API with a filter seems to filter this correctly:
import pyarrow.dataset as ds
dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")
dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
# geom bbox
# 0 1 {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
# 1 2 {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}However, that is only because we correctly filter this with a nested field ref in the second step, i.e. doing an actual filter operation after reading the data. But if we look at APIs that just does the row group filtering step, we can see this is currently not being filtered at the row group stage:
In [2]: fragment = list(dataset.get_fragments())[0]
In [3]: fragment.split_by_row_group()
Out[3]:
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]
In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
Out[4]:
[<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]