Parquet statistics missing when reading Utf8 as Utf8View #12123

@alamb

Part of #11752

Describe the bug

One of the last remaining issues causing test failures when we enable reading StringView by default in #12092 is as follows:

failures:
    datasource::file_format::parquet::tests::fetch_metadata_with_size_hint
    datasource::file_format::parquet::tests::read_alltypes_plain_parquet
    datasource::file_format::parquet::tests::read_binary_alltypes_plain_parquet
    datasource::file_format::parquet::tests::read_merged_batches
    datasource::file_format::parquet::tests::test_statistics_from_parquet_metadata

To Reproduce

Check out #12092, and then run:

cargo test -p datafusion --lib -- file_format::parquet

Expected behavior

The tests should pass

Additional context

The problem is that the table schema is configured to be Utf8View but the file schema uses Utf8 (so the stats are returned as Utf8), and the accumulators can't handle updating a Utf8View value from a Utf8 value.

@XiangpengHao solved this issue in #11862 (comment) by threading the parameter through and then casting the file schema appropriately.
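To make the mismatch concrete, here is a minimal toy sketch in plain Rust. It is not DataFusion's or arrow's actual code: `StatValue`, `Utf8ViewMax`, and `cast_to_table_type` are hypothetical stand-ins invented for illustration. It only models the idea that an accumulator built for the table type rejects stats carrying the file type, and that casting the stats first avoids the error.

```rust
// Toy model only: these types are hypothetical stand-ins,
// not the real DataFusion or arrow API.
#[derive(Debug, Clone, PartialEq)]
enum StatValue {
    Utf8(String),     // type used by the file schema in this issue
    Utf8View(String), // type used by the table schema in this issue
}

/// Tracks a running maximum for a column declared as Utf8View.
#[derive(Debug, Default)]
struct Utf8ViewMax {
    max: Option<String>,
}

impl Utf8ViewMax {
    /// Rejects anything that is not Utf8View, mimicking the type mismatch.
    fn update(&mut self, value: &StatValue) -> Result<(), String> {
        match value {
            StatValue::Utf8View(s) => {
                // Keep the larger of the current max and the new value.
                if self.max.as_ref().map_or(true, |m| s > m) {
                    self.max = Some(s.clone());
                }
                Ok(())
            }
            other => Err(format!("expected Utf8View, got {other:?}")),
        }
    }
}

/// Casts a statistic read with the file type to the table type.
fn cast_to_table_type(value: StatValue) -> StatValue {
    match value {
        StatValue::Utf8(s) => StatValue::Utf8View(s),
        already_view => already_view,
    }
}
```

In this sketch, calling `update` with a `StatValue::Utf8` returns an error, while passing the same value through `cast_to_table_type` first lets the accumulator accept it. The real fix would presumably operate on arrow schemas and arrays rather than on a hand-written enum.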

The code isn't great to start with and adding a new parameter makes it worse.

I also think there are some bugs lurking there that we could perhaps fix if the code were more testable.

Labels: bug (Something isn't working)
