Incorrect results with parquet filtering pushdown enabled #4005

@alamb

Describe the bug
DataFusion returns different answers for the same query depending on whether parquet filter pushdown is enabled.

NOTE: pushdown filtering is not enabled by default (we are still working on it), so this issue is unlikely to affect users.

To Reproduce

  1. Download the data from repro.zip
  2. Run datafusion-cli

The query being run is:

select count(*) from foo where container = 'backend_container_0' OR pod = 'aqcathnxqsphdhgjtgvxsfyiwbmhlmg';

Expected behavior
The same answer should be produced with and without filter pushdown enabled. However, the answers are different.

Without filter pushdown, the count is 39982:

$ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=false datafusion-cli -f script.sql
...
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 39982           |
+-----------------+

With it enabled:

$ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true datafusion-cli -f script.sql
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.004 seconds.
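
To make the comparison repeatable, the two runs above could be wrapped in a small script. This is only a sketch: the helper names `extract_count`, `run_count`, and `compare_counts` are hypothetical, and it assumes `datafusion-cli` is on the PATH, that `script.sql` from repro.zip is in the current directory, and that the result is printed as an ASCII table like the ones shown above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the two runs; not part of DataFusion.

# Read datafusion-cli table output on stdin and print the first cell that
# consists only of digits (the COUNT value in the tables shown above).
extract_count() {
  awk -F'|' 'NF >= 3 { gsub(/ /, "", $2); if ($2 ~ /^[0-9]+$/) { print $2; exit } }'
}

# Run script.sql with pushdown set to "$1" (true or false) and print the count.
# Assumes datafusion-cli and script.sql are available as in the report.
run_count() {
  DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS="$1" datafusion-cli -f script.sql | extract_count
}

# Compare two counts; return non-zero when they differ.
compare_counts() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "MISMATCH: pushdown=false -> $1, pushdown=true -> $2"
    return 1
  fi
}

# Example invocation (left commented out so that defining the helpers has no side effects):
# compare_counts "$(run_count false)" "$(run_count true)"
```

With the outputs reported above, the final call would report a mismatch between 39982 and 0 and return a non-zero status.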

Additional context
Found by the test in #3976.
