Fix for selecting NaN values from Parquet files by Mytherin · Pull Request #16962 · duckdb/duckdb

Mytherin · 2025-04-02T20:28:12Z

The statistics in Parquet files are not sufficient to know whether or not NaN values are present in a file. There is work on exposing this in apache/parquet-format#196, but that proposal is two years old, so it is unclear if or when it will be merged and when writers will start supporting it.

For now we need to assume NaN is present in all floating point columns. This PR makes the following changes to facilitate this:

Stats emitted by floating point columns are now [min, NaN] instead of [min, max]. Because NaN in DuckDB is larger than all other values, we must set it as the max value.
For pruning, we do not use this bound but instead search two bounds: [min, max] and [NaN, NaN]. As a result, some scan pruning can still happen as long as the filter does not have an unbounded upper bound (i.e. we cannot prune anymore for `x > 5`, but we can prune for `x = 5` or for `x > 5 and x < 10`).
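The two-bound check described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not DuckDB's implementation: the names `Bound`, `cmp_total`, `interval_overlaps`, and `can_skip_row_group` are hypothetical, and a filter is assumed to reduce to an optional lower and upper bound on a single column.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bound:
    """One side of a range filter (hypothetical helper type)."""
    value: float
    inclusive: bool


def cmp_total(a: float, b: float) -> int:
    """Three-way compare where NaN equals itself and sorts above all other values."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        return int(a_nan) - int(b_nan)
    return (a > b) - (a < b)


def interval_overlaps(lo: float, hi: float,
                      lower: Optional[Bound], upper: Optional[Bound]) -> bool:
    """Can any value in the closed interval [lo, hi] satisfy the filter bounds?"""
    if lower is not None:
        c = cmp_total(hi, lower.value)
        if c < 0 or (c == 0 and not lower.inclusive):
            return False
    if upper is not None:
        c = cmp_total(lo, upper.value)
        if c > 0 or (c == 0 and not upper.inclusive):
            return False
    return True


def can_skip_row_group(stat_min: float, stat_max: float,
                       lower: Optional[Bound] = None,
                       upper: Optional[Bound] = None) -> bool:
    """Skip only if neither [min, max] nor [NaN, NaN] can contain a match."""
    nan = float("nan")
    hits_values = interval_overlaps(stat_min, stat_max, lower, upper)
    hits_nan = interval_overlaps(nan, nan, lower, upper)
    return not (hits_values or hits_nan)


# Row group with stats [0, 3]:
gt5 = Bound(5.0, inclusive=False)
eq5 = Bound(5.0, inclusive=True)
lt10 = Bound(10.0, inclusive=False)

print(can_skip_row_group(0.0, 3.0, lower=gt5))               # x > 5
print(can_skip_row_group(0.0, 3.0, lower=eq5, upper=eq5))    # x = 5
print(can_skip_row_group(0.0, 3.0, lower=gt5, upper=lt10))   # x > 5 and x < 10
```

In this model `x > 5` cannot be pruned because a NaN in the row group would satisfy it, while `x = 5` and `x > 5 and x < 10` both exclude NaN through their upper bound and so can still be pruned.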


This is a follow-up to #16962 that partly reverts the newly introduced NaN-handling behavior by putting it behind a new boolean named parameter `can_have_nan` in the `parquet_scan` and `COPY ... FROM (FORMAT PARQUET)` functions. Until the Parquet spec decides how to deal with NaN values in floating-point filter pruning, this is probably the better option to avoid a performance regression in the common case. Closes #17855.
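As a rough illustration of how the named parameter might be passed, the sketch below only builds the SQL text for a `parquet_scan` call. The helper `parquet_scan_query` and the file name `data.parquet` are hypothetical. Only the parameter name `can_have_nan` and the function name `parquet_scan` come from the description above, and the `name = value` form assumes DuckDB's usual named-parameter syntax for table functions.

```python
def parquet_scan_query(path: str, where: str, can_have_nan: bool) -> str:
    """Build a SELECT over parquet_scan with the can_have_nan named parameter.

    Hypothetical helper: it formats SQL text and does not run it.
    """
    quoted = path.replace("'", "''")  # escape single quotes inside the SQL string literal
    flag = "true" if can_have_nan else "false"
    return (
        f"SELECT * FROM parquet_scan('{quoted}', can_have_nan = {flag}) "
        f"WHERE {where}"
    )


# Ask the scan to assume NaN may be present, so that a filter such as
# x > 5 is not pruned away by min/max statistics alone.
print(parquet_scan_query("data.parquet", "x > 5", can_have_nan=True))
```

Running the resulting query requires a DuckDB build that includes the follow-up change. Older versions would presumably reject the unknown parameter.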

Mytherin added 3 commits April 2, 2025 21:42

Clean up string filter check

361cbc2

For pruning - search through the Parquet file using min/max and nan stats

2cf0401

Greater than also includes nan

1dec857

This was referenced Apr 2, 2025

Can't filter for NaN in polars replacement scan #16942

Closed

Invalid NaN statistics when reading from Parquet #7803

Closed

Avoid accessing stats for generated columns

37e6455

Mytherin marked this pull request as draft April 2, 2025 20:45

Mytherin marked this pull request as ready for review April 2, 2025 20:45

Fix test

c4dd4a3

duckdb-draftbot marked this pull request as draft April 3, 2025 08:06

Mytherin marked this pull request as ready for review April 3, 2025 08:07

Mytherin merged commit b310af2 into duckdb:main Apr 3, 2025
50 of 51 checks passed

lwwmanning mentioned this pull request Apr 24, 2025

NaN cannot be a min/max value of a primitive array vortex-data/vortex#3104

Merged

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025

vendor: Update vendored sources to duckdb/duckdb@b310af2

f1a9f10

Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025

vendor: Update vendored sources to duckdb/duckdb@b310af2

bf3124c

Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025

vendor: Update vendored sources to duckdb/duckdb@b310af2

ce11948

Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025

vendor: Update vendored sources to duckdb/duckdb@b310af2

d69876a

Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025

vendor: Update vendored sources to duckdb/duckdb@b310af2

f307a36

Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)

Maxxen mentioned this pull request Jun 11, 2025

Add option to control parquet NaN pruning #17883

Merged

Mytherin deleted the parquetnan branch June 12, 2025 15:29

Mytherin merged 5 commits into duckdb:main from Mytherin:parquetnan
