Fix for selecting NaN values from Parquet files #16962

Merged
Mytherin merged 5 commits into duckdb:main from Mytherin:parquetnan
Apr 3, 2025

Conversation

@Mytherin (Collaborator) commented Apr 2, 2025

The statistics in Parquet files are not sufficient to know whether NaN values are present in a file. There is work on exposing this in apache/parquet-format#196, but given that it is two years old, who knows if or when it will get merged and when writers will start supporting it.
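To illustrate the problem, here is a minimal Python sketch. It assumes a writer that skips NaN when computing min/max statistics (writers differ on this, so treat it as an assumption), and the helper names `writer_stats` and `duckdb_greater_than` are hypothetical, not DuckDB or Parquet APIs.

```python
import math

def writer_stats(values):
    # Assumption: the writer ignores NaN when computing min/max statistics.
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

def duckdb_greater_than(value, threshold):
    # DuckDB orders NaN above every other value, so NaN > 5.0 is true.
    if math.isnan(threshold):
        return False
    return math.isnan(value) or value > threshold

row_group = [1.0, float("nan"), 3.0]
stat_min, stat_max = writer_stats(row_group)

# Rows that actually satisfy `x > 5` under DuckDB's ordering.
matches = [v for v in row_group if duckdb_greater_than(v, 5.0)]

# Pruning on the raw [min, max] stats alone would skip this row group.
naive_prune = stat_max <= 5.0
```

The stats say nothing above 3.0 exists, yet the NaN row matches `x > 5`, so pruning on `[min, max]` alone would drop a valid result.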

For now we need to assume NaN is present in all floating point columns. This PR makes the following changes to facilitate this:

  • Stats emitted by floating point columns are now [min, NaN] instead of [min, max]. Because NaN in DuckDB is larger than all other values, we must set it as the max value.
  • For pruning, we do not use this bound, but instead check two bounds: [min, max] and [NaN, NaN]. As a result, some scan pruning can still happen as long as the filter has a finite upper bound (i.e. we can no longer prune for x > 5, but we can still prune for x = 5 or for x > 5 and x < 10).
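The two-bound check can be sketched in Python as follows. This is an illustration of the idea only, not the code in this PR: the functions `key`, `overlaps`, and `can_prune` and their parameters are hypothetical, and `None` stands for an unbounded side of the filter.

```python
import math

def key(v):
    # Total order in which NaN sorts above every other value, including +inf.
    return (1, 0.0) if math.isnan(v) else (0, v)

def overlaps(lo, hi, lo_incl, hi_incl, rng_min, rng_max):
    # Does the filter interval intersect the closed range [rng_min, rng_max]?
    if lo is not None:
        if key(lo) > key(rng_max) or (key(lo) == key(rng_max) and not lo_incl):
            return False
    if hi is not None:
        if key(hi) < key(rng_min) or (key(hi) == key(rng_min) and not hi_incl):
            return False
    return True

def can_prune(stat_min, stat_max, lo=None, hi=None, lo_incl=True, hi_incl=True):
    nan = float("nan")
    # Check the ordinary value range and the possible NaN "range" separately.
    hits_values = overlaps(lo, hi, lo_incl, hi_incl, stat_min, stat_max)
    hits_nan = overlaps(lo, hi, lo_incl, hi_incl, nan, nan)
    return not (hits_values or hits_nan)

# Row group with stats [0, 3]:
gt_5 = can_prune(0.0, 3.0, lo=5.0, lo_incl=False)                    # x > 5
eq_5 = can_prune(0.0, 3.0, lo=5.0, hi=5.0)                           # x = 5
between = can_prune(0.0, 3.0, lo=5.0, hi=10.0,
                    lo_incl=False, hi_incl=False)                    # x > 5 and x < 10
```

With an open upper bound the filter always reaches `[NaN, NaN]`, so `gt_5` is `False`; the other two filters exclude NaN and miss `[0, 3]`, so the row group can be skipped.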

@Mytherin Mytherin marked this pull request as draft April 2, 2025 20:45
@Mytherin Mytherin marked this pull request as ready for review April 2, 2025 20:45
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 3, 2025 08:06
@Mytherin Mytherin marked this pull request as ready for review April 3, 2025 08:07
@Mytherin Mytherin merged commit b310af2 into duckdb:main Apr 3, 2025
50 of 51 checks passed
krlmlr added 5 commits to duckdb/duckdb-r that referenced this pull request between May 15 and May 17, 2025
Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)
Mytherin added a commit that referenced this pull request Jun 12, 2025
This is a follow-up to #16962,
which partly reverts the newly introduced NaN-handling behavior by
putting it behind a new boolean named parameter `can_have_nan` in the
`parquet_scan` and `COPY ... FROM (FORMAT PARQUET)` functions.

Until the Parquet spec decides how to deal with NaN values from the
perspective of floating-point filter pruning, this is probably the
better option to avoid a performance regression in the common case.
Closes #17855
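As a rough illustration of how the named parameter might be passed, the Python sketch below only builds the SQL text. The helper `parquet_scan_query` and the file name `data.parquet` are hypothetical, the parameter spelling is taken from the commit message above, and its default value is not stated here. Running the resulting query requires a DuckDB build that includes the follow-up commit.

```python
def parquet_scan_query(path: str, can_have_nan: bool) -> str:
    # Escape single quotes so the path is a valid SQL string literal.
    escaped = path.replace("'", "''")
    flag = "true" if can_have_nan else "false"
    # DuckDB table functions accept named parameters as `name = value`.
    return f"SELECT * FROM parquet_scan('{escaped}', can_have_nan = {flag})"

query = parquet_scan_query("data.parquet", can_have_nan=True)
print(query)
```

Setting the flag to true opts back into the conservative NaN-aware handling from this PR for files that may contain NaN.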
@Mytherin Mytherin deleted the parquetnan branch June 12, 2025 15:29