Fix for selecting NaN values from Parquet files #16962
Merged
Mytherin merged 5 commits into duckdb:main on Apr 3, 2025
This was referenced Apr 2, 2025
krlmlr added 5 commits to duckdb/duckdb-r that referenced this pull request, May 15-17, 2025: Fix for selecting NaN values from Parquet files (duckdb/duckdb#16962)
Mytherin added a commit that referenced this pull request on Jun 12, 2025:
This is a follow-up to #16962 that partly reverts the newly introduced NaN-handling behavior by putting it behind a new boolean named parameter `can_have_nan` in the `parquet_scan` and `COPY ... FROM (FORMAT PARQUET)` functions. Until the Parquet spec decides how to deal with NaN values from the perspective of floating-point filter pruning, this is probably the better option to avoid a performance regression in the common case. Closes #17855.
The statistics in Parquet files are not sufficient to know whether or not `NaN` values are present in a file. There is work on exposing this in apache/parquet-format#196 (but given that that is two years old, who knows if/when that will get merged and when writers will start supporting it).

For now we need to assume `NaN` is present in all floating point columns. This PR makes the following changes to facilitate this:

- Statistics for floating point columns are set to `[min, NaN]` instead of `[min, max]`. Because `NaN` in DuckDB is larger than all other values, we must set it as the max value.
- Filters used for pruning are checked against both `[min, max]` and `[NaN, NaN]`. As a result, some scan pruning can still happen as long as the filter does not have an unbounded upper bound (i.e. we cannot prune anymore for `x > 5`, but we can prune for `x = 5` or for `x > 5 and x < 10`).
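
The pruning rule described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the code from this PR: `RangeFilter` and `can_prune` are hypothetical names, the filter bounds are assumed to be non-NaN, and the only behavior taken from the description is that `NaN` sorts above every other value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeFilter:
    """Hypothetical range filter; a bound of None means unbounded on that side."""
    lower: Optional[float] = None
    upper: Optional[float] = None
    lower_inclusive: bool = False
    upper_inclusive: bool = False

    def matches_nan(self) -> bool:
        # NaN sorts above every other value, so only an upper bound excludes it.
        return self.upper is None

    def may_match_range(self, lo: float, hi: float) -> bool:
        # Could any non-NaN value in [lo, hi] satisfy the filter?
        if self.lower is not None:
            if hi < self.lower or (hi == self.lower and not self.lower_inclusive):
                return False
        if self.upper is not None:
            if lo > self.upper or (lo == self.upper and not self.upper_inclusive):
                return False
        return True


def can_prune(stat_min: float, stat_max: float, flt: RangeFilter) -> bool:
    """Skip a row group only if neither [min, max] nor [NaN, NaN] can match."""
    return not flt.may_match_range(stat_min, stat_max) and not flt.matches_nan()


# Row group whose Parquet statistics report min = 1, max = 3.
greater_than_5 = RangeFilter(lower=5)                       # x > 5
equals_5 = RangeFilter(5, 5, True, True)                    # x = 5
between_5_and_10 = RangeFilter(lower=5, upper=10)           # x > 5 and x < 10

print(can_prune(1, 3, greater_than_5))    # False: a hidden NaN would match
print(can_prune(1, 3, equals_5))          # True
print(can_prune(1, 3, between_5_and_10))  # True
```

The key point is in `matches_nan`: a filter without an upper bound always has to keep the row group, because a `NaN` that the statistics do not report would satisfy it.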