forked from apache/datafusion
Support for metadata columns (location, size, last_modified) in ListingTableProvider
#74
Merged
phillipleblanc
merged 8 commits into
spiceai-43
from
phillip/250225-metadata-cols-listing
Mar 10, 2025
sgrebnov
approved these changes
Mar 10, 2025
phillipleblanc
added a commit
that referenced
this pull request
Apr 8, 2025
… ListingTableProvider (#74) * Initial work on metadata columns * Metadata filtering working * Working on plumbing to file scan config * wip * All wired up * Working! * Use MetadataColumn enum * Add integration tests for metadata selection + pushdown filtering UPSTREAM NOTE: This PR was submitted upstream: apache#15181
phillipleblanc
added a commit
that referenced
this pull request
Apr 17, 2025
… ListingTableProvider (#74) * Initial work on metadata columns * Metadata filtering working * Working on plumbing to file scan config * wip * All wired up * Working! * Use MetadataColumn enum * Add integration tests for metadata selection + pushdown filtering UPSTREAM NOTE: This PR was submitted upstream: apache#15181
phillipleblanc
added a commit
that referenced
this pull request
Apr 25, 2025
… ListingTableProvider (#74) * Initial work on metadata columns * Metadata filtering working * Working on plumbing to file scan config * wip * All wired up * Working! * Use MetadataColumn enum * Add integration tests for metadata selection + pushdown filtering UPSTREAM NOTE: This PR was submitted upstream: apache#15181
phillipleblanc
added a commit
that referenced
this pull request
May 7, 2025
… ListingTableProvider (#74) * Initial work on metadata columns * Metadata filtering working * Working on plumbing to file scan config * wip * All wired up * Working! * Use MetadataColumn enum * Add integration tests for metadata selection + pushdown filtering UPSTREAM NOTE: This PR was submitted upstream: apache#15181
sgrebnov
pushed a commit
that referenced
this pull request
May 22, 2025
… ListingTableProvider (#74) * Initial work on metadata columns * Metadata filtering working * Working on plumbing to file scan config * wip * All wired up * Working! * Use MetadataColumn enum * Add integration tests for metadata selection + pushdown filtering UPSTREAM NOTE: This PR was submitted upstream: apache#15181 # Conflicts: # datafusion/core/src/datasource/listing/table.rs # datafusion/core/tests/sql/path_partition.rs # datafusion/datasource/src/file_scan_config.rs # datafusion/datasource/src/mod.rs
sgrebnov
pushed a commit
that referenced
this pull request
May 26, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
kczimm
pushed a commit
that referenced
this pull request
Aug 19, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
kczimm
pushed a commit
that referenced
this pull request
Aug 21, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
Jeadie
pushed a commit
that referenced
this pull request
Sep 9, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
Jeadie
pushed a commit
that referenced
this pull request
Sep 12, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
peasee
pushed a commit
that referenced
this pull request
Oct 27, 2025
… ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
peasee
added a commit
that referenced
this pull request
Oct 27, 2025
* fix: Ensure only tables or aliases that exist are projected (#52) fix: More dangling references (#54) UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.
* Support for metadata columns (`location`, `size`, `last_modified`) in ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181
* Infer placeholder datatype for `Expr::InSubquery` (#80) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
* Infer placeholder datatype after `LIMIT` clause as `DataType::Int64` (#81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
* Do not double alias Exprs UPSTREAM NOTE: This was attempted to be fixed with apache#15008 but was closed This is the tracking issue on DataFusion: apache#14895 Do not double alias Exprs
* Add prefix to location metadata column (#82) UPSTREAM NOTE: This will not be upstreamed as is.
* Infer placeholder types for CASE expressions (#87) UPSTREAM NOTE: This has not been submitted upstream yet.
* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88) UPSTREAM NOTE: This has not been submitted upstream yet.
* Fix `Expr::infer_placeholder_types` inference to not fail (#89) UPSTREAM NOTE: This has not been submitted upstream yet.
* cherry-pick parquet patch (#94)
* Fix array types coercion: preserve child element nullability for list types (#96) UPSTREAM NOTE: This was submitted upstream and should be available in DF50 apache#17306
* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88) UPSTREAM NOTE: This has not been submitted upstream yet.
* do not enforce type guarantees on all Expr traversed in infer_placeholder_types (#97)
* Use UDTF function args in `LogicalPlan::TableScan` name (#98)
* use UDTF function args in LogicalPlan::TableScan name
* update test snapshots
* Implement timestamp_cast_dtype for SqliteDialect (#99)
* Use text for sqlite timestamp
* Add test
* Custom timestamp format for DuckDB (#102)
* Revert "cherry-pick parquet patch (#94)" This reverts commit d780cc2.
* Support ExprNamed arguments to Scalar UDFs (#104)
* support ExprNamed until 17379 ships
* add same exprnamed lifting to udtf
* resolve projection against `ListingTable` table_schema incl. partition columns (#106)
* fix: Ensure ListingTable partitions are pruned when filters are not used (#108)
* fix: Prune partitions when no filters are defined
* fix: Backport for DF49:
* review: Address comments
* FileScanConfig: Preserve schema metadata across serde boundary (#107)
* FileScanConfig: preserve schema metadata across serde boundary
* add test
* Merge conflict fixes UPSTREAM NOTE: this should not be upstreamed. This contains conflict fixes from various cherry-picks and differences in v50.
* update arrow-rs fork UPSTREAM NOTE: this should not be upstreamed
---------
Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Kevin Zimmerman <4733573+kczimm@users.noreply.github.com>
Co-authored-by: sgrebnov <sergei.grebnov@gmail.com>
Co-authored-by: jeadie <jack@spice.ai>
Co-authored-by: Jack Eadie <jack.eadie0@gmail.com>
Co-authored-by: Viktor Yershov <krinart@gmail.com>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: David Stancu <david@spice.ai>
Which issue does this PR close?
TBD
Rationale for this change
This enables another way to prune files that don't need to be read, similar to partitioning, but based on the metadata of the file itself. For example, it can be used to efficiently find all data in files that have changed since the last query:
SELECT * FROM test WHERE last_modified > {last_check_time}
What changes are included in this PR?
Adds a new option to `ListingOptions` for specifying certain metadata properties. The supported metadata properties are `location`, `size`, and `last_modified`. When those properties are included, they are added to the table schema (similar to partition columns), and their values are filled in from the `ObjectMeta` for the file.
Are these changes tested?
Yes, added tests to `path_partition.rs`.
Are there any user-facing changes?
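The pruning idea described under "Rationale for this change" can be illustrated with a minimal, standalone Rust sketch. It does not use the DataFusion API from this PR: `FileMeta`, `prune_by_last_modified`, and `sample_files` are hypothetical names introduced here, and `FileMeta` only stands in for the three per-file properties (`location`, `size`, `last_modified`) that the PR reads from `ObjectMeta`. The point is only to show files being skipped by their metadata before any file contents are read.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for the per-file metadata exposed as columns.
/// Not the real `ObjectMeta` type; field names follow the PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
    pub last_modified: SystemTime,
}

/// Hypothetical helper: keep only files modified strictly after
/// `last_check_time`, mirroring `WHERE last_modified > {last_check_time}`.
/// Files that fail the predicate would never need to be opened.
pub fn prune_by_last_modified(
    files: &[FileMeta],
    last_check_time: SystemTime,
) -> Vec<&FileMeta> {
    files
        .iter()
        .filter(|f| f.last_modified > last_check_time)
        .collect()
}

/// Builds a few sample entries. Paths, sizes, and timestamps are made up.
pub fn sample_files() -> Vec<FileMeta> {
    let at = |secs: u64| UNIX_EPOCH + Duration::from_secs(secs);
    vec![
        FileMeta {
            location: "data/a.parquet".to_string(),
            size: 1_024,
            last_modified: at(100),
        },
        FileMeta {
            location: "data/b.parquet".to_string(),
            size: 2_048,
            last_modified: at(200),
        },
        FileMeta {
            location: "data/c.parquet".to_string(),
            size: 4_096,
            last_modified: at(300),
        },
    ]
}
```

With the sample data, `prune_by_last_modified(&sample_files(), UNIX_EPOCH + Duration::from_secs(200))` keeps only `data/c.parquet`, because the comparison is strict like the `>` in the SQL example.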