-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Substrait: Round-trip generate_series table function scans via ReadRel advanced_extension #19407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kosiew
wants to merge
24
commits into
apache:main
Choose a base branch
from
kosiew:generate_series_error-16279
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+826
−122
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Update GenerateSeries table functions to log evaluated argument lists, including defaults and nulls. Expose them through table_function_details for accurate scans. Enhance Substrait ReadRel serialization to support generate_series scans by emitting an advanced extension with function name and arguments, documenting the payload format. Revise physical plan deserialization to initialize generate_series tables with evaluated arguments and null metadata, maintaining context of the table function.
Parse advanced extension metadata to detect table function invocations, resolve arguments, and build scans with the resulting providers. Introduce a lookup hook in SubstraitConsumer for table functions with a default implementation linked to session state registrations.
Implement tests for generate_series and range, ensuring the correctness of serialized plan metadata and query results. Register the new suite in the Substrait test module manifest.
Use the public byte slice accessor to decode table function metadata, eliminating the need for private APIs. Directly return planning errors for missing table functions to avoid nested Result types and potential compilation failures.
- Change read location from NamedTable.advanced_extension to ReadRel.advanced_extension. - Rename TableFunctionMetadata to TableFunctionReadRelExtension for consistency. - Update argument type from Vec<Expression> to Vec<Literal>. - Implement type URL validation to check type_url == TABLE_FUNCTION_TYPE_URL before decoding. - Properly convert literals using from_substrait_literal() and wrap in Expr::Literal(value, None).
… Substrait producer
…ils; add docs and tests
…stants and update tests
Eliminate the arg_count field from GenSeriesArgs::ContainsNull and streamline the code. Update evaluated_args() to use a fixed count of 3 NULLs, in line with the protobuf schema. Also unify error handling by replacing plan_datafusion_err! with plan_err! in read_rel.rs. Fix README formatting by removing empty backticks for improved readability.
…ate Substrait conversion utilities
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
Substrait round-trip tests and ~35 sqllogictest cases were failing with errors like "No table named 'generate_series()'" when deserializing a Substrait plan back into a DataFusion logical plan.
The root cause is that table function invocations (e.g.
generate_series,range) were being serialized as ordinaryReadRel::NamedTablescans, so the consumer attempted to resolve a table named like the function call rather than reconstructing a table function scan.This change adds a stable mechanism to preserve table-function intent and evaluated arguments in the Substrait plan, enabling correct reconstruction during consumption and eliminating the "No table named ..." failures.
What changes are included in this PR?
Catalog API extension:
Added
TableProvider::table_function_details()(defaultNone) and aTableFunctionDetailsstruct carrying:&'static str)&[ScalarValue])Kept a convenience
table_function_name()helper that delegates totable_function_details().Expose shared scalar constant:
datafusion_common::scalar::NANOS_PER_DAYpublic and reused it in date handling to avoid local redefinition.Built-in table function provider updates:
GenerateSeriesTableto cache evaluated arguments and to implementtable_function_details()so the producer can serialize table function calls without downcasting.NULL/ missing inputs and to always represent table function calls with a complete 3-argument list (start, end, step) where applicable.Substrait producer support (logical plan):
Added a
ReadRel.advanced_extensionpayload when aTableScanoriginates from a table function provider.Introduced a dedicated protobuf message
TableFunctionReadRelExtensionencoded asgoogle.protobuf.Anyusing a sharedtype_urlconstant:type.googleapis.com/datafusion.substrait.TableFunctionReadRelThis payload records the table function name and evaluated argument literals (after defaults/coercions).
Substrait consumer support (logical plan):
Added parsing of the
ReadRel.advanced_extensionpayload and reconstruction of the table function scan by:SubstraitConsumer::get_table_function)create_table_providerwith literalExprargumentsAdded a
provider_overridepath so a reconstructed provider can be used even when theNamedTablereference would not resolve.Code organization:
logical_plan::utils(producer + consumer), reducing duplication and improving maintainability.Dependencies:
datafusion-functions-tabledependency todatafusion-substraitso built-in table function behavior (and tests) are available in round-trip contexts.Tests:
Added a producer unit test ensuring
generate_seriestable scans emit the expected advanced_extension payload and argument normalization.Added an integration test
table_function_roundtrip.rsvalidating round-trip correctness for:generate_series(1, 4)range(2, 8, 3)generate_series(NULL, 5)Are these changes tested?
Before
After
Yes.
ReadRel.advanced_extensionis present, uses the expected type URL, and encodes normalized arguments forgenerate_series(1, 3).generate_seriesandrange, including a NULL-argument case.These tests exercise both the new producer metadata emission path and the new consumer reconstruction path, covering the scenario that caused the sqllogictest failures.
Are there any user-facing changes?
Improved Substrait round-trip behavior for queries using built-in table functions like
generate_seriesandrange. Previously, such plans could fail to round-trip with "No table named ..." errors; with this change they can be reconstructed and executed correctly.New (optional) API hook for custom table providers:
TableProvider::table_function_details()allows custom table-function providers to participate in Substrait serialization without needing producer-side downcasting.If this new method is considered a public API surface change, the PR should carry the
api changelabel.LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.