[SPARK-46380][SQL]Replace current time/date prior to evaluating inline table expressions.#44316
[SPARK-46380][SQL]Replace current time/date prior to evaluating inline table expressions.#44316dbatomic wants to merge 11 commits intoapache:masterfrom
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
beliefer
left a comment
There was a problem hiding this comment.
I think we need a discussion.
SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)
The three call of CURRENT_TIMESTAMP() should have the same value?
That's right. All the invocations of CURRENT_TIMESTAMP/CURRENT_DATE/NOW() should be replaced with single value which represents time of query arrival. We already do this for majority of scenarios (e.g. if time function is pretty much anywhere else in the query). The bug this PR tries to solve is that we don't do this for inline tables. |
|
I checked in Postgres. |
Yep, just to illustrate that replacement happens for inline tables: postgres=# SELECT * FROM (VALUES(EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp)));-------------------+-------------------+------------------- |
1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
|
I agree with the semantic changes. General comment, though: or: |
Yeah, I maybe can use this PR to deal with rand() problem as well (i.e. allow non-deterministic expressions in VALUES), because it is rather similar to CURRENT_* issue. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
| """SELECT COUNT(DISTINCT ct) FROM VALUES | ||
| | CURRENT_TIMESTAMP(), | ||
| | CURRENT_TIMESTAMP(), | ||
| | CURRENT_TIMESTAMP() as data(ct)""".stripMargin), Row(1)) |
There was a problem hiding this comment.
This seems not needed as we have the same test in the golden file
|
|
||
|
|
||
| -- !query | ||
| select count(distinct ct) from values now(), now(), now() as data(ct) |
There was a problem hiding this comment.
We can add a test for CURRENT_TIMESTAMP and remove https://github.com/apache/spark/pull/44316/files#r1430980065
| def earlyEvalPossible = | ||
| table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | ||
| if (earlyEvalPossible) EvalInlineTables(table) else table |
There was a problem hiding this comment.
| def earlyEvalPossible = | |
| table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | |
| if (earlyEvalPossible) EvalInlineTables(table) else table | |
| val earlyEvalPossible = table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | |
| if (earlyEvalPossible) EvalInlineTables(table) else table |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
| */ | ||
| object EvalInlineTables extends Rule[LogicalPlan] with CastSupport { | ||
| override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithSubqueriesAndPruning( | ||
| AlwaysProcess.fn, ruleId) { |
There was a problem hiding this comment.
let's add pruning for ResolvedInlineTable
| select count(distinct ct) from values now(), now(), now() as data(ct); | ||
|
|
||
| -- current_timestamp() should be kept as tempResolved inline expression. | ||
| select count(distinct ct) from values current_timestamp(), current_timestamp() as data(ct); |
There was a problem hiding this comment.
Shall we add tests mixed current_timestamp and other deterministic function?
There was a problem hiding this comment.
it's testing the correct value using count distinct.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
…mizer/finishAnalysis.scala Co-authored-by: Jiaan Geng <beliefer@163.com>
| InternalRow.fromSeq(row.zipWithIndex.map { case (e, ci) => | ||
| val targetType = fields(ci).dataType | ||
| try { | ||
| val castedRows: Seq[Seq[Expression]] = table.rows.map { row => |
There was a problem hiding this comment.
It seems we only need the Seq[Expression] here.
There was a problem hiding this comment.
it's a table (rows X columns)
There was a problem hiding this comment.
I know that. You means the X columns for each row is different?
| * @param output list of column attributes | ||
| * @param rows expressions for the data rows | ||
| */ | ||
| case class ResolvedInlineTable(rows: Seq[Seq[Expression]], output: Seq[Attribute]) |
There was a problem hiding this comment.
Shall we simplify rows: Seq[Seq[Expression]] to exprs: Seq[Expression]?
There was a problem hiding this comment.
@dbatomic After review this PR again. I'm sorry for the above comment.
|
The failed test is unrelated, thanks, merging to master/3.5! |
…ne table expressions With this PR proposal is to do inline table resolution in two phases: 1) If there are no expressions that depend on current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.) they will be evaluated as part of ResolveInlineTable rule. 2) Expressions that do depend on CURRENT_* evaluation will be kept as expressions and they evaluation will be delayed to post analysis phase. This PR aims to solve two problems with inline tables. Example1: ```sql SELECT COUNT(DISTINCT ct) FROM VALUES (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()) as data(ct) ``` Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1. Example 2: ```sql CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP()) ``` In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution. See section above. New test that validates this behaviour is introduced. No. Closes #44316 from dbatomic/inline_tables_curr_time_fix. Lead-authored-by: Aleksandar Tomic <aleksandar.tomic@databricks.com> Co-authored-by: Aleksandar Tomic <150942779+dbatomic@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5fe963f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…s and ResolveInlineTablesSuite ### What changes were proposed in this pull request? #44316 replace current time/date prior to evaluating inline table expressions. This PR propose to simplify the code for `ResolveInlineTables` and let `ResolveInlineTablesSuite` apply the rule `ResolveInlineTables`. ### Why are the changes needed? Simplify the code for `ResolveInlineTables` and `ResolveInlineTablesSuite`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Test cases updated. GA tests. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44447 from beliefer/SPARK-46380_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
With this PR proposal is to do inline table resolution in two phases:
Why are the changes needed?
This PR aims to solve two problems with inline tables.
Example1:
Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1.
Example 2:
In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution.
Does this PR introduce any user-facing change?
See section above.
How was this patch tested?
New test that validates this behaviour is introduced.
Was this patch authored or co-authored using generative AI tooling?
No.