[SPARK-46380][SQL]Replace current time/date prior to evaluating inline table expressions. by dbatomic · Pull Request #44316 · apache/spark

dbatomic · 2023-12-12T13:57:01Z

What changes were proposed in this pull request?

This PR proposes to do inline table resolution in two phases:

1) If there are no expressions that depend on the current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.), they will be evaluated as part of the ResolveInlineTables rule.
2) Expressions that do depend on CURRENT_* evaluation will be kept as expressions, and their evaluation will be delayed to the post-analysis phase.
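
The two phases can be sketched with a small, self-contained toy model. This is only an illustration of the idea, not the code in this PR: the types `Expr`, `Lit`, `CurrentTimestamp`, `Add`, `Evaluated` and `Deferred`, and the object `TwoPhaseSketch`, are hypothetical stand-ins and are not Catalyst classes.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
sealed trait Expr
final case class Lit(value: Long) extends Expr
case object CurrentTimestamp extends Expr // stands in for any CURRENT_* expression
final case class Add(left: Expr, right: Expr) extends Expr

sealed trait InlineTable
final case class Evaluated(rows: Seq[Seq[Long]]) extends InlineTable // already turned into values
final case class Deferred(rows: Seq[Seq[Expr]]) extends InlineTable  // still kept as expressions

object TwoPhaseSketch {
  // True if the expression tree contains a context-dependent expression.
  def dependsOnContext(e: Expr): Boolean = e match {
    case CurrentTimestamp => true
    case Add(l, r)        => dependsOnContext(l) || dependsOnContext(r)
    case Lit(_)           => false
  }

  // Evaluate an expression; `now` is the single timestamp chosen for the query.
  def eval(e: Expr, now: Long): Long = e match {
    case Lit(v)           => v
    case CurrentTimestamp => now
    case Add(l, r)        => eval(l, now) + eval(r, now)
  }

  // Phase 1: evaluate right away only when no cell depends on the context.
  // The timestamp argument is never read on this path, so 0L is a dummy value.
  def resolve(rows: Seq[Seq[Expr]]): InlineTable =
    if (rows.flatten.forall(e => !dependsOnContext(e))) {
      Evaluated(rows.map(row => row.map(e => eval(e, 0L))))
    } else {
      Deferred(rows)
    }

  // Phase 2: after analysis, evaluate deferred tables with one shared timestamp.
  def finish(table: InlineTable, queryTime: Long): Evaluated = table match {
    case e: Evaluated   => e
    case Deferred(rows) => Evaluated(rows.map(row => row.map(e => eval(e, queryTime))))
  }
}
```

In this sketch a table made only of literals is evaluated in `resolve`, while a table containing `CurrentTimestamp` stays `Deferred` until `finish` supplies the one query-wide timestamp.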

Why are the changes needed?

This PR aims to solve two problems with inline tables.

Example 1:

SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)

Prior to this change, this example would return 3 (i.e. each CURRENT_TIMESTAMP expression would return a different value, since they were evaluated individually as part of inline table evaluation). After this change, the result is 1.

Example 2:

CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP()))

In this example, the view would be saved with a literal evaluated at view creation time. After this change, CURRENT_TIMESTAMP() will be evaluated each time the view is executed.

Does this PR introduce any user-facing change?

See section above.

How was this patch tested?

New test that validates this behaviour is introduced.

Was this patch authored or co-authored using generative AI tooling?

No.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala

beliefer

I think we need a discussion.

SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)

Should the three calls of CURRENT_TIMESTAMP() have the same value?

dbatomic · 2023-12-13T14:04:41Z

That's right. All invocations of CURRENT_TIMESTAMP/CURRENT_DATE/NOW() should be replaced with a single value that represents the time of query arrival. We already do this in the majority of scenarios (e.g. when the time function appears pretty much anywhere else in the query). The bug this PR tries to solve is that we don't do this for inline tables.
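
To make the difference concrete, here is a minimal toy sketch (not Spark code). `TickingClock`, `evalEachCell` and `evalWithQueryTime` are hypothetical names invented for this illustration; the fake clock advances on every read so that the outcome is deterministic.

```scala
// Toy model only: a hypothetical clock, not Spark code.
object SingleQueryTimeSketch {
  // A fake clock that advances by one tick on every read.
  final class TickingClock(start: Long) {
    private var t = start
    def read(): Long = { val now = t; t += 1; now }
  }

  // Inline-table behaviour before the fix: every cell reads the clock on its own.
  // Seq.fill takes its element by name, so clock.read() runs once per cell.
  def evalEachCell(cells: Int, clock: TickingClock): Seq[Long] =
    Seq.fill(cells)(clock.read())

  // Behaviour after the fix: read the clock once and reuse that value for every cell.
  def evalWithQueryTime(cells: Int, clock: TickingClock): Seq[Long] = {
    val queryTime = clock.read()
    Seq.fill(cells)(queryTime)
  }
}
```

With three cells, counting the distinct values of `evalEachCell` gives 3 and of `evalWithQueryTime` gives 1, which mirrors the COUNT(DISTINCT ct) results described in Example 1.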

beliefer · 2023-12-14T08:27:58Z

I checked in Postgres.
select CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP;
The output:

current_timestamp            |current_timestamp            |current_timestamp            |
-----------------------------+-----------------------------+-----------------------------+
2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|

dbatomic · 2023-12-14T09:36:36Z

Yep, just to illustrate that replacement happens for inline tables:

postgres=# SELECT * FROM (VALUES(EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp)));

      column1      |      column2      |      column3
-------------------+-------------------+-------------------
 1702546463.398428 | 1702546463.398428 | 1702546463.398428

1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.

srielau · 2023-12-15T16:35:02Z

I agree with the semantic changes. General comment, though:
VALUES in its current implementation is much too restrictive.
Will this PR move us closer to allowing arbitrary expressions (including non determinism and correlation)?
E.g. the following cannot be done in VALUES today:
scala> spark.sql("VALUES (rand())").show()
org.apache.spark.sql.AnalysisException: [INVALID_INLINE_TABLE.CANNOT_EVALUATE_EXPRESSION_IN_INLINE_TABLE] Invalid inline table. Cannot evaluate the expression "rand()" in inline table definition. SQLSTATE: 42000; line 1 pos 8

or:
SELECT pk, c FROM T, LATERAL(VALUES(T.c1), (T.c2)) AS unpivot(c)

dbatomic · 2023-12-15T17:08:37Z

Yeah, maybe I can use this PR to deal with the rand() problem as well (i.e. allow non-deterministic expressions in VALUES), because it is rather similar to the CURRENT_* issue.

cloud-fan · 2023-12-19T06:37:03Z

sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala

+      """SELECT COUNT(DISTINCT ct) FROM VALUES
+        | CURRENT_TIMESTAMP(),
+        | CURRENT_TIMESTAMP(),
+        | CURRENT_TIMESTAMP() as data(ct)""".stripMargin), Row(1))


This seems not needed as we have the same test in the golden file

cloud-fan · 2023-12-19T13:11:47Z

sql/core/src/test/resources/sql-tests/results/inline-table.sql.out

+
+
+-- !query
+select count(distinct ct) from values now(), now(), now() as data(ct)


We can add a test for CURRENT_TIMESTAMP and remove https://github.com/apache/spark/pull/44316/files#r1430980065

cloud-fan · 2023-12-19T13:12:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala

+      def earlyEvalPossible =
+        table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
+      if (earlyEvalPossible) EvalInlineTables(table) else table


Suggested change

-      def earlyEvalPossible =
-        table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
-      if (earlyEvalPossible) EvalInlineTables(table) else table
+      val earlyEvalPossible = table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
+      if (earlyEvalPossible) EvalInlineTables(table) else table

cloud-fan · 2023-12-19T13:13:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala

+ */
+object EvalInlineTables extends Rule[LogicalPlan] with CastSupport {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithSubqueriesAndPruning(
+    AlwaysProcess.fn, ruleId) {


let's add pruning for ResolvedInlineTable

beliefer

LGTM.

beliefer · 2023-12-20T13:33:31Z

sql/core/src/test/resources/sql-tests/inputs/inline-table.sql

+select count(distinct ct) from values now(), now(), now() as data(ct);
+
+-- current_timestamp() should be kept as tempResolved inline expression.
+select count(distinct ct) from values current_timestamp(), current_timestamp() as data(ct);


Shall we add tests that mix current_timestamp with other deterministic functions?

it's testing the correct value using count distinct.

…mizer/finishAnalysis.scala Co-authored-by: Jiaan Geng <beliefer@163.com>

beliefer · 2023-12-21T01:45:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala

-      InternalRow.fromSeq(row.zipWithIndex.map { case (e, ci) =>
-        val targetType = fields(ci).dataType
-        try {
+    val castedRows: Seq[Seq[Expression]] = table.rows.map { row =>


~~It seems we only need the Seq[Expression] here.~~

it's a table (rows X columns)

I know that. Do you mean the columns for each row are different?

I got it now. Thank you!

beliefer · 2023-12-21T01:46:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+ * @param output list of column attributes
+ * @param rows expressions for the data rows
+ */
+case class ResolvedInlineTable(rows: Seq[Seq[Expression]], output: Seq[Attribute])


~~Shall we simplify rows: Seq[Seq[Expression]] to exprs: Seq[Expression]?~~

@dbatomic After reviewing this PR again, I'm sorry for the above comment.

cloud-fan · 2023-12-21T07:57:19Z

The failed test is unrelated, thanks, merging to master/3.5!

…ne table expressions

Closes #44316 from dbatomic/inline_tables_curr_time_fix.

Lead-authored-by: Aleksandar Tomic <aleksandar.tomic@databricks.com>
Co-authored-by: Aleksandar Tomic <150942779+dbatomic@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5fe963f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…s and ResolveInlineTablesSuite

### What changes were proposed in this pull request?
#44316 replaced current time/date prior to evaluating inline table expressions. This PR proposes to simplify the code for `ResolveInlineTables` and let `ResolveInlineTablesSuite` apply the rule `ResolveInlineTables`.

### Why are the changes needed?
Simplify the code for `ResolveInlineTables` and `ResolveInlineTablesSuite`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Test cases updated. GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44447 from beliefer/SPARK-46380_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Making sure that time resolution is done prior to inline table eval

6757409

github-actions bot added the SQL label Dec 12, 2023

dbatomic mentioned this pull request Dec 12, 2023

[SPARK-46331][SQL] Removing CodegenFallback from subset of DateTime expressions and version() expression #44261

Closed

MaxGekk reviewed Dec 12, 2023

cloud-fan reviewed Dec 12, 2023

beliefer reviewed Dec 13, 2023

dbatomic added 3 commits December 14, 2023 17:05

Splitting inline table resolution in two parts:

ed33d5e

1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.

Splitting inline table resolution in two parts:

0127596

1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.

Fixing inline view test

5e24fe1

cloud-fan reviewed Dec 15, 2023

Minor polishing

f019b9c