Add corr Spark function by marin-ma · Pull Request #7204 · facebookincubator/velox

marin-ma · 2023-10-24T05:45:34Z

The 'corr' function is implemented differently in Spark and Presto. Specifically, when computing the final result, Presto uses the formula M/(sqrt(A)sqrt(B)), while Spark uses M/sqrt(AB). Although the precision difference between the two formulas may seem insignificant, it can lead to more noticeable discrepancies in subsequent computations, for instance when the result is cast to an integer, as demonstrated in the unit test.
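To make the two final-step formulas concrete, here is a minimal sketch. The names `CorrSums`, `accumulate`, `corrPrestoStyle`, and `corrSparkStyle` are hypothetical and are not taken from Velox or Spark; M, A, and B are assumed to be the co-moment and the two sums of squared deviations, and the simple two-pass accumulation is used only for readability, not because either engine computes it that way.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical holder for the three quantities named in the description:
//   m = sum((x - meanX) * (y - meanY))
//   a = sum((x - meanX)^2)
//   b = sum((y - meanY)^2)
struct CorrSums {
  double m;
  double a;
  double b;
};

// Two-pass accumulation, chosen for clarity only.
// Assumes xs and ys are non-empty and have the same length.
CorrSums accumulate(
    const std::vector<double>& xs,
    const std::vector<double>& ys) {
  const double n = static_cast<double>(xs.size());
  double meanX = 0.0;
  double meanY = 0.0;
  for (std::size_t i = 0; i < xs.size(); ++i) {
    meanX += xs[i];
    meanY += ys[i];
  }
  meanX /= n;
  meanY /= n;

  CorrSums sums{0.0, 0.0, 0.0};
  for (std::size_t i = 0; i < xs.size(); ++i) {
    const double dx = xs[i] - meanX;
    const double dy = ys[i] - meanY;
    sums.m += dx * dy;
    sums.a += dx * dx;
    sums.b += dy * dy;
  }
  return sums;
}

// Final step as described for Presto: M / (sqrt(A) * sqrt(B)).
double corrPrestoStyle(const CorrSums& s) {
  return s.m / (std::sqrt(s.a) * std::sqrt(s.b));
}

// Final step as described for Spark: M / sqrt(A * B).
double corrSparkStyle(const CorrSums& s) {
  return s.m / std::sqrt(s.a * s.b);
}
```

The two functions are mathematically equivalent, so for well-scaled inputs their results should agree to within a few units in the last place. Whether they are bit-identical depends on floating-point rounding in the square roots and the product.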

This PR extracts the shared calculations into velox/functions/lib/aggregates/CovarianceAggregatesBase.h and provides a distinct implementation for Spark's "corr".

Spark's implementation reference:
https://github.com/apache/spark/blob/43852733307a229944bd254f38bcc1f84bca97fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala#L124-L127

issue #4917

netlify · 2023-10-24T05:45:40Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`a7ce90b`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/65375a01b1897f0008ec8259

marin-ma · 2023-10-24T05:47:15Z

@mbasmanova Could you help to review this PR? Thanks!

marin-ma · 2023-10-24T06:15:39Z

cc: @rui-mo

Yuhta · 2023-10-24T19:43:44Z

velox/functions/sparksql/aggregates/tests/CorrelationAggregationTest.cpp

+      {data},
+      {},
+      {"spark_corr(c0, c1)"},
+      {"cast(a0 as BIGINT)"},


This cast computation itself is not stable; both 0 and 1 are acceptable results. M/(sqrt(A)sqrt(B)) is numerically safer than M/sqrt(AB) and should be preferred.
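One concrete way the separate-roots form can be safer is sketched below. This is an illustration under stated assumptions and not necessarily the reviewer's only reason: the product A*B can overflow to infinity even when A and B are both finite, whereas sqrt(A) and sqrt(B) stay in range. The function names `ratioSeparateRoots` and `ratioProductRoot` are hypothetical.

```cpp
#include <cmath>

// M / (sqrt(A) * sqrt(B)): each root is taken before multiplying, so the
// intermediate values stay far from the overflow threshold of double.
double ratioSeparateRoots(double m, double a, double b) {
  return m / (std::sqrt(a) * std::sqrt(b));
}

// M / sqrt(A * B): the product a * b is formed first and can overflow to
// infinity for large finite inputs.
double ratioProductRoot(double m, double a, double b) {
  return m / std::sqrt(a * b);
}
```

With m = a = b = 1e200, the product a * b exceeds the largest finite double, so `ratioProductRoot` divides by infinity and returns 0, while `ratioSeparateRoots` returns a value close to 1.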

In Spark SQL, a cast from double to int always truncates the fractional part, so cast(0.99 as int) returns 0 and cast(1.0 as int) returns 1, which are different results. We should stay consistent with Spark.

Truncating a floating-point number to convert it to an integer amplifies the error by $1/\epsilon$, so it is not stable. The calculation here is bound to be inaccurate, and that has nothing to do with how corr is calculated.
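The instability discussed in the two comments above can be sketched as follows. The helper names `truncateToBigint` and `justBelowOne` are hypothetical; the sketch only assumes that the cast truncates toward zero, as the thread describes, and that the input is finite and within the range of `int64_t`.

```cpp
#include <cmath>
#include <cstdint>

// Truncates toward zero, dropping the fractional part.
// Assumes value is finite and fits in int64_t.
int64_t truncateToBigint(double value) {
  return static_cast<int64_t>(value);
}

// The largest double strictly less than 1.0, that is, a value that differs
// from 1.0 only in the last bit of its significand.
double justBelowOne() {
  return std::nextafter(1.0, 0.0);
}
```

Here `truncateToBigint(1.0)` is 1 while `truncateToBigint(justBelowOne())` is 0, even though the two inputs differ by less than 1e-15. A rounding difference in the last bit of a correlation near 1 is therefore enough to change the integer result.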

I agree it's unstable, but it's the current implementation in Spark. We need to align with Spark's implementation and get the same result. Otherwise customers will see mismatched results between Velox and Spark. We can remove the logic once Spark uses Presto's implementation.

The PR is required to fix a Spark UT.

@FelixYBW Given that Spark's implementation is unstable, I expect customers are already getting inconsistent results. Have you opened an issue with Spark to fix this implementation? It doesn't seem to make sense to implement Spark's bugs in Velox.

No customer has reported inconsistent results. The fix is to pass a Spark UT. @marin-ma Did the UT fail due to an inconsistent result?

@FelixYBW Yes, one of the group-by Spark UTs fails. And #4917 was reported by a customer.

In the customer-reported issue, Velox gets the more accurate result, so it's more of a Spark bug.

@marin-ma Let's close the PR, change the expected value in the Spark UT to Velox's, and then document the difference in Gluten's doc. Later, let's submit a PR to the Spark community.

Thank you @mbasmanova and @Yuhta

Yuhta · 2023-10-24T19:50:36Z

The discussion before: #4917

FelixYBW · 2023-11-08T20:15:14Z

FYI, as a follow-up, @liujiayi771 created a PR (https://github.com/apache/spark/pull/43711) in the Spark community. Let's see how the Spark community responds.

mbasmanova · 2023-11-08T20:28:25Z

@liujiayi771 @FelixYBW Thank you for the follow-up.

extract sparksql corr

a7ce90b

facebook-github-bot added the CLA Signed label Oct 24, 2023

mbasmanova requested review from Yuhta, aditi-pandit and spershin October 24, 2023 19:35

mbasmanova changed the title ~~[VL] Extract "corr" for different implementation in spark~~ Add Spark-specific corr function Oct 24, 2023

mbasmanova changed the title ~~Add Spark-specific corr function~~ Add corr Spark function Oct 24, 2023

Yuhta reviewed Oct 24, 2023


marin-ma mentioned this pull request Nov 20, 2023

[VL] Remove corr in group-by.sql and separate supported SQLQueryTest list for backends apache/gluten#3774

Merged

marin-ma closed this Nov 28, 2023

FelixYBW mentioned this pull request Feb 7, 2026

[VL] useful Velox PRs not merged into upstream apache/gluten#11585

Open
