Skip to content

Add corr Spark function#7204

Closed
marin-ma wants to merge 1 commit intofacebookincubator:mainfrom
marin-ma:spark-sql-corr
Closed

Add corr Spark function#7204
marin-ma wants to merge 1 commit intofacebookincubator:mainfrom
marin-ma:spark-sql-corr

Conversation

@marin-ma
Copy link
Copy Markdown
Collaborator

The 'corr' function is implemented differently in Spark and Presto. Specifically, when computing the final result, Presto uses the formula M/(sqrt(A)sqrt(B)), while Spark uses M/sqrt(AB). Even though the difference in precision between the two formulas might seem insignificant, it can result in more noticeable biases with further computations. For instance, when casting to an integer, as demonstrated in the unit test.

This PR extract shared calculations into velox/functions/lib/aggregates/CovarianceAggregatesBase.h, and provide a distinct implementation for spark "corr".

Spark's implementation reference:
https://github.com/apache/spark/blob/43852733307a229944bd254f38bcc1f84bca97fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala#L124-L127

issue #4917

@netlify
Copy link
Copy Markdown

netlify bot commented Oct 24, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a7ce90b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65375a01b1897f0008ec8259

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 24, 2023
@marin-ma
Copy link
Copy Markdown
Collaborator Author

@mbasmanova Could you help to review this PR? Thanks!

@marin-ma
Copy link
Copy Markdown
Collaborator Author

cc: @rui-mo

@mbasmanova mbasmanova changed the title [VL] Extract "corr" for different implementation in spark Add Spark-specific corr function Oct 24, 2023
@mbasmanova mbasmanova changed the title Add Spark-specific corr function Add corr Spark function Oct 24, 2023
{data},
{},
{"spark_corr(c0, c1)"},
{"cast(a0 as BIGINT)"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast computation itself is not stable, both 0 or 1 are acceptable results. M/(sqrt(A)sqrt(B)) is numerically safer than M/sqrt(AB) and should be preferred.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sparksql, cast from double to int always truncate the fractional part, so cast(0.99 as int) gets 0, cast(1.0 as int) gets 1, which are different results. We should keep consistent with spark.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truncating on floating point number to convert it to integer enlarges the error by $1/\epsilon$ so is not stable. The calculation here is deemed to be inaccurate and have nothing to do with how corr is calculated

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's unstable but it's the current implementation in Spark. We need to align with Spark's implementation and get the same result. Otherwise customer will see mismatched result in Velox and Spark. We can remove the logic once Spark use Presto's implementation.

The PR is to required to fix a Spark UT.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FelixYBW Given that Spark's implementation is unstable, I expect customers already getting inconsistent results. Have you opened an issue with Spark to fix this implementation? It doesn't seem to make sense to implement Spark's bugs in Velox.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No customer report the inconsistent results. The fix is to pass a Spark UT. @marin-ma Is the UT failed due to inconsistent result?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FelixYBW Yes, one of group-by Spark UT fails. And #4917 was reported by the customer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the customer reported issue Velox get more accurate result. So it's more spark's bug.

@marin-ma Let's close the PR, change the expected value as Velox's in Spark UT, then mark the difference in Gluten's doc. Later let's submit a PR to Spark community.

Thank you @mbasmanova and @Yuhta

@Yuhta
Copy link
Copy Markdown
Contributor

Yuhta commented Oct 24, 2023

The discussion before: #4917

@FelixYBW
Copy link
Copy Markdown

FelixYBW commented Nov 8, 2023

FYI, as follow up, @liujiayi771 created a PR[https://github.com/apache/spark/pull/43711] in spark community. Let's see how Spark community comments this

@mbasmanova
Copy link
Copy Markdown
Contributor

@liujiayi771 @FelixYBW Thank you for the follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants