Conversation
@mbasmanova Could you help review this PR? Thanks!
cc: @rui-mo |
    {data},
    {},
    {"spark_corr(c0, c1)"},
    {"cast(a0 as BIGINT)"},
This cast computation itself is not stable; both 0 and 1 are acceptable results. M/(sqrt(A)sqrt(B)) is numerically safer than M/sqrt(AB) and should be preferred.
In Spark SQL, casting from double to int always truncates the fractional part, so cast(0.99 as int) yields 0 while cast(1.0 as int) yields 1, which are different results. We should stay consistent with Spark.
Truncating a floating point number to convert it to an integer magnifies whatever error was introduced while corr was calculated.
I agree it's unstable, but it's the current implementation in Spark. We need to align with Spark's implementation and produce the same result; otherwise customers will see mismatched results between Velox and Spark. We can remove the logic once Spark adopts Presto's implementation.
This PR is required to fix a Spark UT.
@FelixYBW Given that Spark's implementation is unstable, I expect customers are already getting inconsistent results. Have you opened an issue with Spark to fix this implementation? It doesn't seem to make sense to replicate Spark's bugs in Velox.
No customer has reported inconsistent results. The fix is to pass a Spark UT. @marin-ma Did the UT fail due to an inconsistent result?
In the customer-reported issue, Velox produces the more accurate result, so it's really a Spark bug.
@marin-ma Let's close this PR, change the expected value in the Spark UT to Velox's result, and document the difference in Gluten's docs. Later, let's submit a PR to the Spark community.
Thank you @mbasmanova and @Yuhta
The earlier discussion: #4917
FYI, as a follow-up, @liujiayi771 created a PR (https://github.com/apache/spark/pull/43711) in the Spark community. Let's see how the Spark community responds to it.
@liujiayi771 @FelixYBW Thank you for the follow-up.
The 'corr' function is implemented differently in Spark and Presto. Specifically, when computing the final result, Presto uses the formula M/(sqrt(A)sqrt(B)), while Spark uses M/sqrt(AB). Even though the difference in precision between the two formulas might seem insignificant, it can produce more noticeable biases in downstream computations, for instance when casting the result to an integer, as demonstrated in the unit test.
This PR extracts the shared calculations into
velox/functions/lib/aggregates/CovarianceAggregatesBase.h and provides a distinct implementation for Spark's "corr". Spark's implementation reference:
https://github.com/apache/spark/blob/43852733307a229944bd254f38bcc1f84bca97fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala#L124-L127
issue #4917