[SPARK-45834][SQL] Make Pearson correlation calculation more stable #43711
liujiayi771 wants to merge 2 commits into apache:master from
Conversation
Force-pushed from e18780e to c4e262d
CC: @FelixYBW

It seems the test failure is related to this PR? https://github.com/liujiayi771/spark/actions/runs/6794362848/job/18471276466

cc @beliefer FYI

@LuciferYang The modifications have changed the precision of the calculation results. I will fix them all.

@liujiayi771 Thank you for the fix. I will take a look later.
The potential side effect of this modification is a small loss of precision in the last digit, because the square root is now taken twice. For example:
Math.sqrt(2 * 2) = 2.0
Math.sqrt(2) * Math.sqrt(2) = 2.0000000000000004
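A minimal standalone Java sketch (not Spark code; the class name is made up for illustration) that reproduces the precision difference described above:

```java
public class SqrtPrecision {
    public static void main(String[] args) {
        // Taking the square root of the product stays exact for this input.
        double combined = Math.sqrt(2 * 2);
        // Splitting the square root introduces a one-ulp rounding error.
        double split = Math.sqrt(2) * Math.sqrt(2);
        System.out.println(combined); // 2.0
        System.out.println(split);    // 2.0000000000000004
    }
}
```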
  struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
  -- !query output
- 1.0                 1.0                 3
+ 0.9999999999999999  0.9999999999999999  3
I guess 0.9999999999999999 is incorrect.
The result is not incorrect; it is just a double-precision issue. For example:
2 / Math.sqrt(2 * 2) = 1.0
2 / Math.sqrt(2) / Math.sqrt(2) = 0.9999999999999999
From the user's perspective, 1.0 is more user-friendly. I am currently unsure whether to sacrifice user-friendliness in order to support an extreme case.
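The arithmetic in this comment can be checked with a small standalone Java snippet (the class name is made up; this is not Spark's aggregation code):

```java
public class CorrPrecision {
    public static void main(String[] args) {
        // Old formula shape: one square root over the product.
        System.out.println(2 / Math.sqrt(2 * 2));            // 1.0
        // Proposed formula shape: two separate square roots.
        System.out.println(2 / Math.sqrt(2) / Math.sqrt(2)); // 0.9999999999999999
    }
}
```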
You could check the output of other mainstream databases for reference.
@beliefer Both calculation formulas appear among mainstream databases. But I now believe there is no need to modify Spark's code for this extreme case, because Spark's current formula more easily produces exact results for ordinary inputs.
Thanks, I will close this PR.

What changes were proposed in this pull request?
Modify the calculation formula of Pearson correlation.
Why are the changes needed?
Spark uses the formula ck / sqrt(xMk * yMk) to calculate the Pearson correlation coefficient. If xMk and yMk are very small, the multiplication xMk * yMk can underflow to zero in double precision, resulting in a denominator of 0 and an Infinity result. For example, when calculating the correlation for the same columns a and b in the following table, the result will be Infinity, but the correlation of identical columns should be 1.0.
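A small standalone Java sketch (hypothetical values, not Spark's actual aggregation code) showing how the current formula blows up when the running moments xMk and yMk are tiny:

```java
public class CorrUnderflow {
    public static void main(String[] args) {
        // Running moments for two identical, extremely small columns;
        // for identical columns ck == xMk == yMk (values chosen for illustration).
        double ck  = 1e-200;
        double xMk = 1e-200;
        double yMk = 1e-200;

        // xMk * yMk == 1e-400 underflows to 0.0 in double precision,
        // so the denominator is 0 and the result is Infinity.
        double corr = ck / Math.sqrt(xMk * yMk);
        System.out.println(corr); // Infinity
    }
}
```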
Modifying the formula to ck / sqrt(xMk) / sqrt(yMk) solves this issue and improves the stability of the calculation. The benefit of this modification is that it splits the square root of the denominator into two parts, sqrt(xMk) and sqrt(yMk), which avoids the case where the product of extremely small values underflows to zero.

Does this PR introduce any user-facing change?
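With the same hypothetical values as in the problem description, the split-sqrt form stays finite, since each intermediate square root (about 1e-100) is well within double range. A standalone Java sketch, not Spark's actual code:

```java
public class CorrSplitSqrt {
    public static void main(String[] args) {
        // Same illustrative running moments as before.
        double ck  = 1e-200;
        double xMk = 1e-200;
        double yMk = 1e-200;

        // Each square root is ~1e-100, so no underflow occurs
        // and the result is finite (approximately 1.0).
        double corr = ck / Math.sqrt(xMk) / Math.sqrt(yMk);
        System.out.println(corr);
    }
}
```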
The last digit of some results may change. For example, the corr result for two columns that used to be exactly 1.0 may now become 0.9999999999999999.
How was this patch tested?
Add a unit test case in DataFrameStatSuite.
Was this patch authored or co-authored using generative AI tooling?
No