
[SPARK-45834][SQL] Fix Pearson correlation calculation more stable #43711

Closed

liujiayi771 wants to merge 2 commits into apache:master from liujiayi771:corr-accuracy

Conversation

@liujiayi771 (Contributor) commented Nov 8, 2023

What changes were proposed in this pull request?

Modify the calculation formula of Pearson correlation.

Why are the changes needed?

Spark uses the formula ck / sqrt(xMk * yMk) to calculate the Pearson correlation coefficient. If xMk and yMk are very small, the product xMk * yMk can underflow to 0 in double precision, making the denominator 0 and producing an Infinity result.

For example, when calculating the correlation for the same columns a and b in the following table, the result will be Infinity, but the correlation for identical columns should be 1.0 instead.

a        b
1e-200   1e-200
1e-200   1e-200
1e-100   1e-100
scala> val tinyDouble = Seq(1e-200, 1e-200, 1e-100)
tinyDouble: Seq[Double] = List(1.0E-200, 1.0E-200, 1.0E-100)

scala> val df3 = tinyDouble.zip(tinyDouble).toDF("a", "b")
df3: org.apache.spark.sql.DataFrame = [a: double, b: double]

scala> df3.stat.corr("a", "b", "pearson")
res0: Double = Infinity

Modifying the formula to ck / sqrt(xMk) / sqrt(yMk) solves this issue and improves the numerical stability of the calculation. Splitting the square root of the denominator into sqrt(xMk) and sqrt(yMk) avoids both underflow (the product of extremely small values flushing to zero) and overflow of the intermediate product xMk * yMk.
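The underflow scenario can be illustrated with a minimal standalone Java sketch (the names corrOld/corrNew and the value 1e-250 are hypothetical illustrations, not Spark code):

```java
public class CorrStability {
    // Previous formula: ck / sqrt(xMk * yMk); the product can underflow to 0.
    static double corrOld(double ck, double xMk, double yMk) {
        return ck / Math.sqrt(xMk * yMk);
    }

    // Proposed formula: ck / sqrt(xMk) / sqrt(yMk); each sqrt stays representable.
    static double corrNew(double ck, double xMk, double yMk) {
        return ck / Math.sqrt(xMk) / Math.sqrt(yMk);
    }

    public static void main(String[] args) {
        double ck = 1e-250, xMk = 1e-250, yMk = 1e-250;
        // 1e-250 * 1e-250 = 1e-500 underflows to 0.0, so the old formula divides by 0:
        System.out.println(corrOld(ck, xMk, yMk)); // Infinity
        // sqrt(1e-250) ~= 1e-125 is representable, so the new formula stays close to 1.0:
        System.out.println(corrNew(ck, xMk, yMk));
    }
}
```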

Does this PR introduce any user-facing change?

The last decimal digit of some results may change. For example, a corr result that used to be exactly 1.0 may now be 0.9999999999999999.

How was this patch tested?

Add a unit test case in DataFrameStatSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 8, 2023
@liujiayi771 (Contributor, author):

CC: @FelixYBW

@LuciferYang (Contributor):

It seems the test failure is related to this PR?

https://github.com/liujiayi771/spark/actions/runs/6794362848/job/18471276466


@LuciferYang (Contributor):

cc @beliefer FYI

@liujiayi771 (Contributor, author):

@LuciferYang The modification changes the precision of some calculation results; I will update all the affected test cases.

@beliefer (Contributor) commented Nov 8, 2023

@liujiayi771 Thank you for the fix. I will see later.

@liujiayi771 (Contributor, author):

The potential side effect of this modification is that sqrt(xMk * yMk) more easily yields an exact result, while sqrt(xMk) * sqrt(yMk) can pick up an extra rounding error per factor, for example,

Math.sqrt(2 * 2) = 2.0
Math.sqrt(2) * Math.sqrt(2) = 2.0000000000000004

struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
-- !query output
1.0 1.0 3
0.9999999999999999 0.9999999999999999 3
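The rounding difference noted above can be reproduced outside Spark; since only Math.sqrt is involved, a plain Java snippet (rather than the Spark shell) suffices:

```java
public class SqrtRounding {
    public static void main(String[] args) {
        // Taking the square root of the exact product 4 stays exact:
        System.out.println(Math.sqrt(2 * 2));            // 2.0
        // Splitting the root adds a rounding step for each factor:
        System.out.println(Math.sqrt(2) * Math.sqrt(2)); // 2.0000000000000004
    }
}
```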
@beliefer (Contributor):

I guess 0.9999999999999999 is incorrect.

@liujiayi771 (Contributor, author) replied Nov 9, 2023:

The result is not incorrect; it is just a double-precision rounding issue. For example,

2 / Math.sqrt(2 * 2) = 1.0
2 / Math.sqrt(2) / Math.sqrt(2) = 0.9999999999999999

From the user's perspective, 1.0 is the friendlier output.
I am currently unsure whether to sacrifice that friendliness in order to support an extreme case.
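The two divisions quoted in the comment above can likewise be checked in plain Java:

```java
public class CorrRoundingDemo {
    public static void main(String[] args) {
        // One combined square root: the denominator sqrt(4) = 2.0 is exact.
        System.out.println(2 / Math.sqrt(2 * 2));            // 1.0
        // Two separate square roots: each division rounds once more.
        System.out.println(2 / Math.sqrt(2) / Math.sqrt(2)); // 0.9999999999999999
    }
}
```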

@beliefer (Contributor):

You can reference the output of other mainstream databases.

@liujiayi771 (Contributor, author):

@beliefer Both formulas appear across mainstream databases. But I now believe there is no need to modify Spark's code for this extreme case, because Spark's current formula more often yields exact decimal results.
Thanks, I will close this PR.
