
[SPARK-45834][SQL] Fix Pearson correlation calculation more stable #43711

Closed

liujiayi771 wants to merge 2 commits into apache:master from liujiayi771:corr-accuracy

Conversation

@liujiayi771 (Contributor) commented Nov 8, 2023

What changes were proposed in this pull request?

Modify the calculation formula of Pearson correlation.

Why are the changes needed?

Spark uses the formula ck / sqrt(xMk * yMk) to calculate the Pearson correlation coefficient. If xMk and yMk are very small, the product xMk * yMk can underflow to 0 in double precision, making the denominator 0 and producing an Infinity result.

For example, when calculating the correlation for the same columns a and b in the following table, the result will be Infinity, but the correlation for identical columns should be 1.0 instead.

a        b
1e-200   1e-200
1e-200   1e-200
1e-100   1e-100
scala> val tinyDouble = Seq(1e-200, 1e-200, 1e-100)
tinyDouble: Seq[Double] = List(1.0E-200, 1.0E-200, 1.0E-100)

scala> val df3 = tinyDouble.zip(tinyDouble).toDF("a", "b")
df3: org.apache.spark.sql.DataFrame = [a: double, b: double]

scala> df3.stat.corr("a", "b", "pearson")
res0: Double = Infinity

Modifying the formula to ck / sqrt(xMk) / sqrt(yMk) solves this issue and improves the numerical stability of the calculation. Splitting the square root of the denominator into sqrt(xMk) and sqrt(yMk) avoids both underflow (the product of extremely small values flushing to zero) and overflow of the intermediate product xMk * yMk.
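The underflow scenario can be illustrated with a minimal standalone Java sketch (the names corrOld/corrNew and the value 1e-250 are hypothetical illustrations, not Spark code):

```java
public class CorrStability {
    // Previous formula: ck / sqrt(xMk * yMk); the product can underflow to 0.
    static double corrOld(double ck, double xMk, double yMk) {
        return ck / Math.sqrt(xMk * yMk);
    }

    // Proposed formula: ck / sqrt(xMk) / sqrt(yMk); each sqrt stays representable.
    static double corrNew(double ck, double xMk, double yMk) {
        return ck / Math.sqrt(xMk) / Math.sqrt(yMk);
    }

    public static void main(String[] args) {
        double ck = 1e-250, xMk = 1e-250, yMk = 1e-250;
        // 1e-250 * 1e-250 = 1e-500 underflows to 0.0, so the old formula divides by 0:
        System.out.println(corrOld(ck, xMk, yMk)); // Infinity
        // sqrt(1e-250) ~= 1e-125 is representable, so the new formula stays close to 1.0:
        System.out.println(corrNew(ck, xMk, yMk));
    }
}
```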

Does this PR introduce any user-facing change?

The last decimal digit of some results may change. For example, a corr result that used to be exactly 1.0 may now be 0.9999999999999999.

How was this patch tested?

Add a unit test case in DataFrameStatSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 8, 2023
@liujiayi771 (Contributor, author):

CC: @FelixYBW

@LuciferYang (Contributor):

It seems the test failure is related to this PR?

https://github.com/liujiayi771/spark/actions/runs/6794362848/job/18471276466


@LuciferYang (Contributor):

cc @beliefer FYI

@liujiayi771 (Contributor, author):

@LuciferYang The modification changes the precision of some calculation results; I will update all the affected test cases.

@beliefer (Contributor) commented Nov 8, 2023

@liujiayi771 Thank you for the fix. I will see later.

@liujiayi771 (Contributor, author):

The potential side effect of this modification is that sqrt(xMk * yMk) more easily yields an exact result, while sqrt(xMk) * sqrt(yMk) can pick up an extra rounding error per factor, for example,

Math.sqrt(2 * 2) = 2.0
Math.sqrt(2) * Math.sqrt(2) = 2.0000000000000004

struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
-- !query output
1.0 1.0 3
0.9999999999999999 0.9999999999999999 3
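The rounding difference noted above can be reproduced outside Spark; since only Math.sqrt is involved, a plain Java snippet (rather than the Spark shell) suffices:

```java
public class SqrtRounding {
    public static void main(String[] args) {
        // Taking the square root of the exact product 4 stays exact:
        System.out.println(Math.sqrt(2 * 2));            // 2.0
        // Splitting the root adds a rounding step for each factor:
        System.out.println(Math.sqrt(2) * Math.sqrt(2)); // 2.0000000000000004
    }
}
```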
@beliefer (Contributor):

I guess 0.9999999999999999 is incorrect.

@liujiayi771 (Contributor, author) replied Nov 9, 2023:

The result is not incorrect; it is just a double-precision rounding issue. For example,

2 / Math.sqrt(2 * 2) = 1.0
2 / Math.sqrt(2) / Math.sqrt(2) = 0.9999999999999999

From the user's perspective, 1.0 is the friendlier output.
I am currently unsure whether to sacrifice that friendliness in order to support an extreme case.
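The two divisions quoted in the comment above can likewise be checked in plain Java:

```java
public class CorrRoundingDemo {
    public static void main(String[] args) {
        // One combined square root: the denominator sqrt(4) = 2.0 is exact.
        System.out.println(2 / Math.sqrt(2 * 2));            // 1.0
        // Two separate square roots: each division rounds once more.
        System.out.println(2 / Math.sqrt(2) / Math.sqrt(2)); // 0.9999999999999999
    }
}
```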

@beliefer (Contributor):

You can reference the output of other mainstream databases.

@liujiayi771 (Contributor, author):

@beliefer Both formulas appear across mainstream databases. But I now believe there is no need to modify Spark's code for this extreme case, because Spark's current formula more often yields exact decimal results.
Thanks, I will close this PR.
