
BUG: handle corner cases for correlation #5301

Closed

giorgiop wants to merge 2 commits into scipy:master from giorgiop:distance-corner-cases

Conversation

giorgiop commented Oct 1, 2015

Following discussion in scikit-learn#4475, it would be nice to have corner cases handled inside scipy's correlation directly. Although correlation is not mathematically defined for a constant input vector, it makes sense to complete the definition at the limit: for vectors with zero variance, return 1, the middle point of the correlation distance's range. One vector's lack of variance says nothing about any other vector's variation.

The choice is also justified by the fact that if one vector's variance is zero, its covariance with any other vector is also 0, and therefore the correlation distance is taken to be 1.
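To make the 0/0 concrete, here is a minimal sketch (not part of the PR) of what happens inside the correlation distance formula when one input is constant: the centered vector um is all zeros, so both the numerator and the normalizing product of norms vanish, and the division produces nan.

```python
import numpy as np

u = np.array([1.0, 1.0, 1.0])   # constant vector: zero variance
v = np.array([2.0, 4.0, 6.0])

um = u - u.mean()               # all zeros, since u is constant
vm = v - v.mean()

num = np.dot(um, vm)                            # covariance-like numerator: 0.0
den = np.linalg.norm(um) * np.linalg.norm(vm)   # 0.0 as well

with np.errstate(invalid="ignore"):
    dist = 1.0 - num / den      # 1.0 - 0/0 -> nan
print(num, den, dist)
```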

I did not observe issues with other distance functions, except for yule.

@giorgiop giorgiop force-pushed the distance-corner-cases branch from 7dceeda to b90eb60 Compare October 1, 2015 13:57
@larsmans larsmans added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial labels Oct 1, 2015
giorgiop (Author) commented Oct 2, 2015

The problem extends to input vectors that are almost constant, due to floating-point issues in the division by a near-zero norm. The script

import numpy as np
from scipy.spatial.distance import correlation

X1 = np.array([1, 1, 1])
noise = np.array([.5, 0, -.01])
X2 = np.array([2, 4, 6])
for eps in np.linspace(-1e-15, .0, 50):
    print(correlation(X1 + eps * noise, X2))

outputs (on the master branch)

0.166666666667
0.147197134578   (repeated 10 times)
0.133974596216   (repeated 11 times)
0.292893218813   (repeated 22 times)
RuntimeWarning: invalid value encountered in double_scalars
  dist = 1.0 - np.dot(um, vm) / (norm(um) * norm(vm))
nan              (repeated 6 times)
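For reference, the kind of guard under discussion could look like the following sketch. The function name correlation_guarded and the tolerance tol are hypothetical, not the actual PR diff: it treats a pair of centered vectors whose norm product falls below a tolerance as the (near-)constant corner case and returns the proposed limit value of 1.

```python
import numpy as np

def correlation_guarded(u, v, tol=1e-12):
    """Correlation distance with an explicit guard for (near-)constant
    input, returning 1.0 instead of nan. Hypothetical sketch, not the
    actual scipy implementation."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    um = u - u.mean()
    vm = v - v.mean()
    denom = np.linalg.norm(um) * np.linalg.norm(vm)
    if denom < tol:            # at least one vector is (nearly) constant
        return 1.0             # proposed limit value under discussion
    return 1.0 - np.dot(um, vm) / denom

print(correlation_guarded([1, 1, 1], [2, 4, 6]))   # 1.0 instead of nan
```

The tolerance-based comparison, rather than an exact zero check, is what would also catch the near-constant inputs shown in the output above.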

A SciPy member commented (inline review):
You are iterating over each array one more time by calling ptp on them. Wouldn't it be better to store norm(um) and norm(vm), and do the check on them? Also, if both vectors are constant, is the correlation still 0 or is it 1 then?

giorgiop (Author) replied:

Strictly speaking, the correlation is not defined. But correlation is a measure based on covariance, and the covariance between anything and a constant is always 0. It might sound paradoxical, but even a constant vector has covariance 0 with itself. The division by zero comes from the normalization of this measure by the variance (which is 0 as well).

argriffing (Contributor) commented:

It's not clear to me that "BUG: handle corner case" really applies here. The correct limit of the output as the input approaches the corner depends on the direction from which it is approaching, right? I think this is a classic reason to use NaN, analogous to the reasoning behind NaN for x/y as x->0 and y->0.

giorgiop (Author) commented Oct 5, 2015

I think this is a classic reason to use NaN, analogous to the reasoning behind NaN for x/y as x->0 and y->0.

I agree. Still, I think most if not all use cases of correlation have to check whether a nan is returned, and whether that nan occurred because one of the vectors was (nearly) constant rather than for other reasons, for example a nan already present in the input vectors. Constant inputs can happen with real data, so handling that corner case is necessary for robustness. That's the kind of wrapper I am thinking of implementing to fix those issues in sklearn.

In case we do not find this "bug fix" appropriate on the scipy side, I would like correlation's docstring to be more explicit about the nan outputs. I can change the PR in this sense if you all agree.

jakevdp (Member) commented Oct 8, 2015

You're going to have to further convince me that it's correct to return 0 rather than NaN in these cases. Mathematically, the stated formula gives 0/0 -> NaN. Conceptually, the correlation distance is a measure of the linear relationship between two variables, and my intuition is that if there is no baseline to measure against then the question of linear relationship is meaningless, and NaN is as good a "missing data" marker as we have.

ev-br (Member) commented Oct 16, 2015

OK, there are two apparent -1s, and no action from OP for a week. FTR and FWIW, I agree with Jake and Alex above. The stats.stackexchange link quoted as a justification has a bit more discussion and I fail to read it as a full justification of the suggestion here. IMO, a reference to well-reputed literature could sway it. Otherwise, I'd recommend rejecting and closing this PR.

josef-pkt (Member) commented:

Note, if this were in stats, I would have recommended "don't fix", similar to the discussion for the t-test (where I was on the no nan side and got convinced otherwise).

In spatial it could be a bit more pragmatic, IMO, but I don't see a convincing argument for any specific value as alternative to nan.

giorgiop (Author) commented:

I have stopped working on this while waiting for more opinions, and decided to "fix" those issues on the scikit-learn side in the meantime.

I am fine with your decision if you want to close the PR. I suggest that those limit behaviours get documented in the docstrings, though.

giorgiop (Author) commented:

If you agree to touch the documentation of the distance functions, I can work on that.

ev-br (Member) commented Nov 9, 2015

I think clarifying the documentation would be great @giorgiop. Either as a replacement of this PR (then please also change the title), or close this PR and send another one for the documentation, whichever you prefer. Thanks!

@ev-br ev-br added the needs-work Items that are pending response from the author label Nov 9, 2015
ev-br (Member) commented Jan 18, 2016

Ok, the agreement back in October seemed to be that we'd rather keep the current behavior, and there is no activity since then. I'm then closing this as a wontfix. If at some point in time we want to reconsider, the patch is still available. Thank you @giorgiop, thanks to reviewers.

@ev-br ev-br closed this Jan 18, 2016
@ev-br ev-br added wontfix Not actionable, rejected or unplanned changes and removed needs-work Items that are pending response from the author labels Jan 18, 2016


7 participants