
BUG: handle corner cases for correlation #5301

Closed

giorgiop wants to merge 2 commits into scipy:master from giorgiop:distance-corner-cases

Conversation

giorgiop commented Oct 1, 2015

Following discussion in scikit-learn#4475, it would be nice to have corner cases handled inside scipy's correlation directly. Although correlation is not mathematically defined for a constant input vector, it makes sense to complete the definition at the limit: for vectors with zero variance, return 1, the middle point of the correlation distance's range. One vector's lack of variance says nothing about any other vector's variation.

The choice is also justified by the fact that if one vector's variance is zero, its covariance with any other vector is also 0, and therefore the correlation distance is taken to be 1.
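To make the 0/0 concrete, here is a minimal sketch (not part of the PR) of what happens inside the correlation distance formula when one input is constant: the centered vector um is all zeros, so both the numerator and the normalizing product of norms vanish, and the division produces nan.

```python
import numpy as np

u = np.array([1.0, 1.0, 1.0])   # constant vector: zero variance
v = np.array([2.0, 4.0, 6.0])

um = u - u.mean()               # all zeros, since u is constant
vm = v - v.mean()

num = np.dot(um, vm)                            # covariance-like numerator: 0.0
den = np.linalg.norm(um) * np.linalg.norm(vm)   # 0.0 as well

with np.errstate(invalid="ignore"):
    dist = 1.0 - num / den      # 1.0 - 0/0 -> nan
print(num, den, dist)
```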

I did not observe issues with other distance functions, except for yule.

@giorgiop giorgiop force-pushed the distance-corner-cases branch from 7dceeda to b90eb60 Compare October 1, 2015 13:57
@larsmans larsmans added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial labels Oct 1, 2015
giorgiop (Author) commented Oct 2, 2015

The problem extends to input vectors that are almost constant, due to floating-point issues in the division by a near-zero norm. The script

import numpy as np
from scipy.spatial.distance import correlation

X1 = np.array([1, 1, 1])
noise = np.array([.5, 0, -.01])
X2 = np.array([2, 4, 6])
for eps in np.linspace(-1e-15, .0, 50):
    print(correlation(X1 + eps * noise, X2))

outputs (on the master branch)

0.166666666667
0.147197134578   (repeated 10 times)
0.133974596216   (repeated 11 times)
0.292893218813   (repeated 22 times)
RuntimeWarning: invalid value encountered in double_scalars
  dist = 1.0 - np.dot(um, vm) / (norm(um) * norm(vm))
nan              (repeated 6 times)
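For reference, the kind of guard under discussion could look like the following sketch. The function name correlation_guarded and the tolerance tol are hypothetical, not the actual PR diff: it treats a pair of centered vectors whose norm product falls below a tolerance as the (near-)constant corner case and returns the proposed limit value of 1.

```python
import numpy as np

def correlation_guarded(u, v, tol=1e-12):
    """Correlation distance with an explicit guard for (near-)constant
    input, returning 1.0 instead of nan. Hypothetical sketch, not the
    actual scipy implementation."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    um = u - u.mean()
    vm = v - v.mean()
    denom = np.linalg.norm(um) * np.linalg.norm(vm)
    if denom < tol:            # at least one vector is (nearly) constant
        return 1.0             # proposed limit value under discussion
    return 1.0 - np.dot(um, vm) / denom

print(correlation_guarded([1, 1, 1], [2, 4, 6]))   # 1.0 instead of nan
```

The tolerance-based comparison, rather than an exact zero check, is what would also catch the near-constant inputs shown in the output above.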

A SciPy member commented (inline review):
You are iterating over each array one more time by calling ptp on them. Wouldn't it be better to store norm(um) and norm(vm), and do the check on them? Also, if both vectors are constant, is the correlation still 0 or is it 1 then?

giorgiop (Author) replied:

Strictly speaking, the correlation is not defined. But correlation is a measure based on covariance, and the covariance between anything and a constant is always 0. It might sound paradoxical, but even a constant vector has covariance 0 with itself. The division by zero comes from the normalization of this measure by the variance (which is 0 as well).

argriffing (Contributor) commented:

It's not clear to me that "BUG: handle corner case" really applies here. The correct limit of the output as the input approaches the corner depends on the direction from which it is approaching, right? I think this is a classic reason to use NaN, analogous to the reasoning behind NaN for x/y as x->0 and y->0.

giorgiop (Author) commented Oct 5, 2015

I think this is a classic reason to use NaN, analogous to the reasoning behind NaN for x/y as x->0 and y->0.

I agree. Still, I think most if not all use cases of correlation have to check whether a nan is returned, and whether that nan occurred because one of the vectors was (nearly) constant rather than for other reasons, for example a nan already present in the input vectors. Constant inputs can happen with real data, so handling that corner case is necessary for robustness. That's the kind of wrapper I am thinking of implementing to fix those issues in sklearn.

In case we do not find this "bug fix" appropriate on the scipy side, I would like correlation's docstring to be more explicit about the nan outputs. I can change the PR in this sense if you all agree.

jakevdp (Member) commented Oct 8, 2015

You're going to have to further convince me that it's correct to return 0 rather than NaN in these cases. Mathematically, the stated formula gives 0/0 -> NaN. Conceptually, the correlation distance is a measure of the linear relationship between two variables, and my intuition is that if there is no baseline to measure against then the question of linear relationship is meaningless, and NaN is as good a "missing data" marker as we have.

ev-br (Member) commented Oct 16, 2015

OK, there are two apparent -1s, and no action from OP for a week. FTR and FWIW, I agree with Jake and Alex above. The stats.stackexchange link quoted as a justification has a bit more discussion and I fail to read it as a full justification of the suggestion here. IMO, a reference to well-reputed literature could sway it. Otherwise, I'd recommend rejecting and closing this PR.

josef-pkt (Member) commented:

Note, if this were in stats, I would have recommended "don't fix", similar to the discussion for the t-test (where I was on the no nan side and got convinced otherwise).

In spatial it could be a bit more pragmatic, IMO, but I don't see a convincing argument for any specific value as alternative to nan.

giorgiop (Author) commented:

I have stopped working on this while waiting for more opinions, and decided to "fix" those issues on the scikit-learn side in the meantime.

I am fine with your decision if you want to close the PR. I suggest that those limit behaviours get documented in the docstrings, though.

giorgiop (Author) commented:

If you agree to touch the documentation of the distance functions, I can work on that.

ev-br (Member) commented Nov 9, 2015

I think clarifying the documentation would be great @giorgiop. Either as a replacement of this PR (then please also change the title), or close this PR and send another one for the documentation, whichever you prefer. Thanks!

@ev-br ev-br added the needs-work Items that are pending response from the author label Nov 9, 2015
ev-br (Member) commented Jan 18, 2016

Ok, the agreement back in October seemed to be that we'd rather keep the current behavior, and there is no activity since then. I'm then closing this as a wontfix. If at some point in time we want to reconsider, the patch is still available. Thank you @giorgiop, thanks to reviewers.

@ev-br ev-br closed this Jan 18, 2016
@ev-br ev-br added wontfix Not actionable, rejected or unplanned changes and removed needs-work Items that are pending response from the author labels Jan 18, 2016


7 participants