BUG: handle corner cases for correlation#5301
The problem extends to input vectors that are almost constant, due to finite-arithmetic issues in the division by zero. The script outputs (on master branch)
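As a minimal sketch (not scipy's actual implementation), the textbook formula written out directly reproduces the 0/0 for a constant input, since the centered vector is all zeros:

```python
import numpy as np

def correlation_distance(u, v):
    # Textbook correlation distance: 1 - <um, vm> / (||um|| * ||vm||).
    # For a constant input, um is the zero vector, so this divides 0 by 0.
    um = u - u.mean()
    vm = v - v.mean()
    return 1.0 - np.dot(um, vm) / (np.linalg.norm(um) * np.linalg.norm(vm))

with np.errstate(invalid="ignore"):      # silence the 0/0 RuntimeWarning
    d = correlation_distance(np.ones(5), np.arange(5.0))
print(d)  # nan
```

A nearly constant vector does not hit the exact 0/0 but divides by a tiny, rounding-dominated norm, which is the finite-arithmetic issue mentioned above.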
You are iterating over each array one more time by calling ptp on them. Wouldn't it be better to store norm(um) and norm(vm), and do the check on them? Also, if both vectors are constant, is the correlation still 0 or is it 1 then?
Strictly speaking, the correlation is not defined. But correlation is a measure based on covariance, and the covariance between anything and a constant is always 0. It may sound paradoxical, but even a constant vector has covariance 0 with itself. The division by zero comes from normalizing this measure by the variance (which is 0 as well).
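The covariance claim is easy to check numerically; a small sketch with numpy:

```python
import numpy as np

c = np.full(4, 3.0)                   # a constant vector
x = np.array([1.0, 2.0, 4.0, 8.0])    # an arbitrary non-constant vector

cov_cx = np.cov(c, x)[0, 1]   # covariance of a constant with anything: 0
cov_cc = np.cov(c, c)[0, 1]   # even a constant with itself: 0
var_c = np.var(c)             # ...but the normalizing variance is 0 too
print(cov_cx, cov_cc, var_c)  # 0.0 0.0 0.0
```

So the numerator of the correlation is 0, but so is the denominator, which is why the value is 0/0 rather than an unambiguous 0.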
It's not clear to me that "BUG: handle corner case" really applies here. The correct limit of the output as the input approaches the corner depends on the direction from which it is approached, right? I think this is a classic reason to use NaN, analogous to the reasoning behind NaN for x/y as x -> 0 and y -> 0.
I agree. Although I think that most, if not all, use cases of correlation have to check whether a [...]. In case we do not find this "bug fix" appropriate on the scipy side, I would like [...]
You're going to have to further convince me that it's correct to return 0 rather than NaN in these cases. Mathematically, the stated formula gives 0/0 -> NaN. Conceptually, the correlation distance is a measure of the linear relationship between two variables, and my intuition is that if there is no baseline to measure against then the question of linear relationship is meaningless, and NaN is as good a "missing data" marker as we have.
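A hypothetical guard that returns NaN explicitly, instead of relying on IEEE 0/0 semantics, might look like this (the function name and `eps` tolerance are assumptions for illustration, not scipy API):

```python
import numpy as np

def correlation_distance_nan(u, v, eps=0.0):
    # Hypothetical variant: if either centered vector is (near-)zero,
    # return NaN as an explicit "undefined" marker rather than let the
    # division produce it with a RuntimeWarning.
    um = u - u.mean()
    vm = v - v.mean()
    nu, nv = np.linalg.norm(um), np.linalg.norm(vm)
    if nu <= eps or nv <= eps:
        return np.nan
    return 1.0 - np.dot(um, vm) / (nu * nv)

print(correlation_distance_nan(np.ones(4), np.arange(4.0)))  # nan
```

Checking the norms of the centered vectors (rather than calling `ptp`) follows the earlier review suggestion: the norms are needed for the denominator anyway, so the guard costs no extra pass over the data.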
OK, there are two apparent -1s, and no action from OP for a week. FTR and FWIW, I agree with Jake and Alex above. The stats.stackexchange link quoted as a justification has a bit more discussion and I fail to read it as a full justification of the suggestion here. IMO, a reference to well-reputed literature could sway it. Otherwise, I'd recommend rejecting and closing this PR.
Note, if this were in stats, I would have recommended "don't fix", similar to the discussion for the t-test (where I was on the no-NaN side and got convinced otherwise). In spatial it could be a bit more pragmatic, IMO, but I don't see a convincing argument for any specific value as an alternative to NaN.
I have stopped working on this while waiting for more opinions, and have decided to "fix" those issues on the scikit-learn side meanwhile. I am good with your decision, if you want to close the PR. I suggest that those limit behaviours get documented in the docstrings though.
If you agree to touch the documentation of the distance functions, I can work on that.
I think clarifying the documentation would be great @giorgiop. Either as a replacement of this PR (then please also change the title), or close this PR and send another one for the documentation, whichever you prefer. Thanks!
Ok, the agreement back in October seemed to be that we'd rather keep the current behavior, and there has been no activity since then. I'm closing this as a wontfix. If at some point in time we want to reconsider, the patch is still available. Thank you @giorgiop, thanks to reviewers.
Following discussion from /scikit-learn#4475, it would be nice to have corner cases handled inside scipy's `correlation` directly. Although correlation is not mathematically defined for constant input vectors, it makes sense to complete the definition in the limit: for a vector with zero variance, define the correlation distance as 1, the middle point of its domain. One vector's lack of variance says nothing about any other vector's variation. The choice is also justified by the fact that if one vector's variance is zero, the covariance is also 0, and therefore the correlation distance is 1.
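A sketch of the behavior this PR proposes (an illustration of the argument above, not the actual patch): correlation 0 for a zero-variance input, hence correlation distance 1 - 0 = 1.

```python
import numpy as np

def correlation_distance_patched(u, v):
    # Hypothetical patched variant: a constant vector has zero variance,
    # hence zero covariance with anything, hence correlation 0 and
    # correlation distance 1 - 0 = 1.
    um = u - u.mean()
    vm = v - v.mean()
    nu, nv = np.linalg.norm(um), np.linalg.norm(vm)
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - np.dot(um, vm) / (nu * nv)

print(correlation_distance_patched(np.full(5, 2.0), np.arange(5.0)))  # 1.0
```

Note this exact-zero check does not by itself cover the nearly-constant inputs discussed above, which would need a tolerance on the norms.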
I did not observe issues with other distance functions, except `yule`.