[MRG] pairwise_distances outputs Nan and negative values#5333
[MRG] pairwise_distances outputs Nan and negative values#5333giorgiop wants to merge 2 commits intoscikit-learn:masterfrom
Conversation
1710ef6 to
5fce049
Compare
|
PR on scipy (it has been closed) |
|
I should have covered all the problematic |
|
Hi, SciPy already handles these errors in its C-loop, replacing the values with As far as it's possible, I don't think we should be adding code which duplicates that in scipy. |
|
I agree. But the problem I see with such a post-process check is that we do not have a way to distinguish between constant inputs and genuine bad inputs. For example: |
|
Ah, I see. In that case, I think we should wrap the scipy functionality to do both an input validation and output adjustment. Perhaps via function like |
There was a problem hiding this comment.
Better not rely on the np.random singleton to ensure thread safety:
rng = np.random.RandomState(42)
X = rng.rand(10, 3)
@giorgiop can you please run a quick benchmark to check the impact of your current solution on the runtime performance? |
|
Here you are. I think @jakevdp's solution is the way to go. |
Unless we call scipy's distance on each pair of vectors (current slow solution), we cannot catch the errors. But we can validate the input, that is, check if the vectors are constant, and then fix the output accordingly. |
Indeed.
That's worth trying and benchmarking to see what is the resulting overhead of this strategy. |
|
New benchmark. Slow down only x2 |
8c768d0 to
0587a7a
Compare
sklearn/metrics/pairwise.py
Outdated
There was a problem hiding this comment.
I'd just put all this code in _filter_binary_pairwise_output. No need for the indirection to another function, since it's so short and only used once.
0587a7a to
131a129
Compare
|
@jakevdp auxiliary function is avoided, thanks. |
There was a problem hiding this comment.
This should be ~non_const_row, yes? I believe that as written it would lead to a value error. Also probably indicates that tests aren't covering this branch of the conditional.
|
I'm still a bit worried about this conceptually: that is, if the stated definition of, say, yule distance evaluates to 0/0 for a given input, should we really set the NaN result to zero in all cases? I'm not convinced that this choice won't cause problems in other applications. Thoughts, @ogrisel? |
e556352 to
7964ec6
Compare
|
For yule, you only get 0/0 if both vectors are constant False, right? [there is not wikipedia entry for this distance?] |
0.16 doc (it was wrong in 0.14, still the first result from Google). I think we are dividing by 0 anytime |
e5b386b to
4a433f8
Compare
accaff3 to
fcc043e
Compare
|
Summary of the changes.
|
fcc043e to
6643a5a
Compare
|
With regard to binary distances, bad output is filtered only when the function is not well-defined on the input vectors:
|
|
Could you please have a look and tell me what you think? @jakevdp |
7f63a87 to
c36586b
Compare
|
The PR should address also #5772 (comment) |
8302206 to
3844974
Compare
|
The touch to |
c75551d to
66053e6
Compare
|
@jakevdp could you have a look at this when you have time? I haven't touched the PR in a while and it's good to see with a fresh mind whether those changes still make sense. |
|
@giorgiop are you closing this just because it's stagnated? or some other reason? |
|
I have never been fully convinced this was the right approach to solve those numerical issues. I believe this fix was judged as required for the last release but that it was left out for lack of time in reviewing. Interest on this seems to have faded out anyway so I wanted to close. I believe a fresh brand new approach may be better than trying to interpret what I did here. Let me know if you want to me re-open anyway. |
Fixes #4475 . The problem is about pairwise distances and not T-SNE. Refer to the issue for discussion.
#4495 deals with the same issue but it does not seem active any more.
Tests for problems related with negative values now pass. Regarding
nanI am going to see if a fix on the scipy's side is possible. Until then, I have written wrappers of the scipy's functions.Changes
TSNEas from TSNE with correlation metric: ValueError: Distance matrix 'X' must be symmetric #4475 but covering all distances fromsklearn.metrics.pairwise.pairwise_distancespairwise_distancescorrelation, almost the same ascosine