
[MRG] pairwise_distances outputs Nan and negative values#5333

Closed
giorgiop wants to merge 2 commits intoscikit-learn:masterfrom
giorgiop:t-sne-zero-var

Conversation

@giorgiop
Contributor

@giorgiop giorgiop commented Oct 1, 2015

Fixes #4475. The problem is in pairwise distances, not in t-SNE; refer to the issue for discussion. #4495 deals with the same issue but does not seem active any more.

Tests for the problems related to negative values now pass. Regarding NaN, I am going to see whether a fix on scipy's side is possible. Until then, I have written wrappers around scipy's functions.

Changes

@giorgiop
Contributor Author

giorgiop commented Oct 1, 2015

PR on scipy (it has been closed)

@giorgiop
Contributor Author

giorgiop commented Oct 6, 2015

This should now cover all the problematic pairwise_distances metrics.

@jakevdp
Member

jakevdp commented Oct 7, 2015

Hi,
Thanks for taking a look at this. I read through this PR quickly, and have one major concern: the nested for loops over all the data within the _scipy_wrapper_binary_distances() function. That will be far slower than the current implementation.

SciPy already handles these errors in its C-loop, replacing the values with NaN. It would be much faster and cleaner to do e.g. K[np.isnan(K)] = 0.0 or K[K < 0] = 0, depending on the case.

As far as it's possible, I don't think we should be adding code which duplicates that in scipy.
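
For reference, the vectorized post-processing suggested above would look something like this (a sketch; it assumes we are happy to map undefined distances to 0):

```python
import numpy as np
from scipy.spatial.distance import cdist

# An all-zeros row makes the Yule dissimilarity 0/0 (NaN on older scipy).
X = np.array([[0, 0, 0],
              [1, 0, 1]])

K = cdist(X, X, metric='yule')

# Vectorized cleanup instead of a Python-level double loop:
K[np.isnan(K)] = 0.0   # undefined distances -> 0
K[K < 0] = 0.0         # numerical noise -> 0
```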

@giorgiop
Contributor Author

giorgiop commented Oct 7, 2015

I agree. But the problem I see with such a post-processing check is that we have no way to distinguish between constant inputs and genuinely bad inputs. For example:

import numpy as np
from sklearn.metrics import pairwise_distances

X1 = np.array([[0, 1, 0], [0, 1, 1]])
X2 = np.array([[1, 1, 1], [0, 0, 1]])
X3 = X1 + 0.05  # floats, but binary input is expected

pairwise_distances(X1, X2, metric='yule')
array([[ nan,   2.],
       [ nan,   0.]])
pairwise_distances(X1, X3, metric='yule')
array([[ nan,  nan],
       [ nan,  nan]])

@jakevdp
Member

jakevdp commented Oct 7, 2015

Ah, I see. In that case, I think we should wrap the scipy functionality to do both input validation and output adjustment, perhaps via functions like validate_pairwise_input(X, Y, metric) and filter_pairwise_output(D, metric) or something similar, called at the beginning and end of pairwise_distances.
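
A rough sketch of that split (the names validate_pairwise_input and filter_pairwise_output come from the suggestion above; the bodies here are illustrative, not actual sklearn API):

```python
import numpy as np
from scipy.spatial.distance import cdist

def validate_pairwise_input(X, Y, metric):
    """Flag rows on which `metric` is undefined (here: constant rows for 'yule')."""
    if metric == 'yule':
        # A row is constant iff its peak-to-peak range is zero.
        return np.ptp(X, axis=1) == 0, np.ptp(Y, axis=1) == 0
    return np.zeros(len(X), bool), np.zeros(len(Y), bool)

def filter_pairwise_output(D, bad_x, bad_y):
    """Zero out entries involving a row where the metric was undefined."""
    D[bad_x, :] = 0.0
    D[:, bad_y] = 0.0
    return D

X = np.array([[0, 0, 0], [1, 0, 1]])
Y = np.array([[1, 1, 0], [0, 1, 1]])
bad_x, bad_y = validate_pairwise_input(X, Y, 'yule')
D = filter_pairwise_output(cdist(X, Y, metric='yule'), bad_x, bad_y)
```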

@ogrisel ogrisel added this to the 0.17 milestone Oct 7, 2015
Member

Better not rely on the np.random singleton to ensure thread safety:

rng =  np.random.RandomState(42)
X = rng.rand(10, 3)

@ogrisel
Member

ogrisel commented Oct 7, 2015

I read through this PR quickly, and have one major concern: the nested for loops over all the data within the _scipy_wrapper_binary_distances() function. That will be far slower than the current implementation.

@giorgiop can you please run a quick benchmark to check the impact of your current solution on the runtime performance?

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

Here you are.

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.RandomState(0).randint(0, 2, size=(100, 500))

%timeit -n 10 pdist(X, metric='yule')
10 loops, best of 3: 21.2 ms per loop

# Master branch
%timeit -n 10 pairwise_distances(X, metric='yule')
10 loops, best of 3: 21.4 ms per loop

# This branch
%timeit -n 10 pairwise_distances(X, metric='yule')
10 loops, best of 3: 496 ms per loop

I think @jakevdp's solution is the way to go.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

Ah, I see. In that case, I think we should wrap the scipy functionality to do both input validation and output adjustment, perhaps via functions like validate_pairwise_input(X, Y, metric) and filter_pairwise_output(D, metric) or something similar, called at the beginning and end of pairwise_distances.

Unless we call scipy's distance function on each pair of vectors (the current, slow solution), we cannot catch the errors. But we can validate the input, that is, check whether the vectors are constant, and then fix the output accordingly.
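
The constant-vector check itself is cheap, e.g. (a minimal sketch):

```python
import numpy as np

# A row is constant exactly when its peak-to-peak range is zero,
# so 'yule' would be undefined against it.
X = np.array([[0., 0., 0.],
              [1., 0., 1.]])
constant_rows = np.ptp(X, axis=1) == 0
```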

@ogrisel
Member

ogrisel commented Oct 8, 2015

I think @jakevdp's solution is the way to go.

Indeed.

But we can validate the input, that is, check if the vectors are constant, and then fix the output accordingly.

That's worth trying and benchmarking to see the resulting overhead of this strategy.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

New benchmark: only a 2x slowdown now.

# This branch
%timeit -n 50 pairwise_distances(X, metric='yule')
50 loops, best of 3: 43.4 ms per loop

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from 8c768d0 to 0587a7a Compare October 8, 2015 12:17
Member

I'd just put all this code in _filter_binary_pairwise_output. No need for the indirection to another function, since it's so short and only used once.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

@jakevdp the auxiliary function is avoided now, thanks.

# This branch
%timeit -n 50 pairwise_distances(X, metric='yule')
50 loops, best of 3: 42.4 ms per loop

Member

This should be ~non_const_row, yes? I believe that as written it would lead to a ValueError. It also probably indicates that the tests aren't covering this branch of the conditional.

@jakevdp
Member

jakevdp commented Oct 8, 2015

I'm still a bit worried about this conceptually: that is, if the stated definition of, say, yule distance evaluates to 0/0 for a given input, should we really set the NaN result to zero in all cases? I'm not convinced that this choice won't cause problems in other applications. Thoughts, @ogrisel?

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from e556352 to 7964ec6 Compare October 9, 2015 12:21
@amueller
Member

amueller commented Oct 9, 2015

For yule, you only get 0/0 if both vectors are constant False, right? [Is there no Wikipedia entry for this distance?]
Should we warn on constant rows?

@giorgiop
Contributor Author

For yule, you only get 0/0 if both vectors are constant False, right? [Is there no Wikipedia entry for this distance?]

The 0.16 docs (it was wrong in the 0.14 docs, which are still the first result on Google). I think we are dividing by 0 any time one vector is constant. Let me test (on master):

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> from sklearn.metrics.pairwise import pairwise_distances

>>> X1 = np.zeros((3, 3))
>>> X1[1, :] = 1
>>> X1
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  0.,  0.]])
>>> X2 = np.diag([1, 1, 1])

>>> cdist(X1, X1, 'yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
>>> cdist(X1, X2, 'yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
>>> pairwise_distances(X1, X1, metric='yule')
array([[  0.,  nan,  nan],
       [ nan,   0.,  nan],
       [ nan,  nan,   0.]])
>>> pairwise_distances(X1, X2, metric='yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
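
The contingency counts make it clear why one constant vector is enough (yule_terms below is an illustrative helper mirroring scipy's definition, 2*ctf*cft / (ctt*cff + ctf*cft), not scipy API): if either vector is constant, both the numerator and the denominator vanish.

```python
import numpy as np

def yule_terms(u, v):
    """Contingency counts behind the Yule dissimilarity
    2*ctf*cft / (ctt*cff + ctf*cft)."""
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    return (int(np.sum(u & v)),    # ctt: True in both
            int(np.sum(~u & ~v)),  # cff: False in both
            int(np.sum(u & ~v)),   # ctf: True in u, False in v
            int(np.sum(~u & v)))   # cft: False in u, True in v

# u constant False: ctt = 0 and ctf = 0, so numerator and denominator are 0.
ctt, cff, ctf, cft = yule_terms([0, 0, 0], [1, 0, 1])
```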

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 2 times, most recently from accaff3 to fcc043e Compare November 5, 2015 09:30
@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

Summary of the changes.

  • preprocessing.data._handle_zeros_in_scale now also handles "almost" constant vectors, those whose scale is very close to 0 in terms of machine precision. This extends the existing workaround for numerical issues while scaling.
  • preprocessing.data.normalize can disable this behaviour via a new parameter. This is necessary when computing _correlation_distances: we do not want to silently handle those numerical issues all over the code base, only where the application requires it.
  • the helper _clip_to_zero sets distances to 0 when they are negative, or positive but within machine precision of 0. If requested, the function also zeroes the diagonal when X is Y (the distance between a vector and itself).
  • _correlation_similarity is now built in. It behaves the same as scipy.spatial, but calls _clip_to_zero internally to cut off negative and small positive distances. It still returns NaN when called on constant vectors (unless they are the same vector), in accordance with scipy's semantics. When called with support_constants=True, constant vectors are handled. Remark: I did not implement an agnostic filter for NaN values; garbage in, garbage out still applies. We only want to support constant input (for which the distance would otherwise be undefined) and numerical issues due to floating-point arithmetic.
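
For concreteness, the clipping step could look like this (a sketch of the helper described above; the name _clip_to_zero is from the summary, but the body and threshold here are assumptions, not the PR's exact code):

```python
import numpy as np

def _clip_to_zero(D, zero_diagonal=False, tol=None):
    """Clip negative values and tiny positive values (numerical noise) to 0."""
    if tol is None:
        tol = 10 * np.finfo(D.dtype).eps   # assumed threshold
    D[np.abs(D) < tol] = 0.0   # tiny positive or negative -> exactly 0
    D[D < 0] = 0.0             # remaining negatives are numerical noise
    if zero_diagonal:          # distance of each vector to itself
        np.fill_diagonal(D, 0.0)
    return D

D = np.array([[3e-17, 0.5], [-1e-16, 2.0]])
_clip_to_zero(D)
```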

@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

With regard to binary distances, bad output is filtered only where the metric is not well-defined on the input vectors:

  • 'yule' => at least one vector is constant
  • 'dice' and 'sokalsneath' => both vectors are 0
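
These conditions are cheap to compute ahead of time (a sketch, not the PR's exact code):

```python
import numpy as np

X = np.array([[0, 0, 0],    # constant and all-zero
              [1, 1, 1],    # constant
              [1, 0, 1]])   # fine for all three metrics

is_constant = np.ptp(X, axis=1) == 0   # 'yule' is undefined for these rows
is_all_zero = ~X.any(axis=1)           # 'dice'/'sokalsneath' are undefined
                                       # when *both* vectors are all-zero
```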

@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

Could you please have a look and tell me what you think? @jakevdp

@giorgiop
Contributor Author

giorgiop commented Nov 9, 2015

The PR should also address #5772 (comment)

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from 8302206 to 3844974 Compare November 13, 2015 10:05
@giorgiop
Contributor Author

The touch to test_gmm apparently fixed a non-deterministic test that was breaking under python=2.6 and that I am not able to reproduce locally. It appears unrelated to this PR. Honestly, I cannot say with certainty that this change is the right one, but I haven't experienced problems with that test since.

@giorgiop
Contributor Author

@jakevdp could you have a look at this when you have time? I haven't touched the PR in a while, and it would be good to check with a fresh mind whether these changes still make sense.

@lesteve lesteve modified the milestones: 0.17, 0.18 Jul 27, 2016
@giorgiop giorgiop closed this Aug 22, 2016
@jnothman
Member

@giorgiop are you closing this just because it's stagnated? or some other reason?

@giorgiop
Contributor Author

I have never been fully convinced this was the right approach to solving these numerical issues. I believe this fix was judged necessary for the last release, but it was left out for lack of reviewing time. Interest in this seems to have faded anyway, so I wanted to close it. I believe a brand new approach may be better than trying to interpret what I did here.

Let me know if you want me to re-open anyway.



Development

Successfully merging this pull request may close these issues.

TSNE with correlation metric: ValueError: Distance matrix 'X' must be symmetric

7 participants