
[MRG] pairwise_distances outputs Nan and negative values#5333

Closed
giorgiop wants to merge 2 commits intoscikit-learn:masterfrom
giorgiop:t-sne-zero-var

Conversation

@giorgiop
Contributor

@giorgiop giorgiop commented Oct 1, 2015

Fixes #4475. The problem is in pairwise distances, not in t-SNE; refer to the issue for discussion. #4495 deals with the same issue but does not seem active any more.

Tests for the problems related to negative values now pass. Regarding NaN, I am going to see whether a fix on scipy's side is possible. Until then, I have written wrappers around scipy's functions.

Changes

@giorgiop
Contributor Author

giorgiop commented Oct 1, 2015

PR on scipy (it has been closed)

@giorgiop
Contributor Author

giorgiop commented Oct 6, 2015

This should now cover all the problematic pairwise_distances metrics.

@jakevdp
Member

jakevdp commented Oct 7, 2015

Hi,
Thanks for taking a look at this. I read through this PR quickly, and have one major concern: the nested for loops over all the data within the _scipy_wrapper_binary_distances() function. That will be far slower than the current implementation.

SciPy already handles these errors in its C-loop, replacing the values with NaN. It would be much faster and cleaner to do e.g. K[np.isnan(K)] = 0.0 or K[K < 0] = 0, depending on the case.

As far as it's possible, I don't think we should be adding code which duplicates that in scipy.
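
For reference, the vectorized post-processing suggested above would look something like this (a sketch; it assumes we are happy to map undefined distances to 0):

```python
import numpy as np
from scipy.spatial.distance import cdist

# An all-zeros row makes the Yule dissimilarity 0/0 (NaN on older scipy).
X = np.array([[0, 0, 0],
              [1, 0, 1]])

K = cdist(X, X, metric='yule')

# Vectorized cleanup instead of a Python-level double loop:
K[np.isnan(K)] = 0.0   # undefined distances -> 0
K[K < 0] = 0.0         # numerical noise -> 0
```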

@giorgiop
Contributor Author

giorgiop commented Oct 7, 2015

I agree. But the problem I see with such a post-processing check is that we have no way to distinguish between constant inputs and genuinely bad inputs. For example:

import numpy as np
from sklearn.metrics import pairwise_distances

X1 = np.array([[0, 1, 0], [0, 1, 1]])
X2 = np.array([[1, 1, 1], [0, 0, 1]])
X3 = X1 + 0.05  # floats, but binary input is expected

pairwise_distances(X1, X2, metric='yule')
array([[ nan,   2.],
       [ nan,   0.]])
pairwise_distances(X1, X3, metric='yule')
array([[ nan,  nan],
       [ nan,  nan]])

@jakevdp
Member

jakevdp commented Oct 7, 2015

Ah, I see. In that case, I think we should wrap the scipy functionality to do both input validation and output adjustment, perhaps via functions like validate_pairwise_input(X, Y, metric) and filter_pairwise_output(D, metric) or something similar, called at the beginning and end of pairwise_distances.
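
A rough sketch of that split (the names validate_pairwise_input and filter_pairwise_output come from the suggestion above; the bodies here are illustrative, not actual sklearn API):

```python
import numpy as np
from scipy.spatial.distance import cdist

def validate_pairwise_input(X, Y, metric):
    """Flag rows on which `metric` is undefined (here: constant rows for 'yule')."""
    if metric == 'yule':
        # A row is constant iff its peak-to-peak range is zero.
        return np.ptp(X, axis=1) == 0, np.ptp(Y, axis=1) == 0
    return np.zeros(len(X), bool), np.zeros(len(Y), bool)

def filter_pairwise_output(D, bad_x, bad_y):
    """Zero out entries involving a row where the metric was undefined."""
    D[bad_x, :] = 0.0
    D[:, bad_y] = 0.0
    return D

X = np.array([[0, 0, 0], [1, 0, 1]])
Y = np.array([[1, 1, 0], [0, 1, 1]])
bad_x, bad_y = validate_pairwise_input(X, Y, 'yule')
D = filter_pairwise_output(cdist(X, Y, metric='yule'), bad_x, bad_y)
```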

@ogrisel ogrisel added this to the 0.17 milestone Oct 7, 2015
Member

Better not rely on the np.random singleton to ensure thread safety:

rng =  np.random.RandomState(42)
X = rng.rand(10, 3)

@ogrisel
Member

ogrisel commented Oct 7, 2015

I read through this PR quickly, and have one major concern: the nested for loops over all the data within the _scipy_wrapper_binary_distances() function. That will be far slower than the current implementation.

@giorgiop can you please run a quick benchmark to check the impact of your current solution on the runtime performance?

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

Here you are.

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.RandomState(0).randint(0, 2, size=(100, 500))

%timeit -n 10 pdist(X, metric='yule')
10 loops, best of 3: 21.2 ms per loop

# Master branch
%timeit -n 10 pairwise_distances(X, metric='yule')
10 loops, best of 3: 21.4 ms per loop

# This branch
%timeit -n 10 pairwise_distances(X, metric='yule')
10 loops, best of 3: 496 ms per loop

I think @jakevdp's solution is the way to go.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

Ah, I see. In that case, I think we should wrap the scipy functionality to do both input validation and output adjustment, perhaps via functions like validate_pairwise_input(X, Y, metric) and filter_pairwise_output(D, metric) or something similar, called at the beginning and end of pairwise_distances.

Unless we call scipy's distance function on each pair of vectors (the current, slow solution), we cannot catch the errors. But we can validate the input, that is, check whether the vectors are constant, and then fix the output accordingly.
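
The constant-vector check itself is cheap, e.g. (a minimal sketch):

```python
import numpy as np

# A row is constant exactly when its peak-to-peak range is zero,
# so 'yule' would be undefined against it.
X = np.array([[0., 0., 0.],
              [1., 0., 1.]])
constant_rows = np.ptp(X, axis=1) == 0
```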

@ogrisel
Member

ogrisel commented Oct 8, 2015

I think @jakevdp's solution is the way to go.

Indeed.

But we can validate the input, that is, check if the vectors are constant, and then fix the output accordingly.

That's worth trying and benchmarking to see the resulting overhead of this strategy.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

New benchmark: only a 2x slowdown now.

# This branch
%timeit -n 50 pairwise_distances(X, metric='yule')
50 loops, best of 3: 43.4 ms per loop

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from 8c768d0 to 0587a7a Compare October 8, 2015 12:17
Member

I'd just put all this code in _filter_binary_pairwise_output. No need for the indirection to another function, since it's so short and only used once.

@giorgiop
Contributor Author

giorgiop commented Oct 8, 2015

@jakevdp the auxiliary function is avoided now, thanks.

# This branch
%timeit -n 50 pairwise_distances(X, metric='yule')
50 loops, best of 3: 42.4 ms per loop

Member

This should be ~non_const_row, yes? I believe that as written it would lead to a ValueError. It also probably indicates that the tests aren't covering this branch of the conditional.

@jakevdp
Member

jakevdp commented Oct 8, 2015

I'm still a bit worried about this conceptually: that is, if the stated definition of, say, yule distance evaluates to 0/0 for a given input, should we really set the NaN result to zero in all cases? I'm not convinced that this choice won't cause problems in other applications. Thoughts, @ogrisel?

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from e556352 to 7964ec6 Compare October 9, 2015 12:21
@amueller
Member

amueller commented Oct 9, 2015

For yule, you only get 0/0 if both vectors are constant False, right? [Is there no Wikipedia entry for this distance?]
Should we warn on constant rows?

@giorgiop
Contributor Author

For yule, you only get 0/0 if both vectors are constant False, right? [Is there no Wikipedia entry for this distance?]

The 0.16 docs (it was wrong in the 0.14 docs, which are still the first result on Google). I think we are dividing by 0 any time one vector is constant. Let me test (on master):

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> from sklearn.metrics.pairwise import pairwise_distances

>>> X1 = np.zeros((3, 3))
>>> X1[1, :] = 1
>>> X1
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  0.,  0.]])
>>> X2 = np.diag([1, 1, 1])

>>> cdist(X1, X1, 'yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
>>> cdist(X1, X2, 'yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
>>> pairwise_distances(X1, X1, metric='yule')
array([[  0.,  nan,  nan],
       [ nan,   0.,  nan],
       [ nan,  nan,   0.]])
>>> pairwise_distances(X1, X2, metric='yule')
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
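
The contingency counts make it clear why one constant vector is enough (yule_terms below is an illustrative helper mirroring scipy's definition, 2*ctf*cft / (ctt*cff + ctf*cft), not scipy API): if either vector is constant, both the numerator and the denominator vanish.

```python
import numpy as np

def yule_terms(u, v):
    """Contingency counts behind the Yule dissimilarity
    2*ctf*cft / (ctt*cff + ctf*cft)."""
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    return (int(np.sum(u & v)),    # ctt: True in both
            int(np.sum(~u & ~v)),  # cff: False in both
            int(np.sum(u & ~v)),   # ctf: True in u, False in v
            int(np.sum(~u & v)))   # cft: False in u, True in v

# u constant False: ctt = 0 and ctf = 0, so numerator and denominator are 0.
ctt, cff, ctf, cft = yule_terms([0, 0, 0], [1, 0, 1])
```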

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 2 times, most recently from accaff3 to fcc043e Compare November 5, 2015 09:30
@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

Summary of the changes.

  • preprocessing.data._handle_zeros_in_scale now also handles "almost" constant vectors, those whose scale is very close to 0 in terms of machine precision. This extends the existing workaround for numerical issues while scaling.
  • preprocessing.data.normalize can disable this behaviour via a new parameter. This is necessary when computing _correlation_distances: we do not want to silently handle those numerical issues all over the code base, only where the application requires it.
  • the helper _clip_to_zero sets distances to 0 when they are negative, or positive but within machine precision of 0. If requested, the function also zeroes the diagonal when X is Y (the distance between a vector and itself).
  • _correlation_similarity is now built in. It behaves the same as scipy.spatial, but calls _clip_to_zero internally to cut off negative and small positive distances. It still returns NaN when called on constant vectors (unless they are the same vector), in accordance with scipy's semantics. When called with support_constants=True, constant vectors are handled. Remark: I did not implement an agnostic filter for NaN values; garbage in, garbage out still applies. We only want to support constant input (for which the distance would otherwise be undefined) and numerical issues due to floating-point arithmetic.
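
For concreteness, the clipping step could look like this (a sketch of the helper described above; the name _clip_to_zero is from the summary, but the body and threshold here are assumptions, not the PR's exact code):

```python
import numpy as np

def _clip_to_zero(D, zero_diagonal=False, tol=None):
    """Clip negative values and tiny positive values (numerical noise) to 0."""
    if tol is None:
        tol = 10 * np.finfo(D.dtype).eps   # assumed threshold
    D[np.abs(D) < tol] = 0.0   # tiny positive or negative -> exactly 0
    D[D < 0] = 0.0             # remaining negatives are numerical noise
    if zero_diagonal:          # distance of each vector to itself
        np.fill_diagonal(D, 0.0)
    return D

D = np.array([[3e-17, 0.5], [-1e-16, 2.0]])
_clip_to_zero(D)
```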

@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

With regard to binary distances, bad output is filtered only where the metric is not well-defined on the input vectors:

  • 'yule' => at least one vector is constant
  • 'dice' and 'sokalsneath' => both vectors are 0
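
These conditions are cheap to compute ahead of time (a sketch, not the PR's exact code):

```python
import numpy as np

X = np.array([[0, 0, 0],    # constant and all-zero
              [1, 1, 1],    # constant
              [1, 0, 1]])   # fine for all three metrics

is_constant = np.ptp(X, axis=1) == 0   # 'yule' is undefined for these rows
is_all_zero = ~X.any(axis=1)           # 'dice'/'sokalsneath' are undefined
                                       # when *both* vectors are all-zero
```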

@giorgiop
Contributor Author

giorgiop commented Nov 5, 2015

Could you please have a look and tell me what you think? @jakevdp

@giorgiop
Contributor Author

giorgiop commented Nov 9, 2015

The PR should also address #5772 (comment)

@giorgiop giorgiop force-pushed the t-sne-zero-var branch 4 times, most recently from 8302206 to 3844974 Compare November 13, 2015 10:05
@giorgiop
Contributor Author

The touch to test_gmm apparently fixed a non-deterministic test that was breaking under python=2.6 and that I am not able to reproduce locally. It appears unrelated to this PR. Honestly, I cannot say with certainty that this change is the right one, but I haven't experienced problems with that test since.

@giorgiop
Contributor Author

@jakevdp could you have a look at this when you have time? I haven't touched the PR in a while, and it would be good to check with a fresh mind whether these changes still make sense.

@lesteve lesteve modified the milestones: 0.17, 0.18 Jul 27, 2016
@giorgiop giorgiop closed this Aug 22, 2016
@jnothman
Member

@giorgiop are you closing this just because it's stagnated? or some other reason?

@giorgiop
Contributor Author

I have never been fully convinced this was the right approach to solving these numerical issues. I believe this fix was judged necessary for the last release, but it was left out for lack of reviewing time. Interest in this seems to have faded anyway, so I wanted to close it. I believe a brand new approach may be better than trying to interpret what I did here.

Let me know if you want me to re-open anyway.



Development

Successfully merging this pull request may close these issues.

TSNE with correlation metric: ValueError: Distance matrix 'X' must be symmetric

7 participants