[MRG + 1] fix bug with negative values in cosine_distances #7732
NelleV merged 4 commits into scikit-learn:master
Conversation
clip distances to [0, 2]; set distances between vectors and themselves to 0
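The fix described above can be sketched in plain NumPy. This is an illustrative re-implementation, not the actual scikit-learn code; the function name `cosine_distances_fixed` is hypothetical.

```python
import numpy as np

def cosine_distances_fixed(X):
    """Sketch of the fix: compute cosine distances, clip to the valid
    range [0, 2], and force the self-distances on the diagonal to 0."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms                      # row-normalize
    S = 1.0 - Xn @ Xn.T                 # distance = 1 - cosine similarity
    np.clip(S, 0, 2, out=S)             # rounding can push values outside [0, 2]
    S.flat[::S.shape[0] + 1] = 0.0      # zero the diagonal (each vector vs itself)
    return S
```

Without the last two lines, floating point rounding can leave tiny negative values on the diagonal.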
I am in line with this solution (I did the same in a bigger, never-merged PR #5333).

Sure. Should I add it to sklearn/metrics/tests/test_pairwise.py as a separate function?

I would put some tests in both pairwise and t-SNE. You should be able to reuse some of the tests I wrote back then. But please read them critically :)

The test will fail without the fix. It generates distances less than 0 and greater than 2.
3aed213 to 0713cc2
D = cosine_distances(XA)
assert_array_almost_equal(D, [[0., 0.], [0., 0.]])
# check that all elements are in [0, 2]
assert_true(np.all(D >= 0.))
This seems pretty vacuous after the test that all entries are zero? Maybe test these with a larger random matrix?
These are the specific random vectors that I found to produce negative values on the diagonal if you don't clip them to [0, 2].
I can add another test with
X = np.abs(rng.rand(1000, 5000))
D = cosine_distances(X)
# we get precisely 0 only if previously negative values on the diagonal
# were clipped, so this check alone is not sufficient
assert_array_almost_equal(D.flat[::D.shape[0] + 1], [0., 0.])
assert_true(np.all(D >= 0.))
assert_true(np.all(D <= 2.))
I would like that. I'm not saying that your test doesn't make sense, but after the first assert_array_almost_equal it is certain that the other tests pass, right? I guess the >= 0 checks that it's actually >= 0 and not only "almost", but if it's "almost zero" it is certainly smaller than two.
On the other hand, checking both boundaries in both cases is fine. But having a large random matrix in addition would also be nice.
I guess the >= 0 checks if it's actually >= and not only "almost"

Absolutely. If the value is -1e-16 then it will pass assert_array_almost_equal, but the second check on >= 0 will make it fail.
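The point above is easy to verify: NumPy's approximate assertion tolerates a tiny negative value, while the strict sign check does not.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

d = np.array([-1e-16])
assert_array_almost_equal(d, [0.])   # passes: -1e-16 is "almost" 0
print(np.all(d >= 0.))               # False: the strict check catches the sign
```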
OK. I will add another test with a random 1000 x 5000 matrix.
assert_array_equal(D.flat[::D.shape[0] + 1], [0., 0.])

XB = np.vstack([x, -x])
D2 = cosine_distances(XB)
For these specific vectors we will have values < 0 on the diagonal and > 2.0 off the diagonal if you don't clip them to [0, 2].
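To see why a vector and its negation exercise both boundaries: the exact cosine distance between x and -x is 2, and between each vector and itself is 0, so any rounding error lands directly outside [0, 2]. A plain-NumPy illustration (not the scikit-learn implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5)
XB = np.vstack([x, -x])

# Cosine distance computed directly: 1 - cosine similarity.
norms = np.linalg.norm(XB, axis=1, keepdims=True)
Xn = XB / norms
D2 = 1.0 - Xn @ Xn.T
# Exact values: diagonal 0, off-diagonal 2; floating point may nudge
# them just below 0 / just above 2, which is what the clipping guards.
```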
LGTM, though you have a PEP 8 violation.
6814010 to d5d290b
Sorry. Repushed ;)
sklearn/metrics/pairwise.py
Outdated
if X is Y or Y is None:
    # Ensure that distances between vectors and themselves are set to 0.0.
    # This may not be the case due to floating point rounding errors.
    S.flat[::S.shape[0] + 1] = 0.0
If I am not mistaken, this case leads to a square matrix. Why not use np.diag_indices? It seems much clearer to me what the code does when using the diagonal of the matrix instead of the slicing.
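The two idioms being discussed are equivalent for a square C-contiguous matrix; the difference is purely readability. A small comparison:

```python
import numpy as np

S = np.arange(9, dtype=float).reshape(3, 3)

# Strided-slicing trick: every (n+1)-th element of the flat view
# is a diagonal entry of an n x n matrix.
S1 = S.copy()
S1.flat[::S1.shape[0] + 1] = 0.0

# Explicit diagonal indices: states the intent directly.
S2 = S.copy()
S2[np.diag_indices_from(S2)] = 0.0

print(np.array_equal(S1, S2))   # True
```

The PR ultimately switched to np.diag_indices_from for exactly this readability reason.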
c2d0fe7 to 3ee77e7
Thanks for putting up with my annoying nitpick :)

No problem! Thank you!

Great, congratulations @asanakoy :)
…arn#7732)
* fix bug with negative values in cosine_distances: clip distances to [0, 2], set distances between vectors and themselves to 0
* add test
* add test on big random matrix
* use np.diag_indices_from instead of slicing
Reference Issue
Fixes #5772
What does this implement/fix? Explain your changes.
Fix bug with cosine_distances returning small negative values. Essentially:
Any other comments?
Did it analogously to how it is implemented in euclidean_distances.