Skip to content

[MRG + 1] fix bug with negative values in cosine_distances#7732

Merged
NelleV merged 4 commits intoscikit-learn:masterfrom
asanakoy:fix-negative-cosine-dist
Oct 26, 2016
Merged

[MRG + 1] fix bug with negative values in cosine_distances#7732
NelleV merged 4 commits intoscikit-learn:masterfrom
asanakoy:fix-negative-cosine-dist

Conversation

@asanakoy
Copy link
Copy Markdown
Contributor

Reference Issue

Fixes #5772

What does this implement/fix? Explain your changes.

Fix bug cosine_distances returning small negative values.

Essentially:

  • clip cosine distances to [0, 2].
  • set distances between vectors and themselves to 0.

Any other comments?

Did it analogously as it was implemented in euclidean_distances.

clip distances to [0, 2]
set distances between vectors and themselves to 0
@giorgiop
Copy link
Copy Markdown
Contributor

I am in line with this solution (I did the same in a bigger, never merged PR #5333).
Could you please add a test that would break without your changes?

@asanakoy
Copy link
Copy Markdown
Contributor Author

Sure. Should I add it to sklearn/metrics/tests/test_pairwise.py as a separate function?

@giorgiop
Copy link
Copy Markdown
Contributor

I would put some both in parwise and T-SNE. You should be able to reuse some of the tests I wrote back then. But please read them critically :)

@asanakoy
Copy link
Copy Markdown
Contributor Author

The test will fail without the fix. It generates distances less than 0 and greater than 2.

@asanakoy asanakoy force-pushed the fix-negative-cosine-dist branch from 3aed213 to 0713cc2 Compare October 24, 2016 03:19
D = cosine_distances(XA)
assert_array_almost_equal(D, [[0., 0.], [0., 0.]])
# check that all elements are in [0, 2]
assert_true(np.all(D >= 0.))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty vacuous after the test that all entries are zero? Maybe tests these with a larger random matrix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the specific random vectors that I found to produce negative values on diagonal if you don't clip them to [0,2].
I can add another test with

X = np.abs(rng.rand(1000, 5000))
D = cosine_distances(X)
# we get precisely 0 only if we clip previously negative values on the diagonal. So we are not sure here.
assert_array_almost_equal(D.flat[::D.shape[0] + 1], [0., 0.])
assert_true(np.all(D >= 0.))
assert_true(np.all(D <= 2.))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like that. It's not saying that your test doesn't make sense, but after the first assert_array_almost_equal it is certain that the tests pass, right? I guess the >=0 checks if it's actually >= and not only "almost" but if it's "almost zero" it is certainly smaller than two.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand checking both boundaries for both cases is fine. But having a large random matrix in addition would also be nice.

Copy link
Copy Markdown
Contributor Author

@asanakoy asanakoy Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the >=0 checks if it's actually >= and not only "almost"

Absolutely. If the value is -1e-16 than it will pass assert_array_almost_equal, but the second check on >=0 will make it fail.

OK. I will add another test with random matrix 1000 x 5000 .

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :)

@amueller amueller changed the title fix bug with negative values in cosine_distances [MRG] fix bug with negative values in cosine_distances Oct 24, 2016
D = cosine_distances(XA)
assert_array_almost_equal(D, [[0., 0.], [0., 0.]])
# check that all elements are in [0, 2]
assert_true(np.all(D >= 0.))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the specific random vectors that I found to produce negative values on diagonal if you don't clip them to [0,2].
I can add another test with

X = np.abs(rng.rand(1000, 5000))
D = cosine_distances(X)
# we get precisely 0 only if we clip previously negative values on the diagonal. So we are not sure here.
assert_array_almost_equal(D.flat[::D.shape[0] + 1], [0., 0.])
assert_true(np.all(D >= 0.))
assert_true(np.all(D <= 2.))

assert_array_equal(D.flat[::D.shape[0] + 1], [0., 0.])

XB = np.vstack([x, -x])
D2 = cosine_distances(XB)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific vectors we will have values < 0 on the diagonal and > 2.0 off diagonal if you don't clip them to [0,2].

@amueller amueller changed the title [MRG] fix bug with negative values in cosine_distances [MRG + 1] fix bug with negative values in cosine_distances Oct 24, 2016
@amueller
Copy link
Copy Markdown
Member

LGTM, though you have a pep8 violation

@asanakoy asanakoy force-pushed the fix-negative-cosine-dist branch from 6814010 to d5d290b Compare October 24, 2016 20:30
@asanakoy
Copy link
Copy Markdown
Contributor Author

Sorry. Repushed ;)

@amueller amueller added this to the 0.18.1 milestone Oct 24, 2016
if X is Y or Y is None:
# Ensure that distances between vectors and themselves are set to 0.0.
# This may not be the case due to floating point rounding errors.
S.flat[::S.shape[0] + 1] = 0.0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, this case leads to a squared matrix. Why not use np.diag_indices? It seems much more clear to me what the code does using the diagonal of the matrix instead of the slicing.

@asanakoy asanakoy force-pushed the fix-negative-cosine-dist branch from c2d0fe7 to 3ee77e7 Compare October 26, 2016 00:42
@NelleV
Copy link
Copy Markdown
Member

NelleV commented Oct 26, 2016

Thanks for putting up with my annoying nitpick :)

@NelleV NelleV merged commit 0ea8e8b into scikit-learn:master Oct 26, 2016
@asanakoy
Copy link
Copy Markdown
Contributor Author

No problem! Thank you!
My first contribution to sklearn :) I hope to be helpful further!

@amueller
Copy link
Copy Markdown
Member

Great, congratulations @asanakoy :)

@asanakoy asanakoy deleted the fix-negative-cosine-dist branch October 26, 2016 19:24
@asanakoy asanakoy restored the fix-negative-cosine-dist branch October 26, 2016 19:24
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
…arn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…arn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…arn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…arn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…arn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error thrown when using t-SNE with 'cosine' as metric

4 participants