[MRG + 1] fix bug with negative values in cosine_distances #7732
NelleV merged 4 commits into scikit-learn:master
Conversation
clip distances to [0, 2]; set distances between vectors and themselves to 0
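The fix described above can be sketched in plain NumPy. This is an illustrative re-implementation, not the actual scikit-learn code; the function name `cosine_distances_fixed` is hypothetical.

```python
import numpy as np

def cosine_distances_fixed(X):
    """Sketch of the fix: compute cosine distances, clip to the valid
    range [0, 2], and force the self-distances on the diagonal to 0."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms                      # row-normalize
    S = 1.0 - Xn @ Xn.T                 # distance = 1 - cosine similarity
    np.clip(S, 0, 2, out=S)             # rounding can push values outside [0, 2]
    S.flat[::S.shape[0] + 1] = 0.0      # zero the diagonal (each vector vs itself)
    return S
```

Without the last two lines, floating point rounding can leave tiny negative values on the diagonal.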
I am in line with this solution (I did the same in a bigger, never-merged PR #5333).

Sure. Should I add it to sklearn/metrics/tests/test_pairwise.py as a separate function?

I would put some tests in both pairwise and t-SNE. You should be able to reuse some of the tests I wrote back then. But please read them critically :)

The test will fail without the fix. It generates distances less than 0 and greater than 2.
3aed213 to 0713cc2
D = cosine_distances(XA)
assert_array_almost_equal(D, [[0., 0.], [0., 0.]])
# check that all elements are in [0, 2]
assert_true(np.all(D >= 0.))
This seems pretty vacuous after the test that all entries are zero? Maybe test these with a larger random matrix?
These are the specific random vectors that I found to produce negative values on the diagonal if you don't clip them to [0, 2].
I can add another test with
X = np.abs(rng.rand(1000, 5000))
D = cosine_distances(X)
# we get precisely 0 only if previously negative values on the diagonal
# were clipped, so this check alone is not sufficient
assert_array_almost_equal(D.flat[::D.shape[0] + 1], [0., 0.])
assert_true(np.all(D >= 0.))
assert_true(np.all(D <= 2.))
I would like that. I'm not saying that your test doesn't make sense, but after the first assert_array_almost_equal it is certain that the other tests pass, right? I guess the >= 0 checks that it's actually >= 0 and not only "almost", but if it's "almost zero" it is certainly smaller than two.
On the other hand, checking both boundaries in both cases is fine. But having a large random matrix in addition would also be nice.
I guess the >= 0 checks if it's actually >= and not only "almost"

Absolutely. If the value is -1e-16 then it will pass assert_array_almost_equal, but the second check on >= 0 will make it fail.
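The point above is easy to verify: NumPy's approximate assertion tolerates a tiny negative value, while the strict sign check does not.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

d = np.array([-1e-16])
assert_array_almost_equal(d, [0.])   # passes: -1e-16 is "almost" 0
print(np.all(d >= 0.))               # False: the strict check catches the sign
```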
OK. I will add another test with a random 1000 x 5000 matrix.
assert_array_equal(D.flat[::D.shape[0] + 1], [0., 0.])

XB = np.vstack([x, -x])
D2 = cosine_distances(XB)
For these specific vectors we will have values < 0 on the diagonal and > 2.0 off the diagonal if you don't clip them to [0, 2].
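To see why a vector and its negation exercise both boundaries: the exact cosine distance between x and -x is 2, and between each vector and itself is 0, so any rounding error lands directly outside [0, 2]. A plain-NumPy illustration (not the scikit-learn implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5)
XB = np.vstack([x, -x])

# Cosine distance computed directly: 1 - cosine similarity.
norms = np.linalg.norm(XB, axis=1, keepdims=True)
Xn = XB / norms
D2 = 1.0 - Xn @ Xn.T
# Exact values: diagonal 0, off-diagonal 2; floating point may nudge
# them just below 0 / just above 2, which is what the clipping guards.
```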
LGTM, though you have a PEP 8 violation.
6814010 to d5d290b
Sorry. Repushed ;)
sklearn/metrics/pairwise.py
Outdated
if X is Y or Y is None:
    # Ensure that distances between vectors and themselves are set to 0.0.
    # This may not be the case due to floating point rounding errors.
    S.flat[::S.shape[0] + 1] = 0.0
If I am not mistaken, this case leads to a square matrix. Why not use np.diag_indices? It seems much clearer to me what the code does when using the diagonal of the matrix instead of the slicing.
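The two idioms being discussed are equivalent for a square C-contiguous matrix; the difference is purely readability. A small comparison:

```python
import numpy as np

S = np.arange(9, dtype=float).reshape(3, 3)

# Strided-slicing trick: every (n+1)-th element of the flat view
# is a diagonal entry of an n x n matrix.
S1 = S.copy()
S1.flat[::S1.shape[0] + 1] = 0.0

# Explicit diagonal indices: states the intent directly.
S2 = S.copy()
S2[np.diag_indices_from(S2)] = 0.0

print(np.array_equal(S1, S2))   # True
```

The PR ultimately switched to np.diag_indices_from for exactly this readability reason.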
c2d0fe7 to 3ee77e7
Thanks for putting up with my annoying nitpick :)

No problem! Thank you!

Great, congratulations @asanakoy :)
…arn#7732)
* fix bug with negative values in cosine_distances: clip distances to [0, 2], set distances between vectors and themselves to 0
* add test
* add test on big random matrix
* use np.diag_indices_from instead of slicing
Reference Issue
Fixes #5772
What does this implement/fix? Explain your changes.
Fix bug with cosine_distances returning small negative values. Essentially:
Any other comments?
Did it analogously to how it is implemented in euclidean_distances.