[MRG + 1] support X_norm_squared in euclidean_distances #2459
GaelVaroquaux merged 1 commit into scikit-learn:master from …
Conversation
I don't know why the tests passed locally and failed on the server. Will look into that.
This is not mergeable anymore after my optimizations in 81950ba. The …
Okay, looks good. I still think passing in the squared norms of `X` makes sense; I'll update to take your optimizations into account and track down the test problem sometime soon.
Updated this to use the new version, with basically the same changeset as before. I put the `X_norm_squared` argument last in the signature to keep backwards compatibility with positional arguments, though logically it should probably go earlier. Is that something I should do?
Not sure why this got stale; it looks like a good contribution. Can you rebase?
force-pushed from 8d2978c to 4a1b0cc
@amueller Okay, rebased and switched to use … I still kept …
I'm OK with that ordering. LGTM.
These tests don't actually check that the arguments are being used. Do we care? Should we add tests with incorrect norms to ensure that there aren't bugs in the use of the norms, or is code coverage sufficient?
Good point. It's too late in NYC. What we want to test is that it gets faster, right? But that's not nice to test.
Maybe adding a test that the squared norms are actually used would be OK. We didn't have that for `Y_norm_squared`, though, right?
Maybe just add a test that passes both as zero and see if it just computes -2 times the dot product?
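The zero-norms idea can be sketched against a toy version of the expansion (`dist_sq` here is a hypothetical helper, not scikit-learn's code): if both squared norms are passed as zero and actually honoured, only the cross term survives.

```python
import numpy as np

def dist_sq(X, Y, XX=None, YY=None):
    # Toy sketch of the ||x||^2 - 2 x.y + ||y||^2 expansion,
    # honouring caller-supplied squared norms when given.
    if XX is None:
        XX = (X ** 2).sum(axis=1)[:, None]
    if YY is None:
        YY = (Y ** 2).sum(axis=1)[None, :]
    return XX - 2.0 * (X @ Y.T) + YY

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)

# Passing zeros for both norms should leave only the cross term:
D0 = dist_sq(X, Y, XX=np.zeros((5, 1)), YY=np.zeros((1, 4)))
assert np.allclose(D0, -2.0 * (X @ Y.T))
```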
Apart from that nitpick, this LGTM.
force-pushed from 4a1b0cc to 8d8b434
Added a test that the answer is wrong with the wrong squared norms. I tried @amueller's suggestion of passing zeros, but the function does …
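A sketch of that kind of regression test, against a toy version of the expansion (`dist_sq` is a hypothetical helper, not the scikit-learn function): deliberately wrong norms must shift the answer, which proves the arguments are consumed rather than silently recomputed.

```python
import numpy as np

def dist_sq(X, Y, XX=None, YY=None):
    # Toy sketch of the ||x||^2 - 2 x.y + ||y||^2 expansion,
    # honouring caller-supplied squared norms when given.
    if XX is None:
        XX = (X ** 2).sum(axis=1)[:, None]
    if YY is None:
        YY = (Y ** 2).sum(axis=1)[None, :]
    return XX - 2.0 * (X @ Y.T) + YY

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)
good = dist_sq(X, Y)

# Wrong squared norms (off by +1) must change every entry by exactly +1,
# showing the passed-in values are actually used.
bad = dist_sq(X, Y, XX=(X ** 2).sum(axis=1)[:, None] + 1.0)
assert not np.allclose(good, bad)
assert np.allclose(bad, good + 1.0)
```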
@jnothman merge?
LGTM. Merging. Thanks!
[MRG + 1] support X_norm_squared in euclidean_distances
There's a comment in `euclidean_distances` saying … That's not necessarily true, though. I ran into a situation today where I have a whole bunch of sets and need to do something based on the distances between each pair of sets. It's helpful to cache the squared norms for each of the sets; if I did that and called the function with just `Y_norm_squared` for each pair, it would still recompute the norms for `X` every time. (Of course, I can just do it without the helper function, which is what I'm doing now, but it's nicer to use helpers....)

Another situation is when you happen to already have the squared norms for a set `X` and then want `euclidean_distances(X)`. I guessed that maybe `euclidean_distances(X, Y_norm_squared=X_norm_sq)` would work, but looking at the code, it doesn't actually use `X_norm_sq`. Now `euclidean_distances` can handle that case too.

This also adds an extremely simple test that passing `X_norm_squared` and/or `Y_norm_squared` gives the same result; previously there was no test that used `Y_norm_squared`.

As an aside: I have no idea why `XX` is computed with `X * X` but `YY` with `Y ** 2` (which necessitates the annoying copy code when it's sparse); it seems like it should be exactly the same situation, apart from the very minor difference in shape. I left it as-is, though.