[MRG + 1] support X_norm_squared in euclidean_distances #2459
GaelVaroquaux merged 1 commit into scikit-learn:master from …
Conversation
I don't know why the tests passed locally and failed on the server. Will look into that.
This is not mergeable anymore after my optimizations in 81950ba. The …
Okay, looks good. I still think passing in the squared norms of `X` makes sense; I'll update to take your optimizations into account and track down the test problem sometime soon.
Updated this to use the new version, with basically the same changeset as before. I put the `X_norm_squared` argument last in the signature to keep backwards compatibility with positional arguments, though logically it should probably go earlier. Is that something I should do?
Not sure why this got stale; it looks like a good contribution. Can you rebase?
force-pushed from 8d2978c to 4a1b0cc
@amueller Okay, rebased and switched to use … I still kept …
I'm OK with that ordering. LGTM.
These tests don't actually check that the arguments are being used. Do we care? Should we add tests with incorrect norms to ensure that there aren't bugs in the use of the norms, or is code coverage sufficient?
Good point. It's too late in NYC. What we want to test is that it gets faster, right? But that's not nice to test.
Maybe adding a test that the squared norms are actually used would be OK. We didn't have that for `Y_norm_squared`, though, right?
Maybe just add a test that passes both as zero and see if it just computes -2 times the dot product?
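The zero-norms idea can be sketched against a toy version of the expansion (`dist_sq` here is a hypothetical helper, not scikit-learn's code): if both squared norms are passed as zero and actually honoured, only the cross term survives.

```python
import numpy as np

def dist_sq(X, Y, XX=None, YY=None):
    # Toy sketch of the ||x||^2 - 2 x.y + ||y||^2 expansion,
    # honouring caller-supplied squared norms when given.
    if XX is None:
        XX = (X ** 2).sum(axis=1)[:, None]
    if YY is None:
        YY = (Y ** 2).sum(axis=1)[None, :]
    return XX - 2.0 * (X @ Y.T) + YY

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)

# Passing zeros for both norms should leave only the cross term:
D0 = dist_sq(X, Y, XX=np.zeros((5, 1)), YY=np.zeros((1, 4)))
assert np.allclose(D0, -2.0 * (X @ Y.T))
```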
Apart from that nitpick, this LGTM.
force-pushed from 4a1b0cc to 8d8b434
Added a test that the answer is wrong with the wrong squared norms. I tried @amueller's suggestion of passing zeros, but the function does …
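A sketch of that kind of regression test, against a toy version of the expansion (`dist_sq` is a hypothetical helper, not the scikit-learn function): deliberately wrong norms must shift the answer, which proves the arguments are consumed rather than silently recomputed.

```python
import numpy as np

def dist_sq(X, Y, XX=None, YY=None):
    # Toy sketch of the ||x||^2 - 2 x.y + ||y||^2 expansion,
    # honouring caller-supplied squared norms when given.
    if XX is None:
        XX = (X ** 2).sum(axis=1)[:, None]
    if YY is None:
        YY = (Y ** 2).sum(axis=1)[None, :]
    return XX - 2.0 * (X @ Y.T) + YY

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)
good = dist_sq(X, Y)

# Wrong squared norms (off by +1) must change every entry by exactly +1,
# showing the passed-in values are actually used.
bad = dist_sq(X, Y, XX=(X ** 2).sum(axis=1)[:, None] + 1.0)
assert not np.allclose(good, bad)
assert np.allclose(bad, good + 1.0)
```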
@jnothman merge?
LGTM. Merging. Thanks!
[MRG + 1] support X_norm_squared in euclidean_distances
There's a comment in `euclidean_distances` saying … That's not necessarily true, though. I ran into a situation today where I have a whole bunch of sets and need to do something based on the distances between each pair of sets. It's helpful to cache the squared norms for each of the sets; if I did that and called the function with just `Y_norm_squared` for each pair, it would still recompute the norms for `X` every time. (Of course, I can just do it without the helper function, which is what I'm doing now, but it's nicer to use helpers....)

Another situation is when you happen to already have the squared norms for a set `X` and then want `euclidean_distances(X)`. I guessed that maybe `euclidean_distances(X, Y_norm_squared=X_norm_sq)` would work, but looking at the code, it doesn't actually use `X_norm_sq`. Now `euclidean_distances` can handle that case too.

This also adds an extremely simple test that passing `X_norm_squared` and/or `Y_norm_squared` gives the same result; previously there was no test that used `Y_norm_squared`.

As an aside: I have no idea why `XX` is computed with `X * X` but `YY` with `Y ** 2` (which necessitates the annoying copy code when it's sparse); it seems like it should be exactly the same situation, apart from the very minor difference in shape. I left it as-is, though.