Cosine distance metric for sparse matrices#2368
Merged
mblondel merged 3 commits intoscikit-learn:masterfrom Aug 19, 2013
emsrc:cosine_distance
Merged
Cosine distance metric for sparse matrices#2368mblondel merged 3 commits intoscikit-learn:masterfrom emsrc:cosine_distance
mblondel merged 3 commits intoscikit-learn:masterfrom
emsrc:cosine_distance
Conversation
- Added 'cosine_distances' function to sklearn.metrics.pairwise. - Added 'cosine' as metric in 'pairwise_distances' function. - Corrected doc string of same function, because all metrics based on the 'manhattan_distances' function (i.e. 'cityblock', 'l1', and 'manhattan') do currently NOT support sparse matrices. - Added corresponding corresponding unit test.
Member
|
LGTM thanks a lot :) |
sklearn/metrics/pairwise.py
Outdated
Contributor
Author
|
@mblondel: fixed doc string and adopted your suggestion for subtraction without copy. Thanks! |
Member
|
+1 for me as well. Waiting for @mblondel final comments / or merge. |
mblondel
added a commit
that referenced
this pull request
Aug 19, 2013
Cosine distance metric for sparse matrices
Member
|
Merged, thanks for the contrib! |
Contributor
Author
|
Thanks for accepting this. Quick turnaround time! :-) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I wanted to use Cosine as a distance metric in the NearestCentroid classifier, but the existing implementation calls scipy.spatial.distance.cosine, which cannot handle sparse matrices. The same issue started #732. Although a lot of good things resulted from this, it seems the issue that started it -- Cosine distance for sparse matrices -- was never addressed.
As there already is a Cosine similarity metric that works with sparse input (sklearn.metrics.pairwise.cosine_similarity), adding a corresponding distance metric is trivial (simply 1.0 minus Cosine sim).
I incidentally noticed that the doc string for the function 'pairwise_distances' suggests that sparse matrices are supported for all metrics in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. However, all metrics calling on 'manhattan_distances' (i.e. 'cityblock', 'l1' and 'manhattan') raise an exception with sparse input. So I took the liberty of correcting that doc string too.
To sum up the changes:
the 'manhattan_distances' function (i.e. 'cityblock', 'l1', and
'manhattan') do currently NOT support sparse matrices.