Skip to content

Cosine distance metric for sparse matrices#2368

Merged
mblondel merged 3 commits intoscikit-learn:masterfrom
emsrc:cosine_distance
Aug 19, 2013
Merged

Cosine distance metric for sparse matrices#2368
mblondel merged 3 commits intoscikit-learn:masterfrom
emsrc:cosine_distance

Conversation

@emsrc
Copy link
Copy Markdown
Contributor

@emsrc emsrc commented Aug 18, 2013

I wanted to use Cosine as a distance metric in the NearestCentroid classifier, but the existing implementation calls scipy.spatial.distance.cosine, which cannot handle sparse matrices. The same issue started #732. Although a lot of good things resulted from this, it seems the issue that started it -- Cosine distance for sparse matrices -- was never addressed.

As there already is a Cosine similarity metric that works with sparse input (sklearn.metrics.pairwise.cosine_similarity), adding a corresponding distance metric is trivial (simply 1.0 minus Cosine sim).

I incidentally noticed that the doc string for the function 'pairwise_distances' suggests that sparse matrices are supported for all metrics in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. However, all metrics calling on 'manhattan_distances' (i.e. 'cityblock', 'l1' and 'manhattan') raise an exception with sparse input. So I took the liberty of correcting that doc string too.

To sum up the changes:

  • Added 'cosine_distances' function to sklearn.metrics.pairwise.
  • Added 'cosine' as metric in 'pairwise_distances' function.
  • Corrected doc string of same function, because all metrics based on
    the 'manhattan_distances' function (i.e. 'cityblock', 'l1', and
    'manhattan') do currently NOT support sparse matrices.
  • Added corresponding corresponding unit test.

emsrc added 2 commits August 18, 2013 19:17
- Added 'cosine_distances' function to sklearn.metrics.pairwise.
- Added 'cosine' as metric in 'pairwise_distances' function.
- Corrected doc string of same function, because all metrics based on
the 'manhattan_distances' function (i.e. 'cityblock', 'l1', and
'manhattan') do currently NOT support sparse matrices.
- Added corresponding corresponding unit test.
@amueller
Copy link
Copy Markdown
Member

LGTM thanks a lot :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distance matrix

@emsrc
Copy link
Copy Markdown
Contributor Author

emsrc commented Aug 19, 2013

@mblondel: fixed doc string and adopted your suggestion for subtraction without copy. Thanks!

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Aug 19, 2013

+1 for me as well. Waiting for @mblondel final comments / or merge.

mblondel added a commit that referenced this pull request Aug 19, 2013
Cosine distance metric for sparse matrices
@mblondel mblondel merged commit dcf827e into scikit-learn:master Aug 19, 2013
@mblondel
Copy link
Copy Markdown
Member

Merged, thanks for the contrib!

@emsrc
Copy link
Copy Markdown
Contributor Author

emsrc commented Aug 19, 2013

Thanks for accepting this. Quick turnaround time! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants