[MRG+1] DBSCAN: faster, weighted samples, and sparse input #3994
agramfort merged 5 commits into scikit-learn:master from
Conversation
Note: because this permutes a different array, results will not be identical to previous versions for a fixed random state.
I have extended this PR from its initial purpose, such that DBSCAN now supports sample weights. Ping @robertlayton |
Hmm... just realised this leaves
I think so. |
sklearn/cluster/dbscan_.py
(point i is in the neighborhood of point i; while true, it is useless information).
This looks great. +1 for merge. Don't forget to update the
Thanks for the review (and the +1), @ogrisel! Your comments have been addressed. |
+1 for merge. nice job @jnothman ! |
Hmm... I just noticed that DBSCAN isn't really well tested; in particular, the boundary case where min_samples equals the number of points within the radius isn't tested, despite the note. This PR broke the previous behaviour but correctly matched the
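A minimal sketch of the boundary case in question, assuming the standard DBSCAN convention that a point counts itself as a neighbor (the eps and min_samples values here are illustrative, not from the PR's tests):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three collinear points 0.1 apart. With eps=0.15, the middle point has
# exactly min_samples=3 neighbors (itself plus both ends), so it is a
# core point; the end points each have only 2 neighbors, so they are
# border points attached to the middle point's cluster.
X = np.array([[0.0], [0.1], [0.2]])
labels = DBSCAN(eps=0.15, min_samples=3).fit(X).labels_
# All three points end up in one cluster; none are noise.
```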
DBSCAN now supports sparse matrix input, and weighted samples (a compact representation of density when duplicate points exist; also useful when weighting is possible for BIRCH's global clustering).
I have also vectorized the implementation. On my machine this reduces the DBSCAN runs in the cluster-comparison toy examples from ~0.6s each to ~0.02s each.
(This could be sped up further by allowing a dual-tree radius_neighbors lookup that reuses the computed tree. The toy examples have
n_features=2, so neighbor calculation does not take up as much of the overall time as it might otherwise.)
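A small sketch of the sparse-input support, using synthetic high-dimensional data (the blob construction, eps, and min_samples are illustrative assumptions, not from the PR):

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)

# Two tight blobs embedded in a 50-dimensional space: the first ten rows
# vary only along feature 0, the last ten only along feature 1, so most
# entries are zero and a CSR matrix is a natural representation.
X = np.zeros((20, 50))
X[:10, 0] = 1.0 + 0.01 * rng.rand(10)
X[10:, 1] = 1.0 + 0.01 * rng.rand(10)
X_sparse = sparse.csr_matrix(X)

# Within-blob distances are <= 0.01 and between-blob distances are ~sqrt(2),
# so eps=0.5 cleanly separates the two clusters.
labels = DBSCAN(eps=0.5, min_samples=5).fit(X_sparse).labels_
```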