[MGR] Add memory efficient mode for DBSCAN#6813
[MGR] Add memory efficient mode for DBSCAN#6813viirya wants to merge 5 commits intoscikit-learn:masterfrom
Conversation
|
Currently there are two approaches available to handle large datasets:
Is neither acceptable to you? |
This seems not really resolve this issue and I think we still have this issue if dataset points are not so much duplicate in general.
For the option 2, as we choose ball-tree (or other tree-based nn), DBSCAN actually computes sparse matrix of neighbors within |
|
@jnothman how explicit are the docs on the precomputed sparse matrix? I think some people are missing that possibility. |
|
docstring: " narrative: """
|
sklearn/cluster/_dbscan_inner.pyx
Outdated
| eps, | ||
| min_samples, | ||
| sample_weight, | ||
| neigh, |
There was a problem hiding this comment.
I think we should just pass in get_neighborhood() so this isn't littered with the parameters for query_nn
sklearn/cluster/_dbscan_inner.pyx
Outdated
| min_samples, | ||
| sample_weight, | ||
| neigh, | ||
| mode): |
There was a problem hiding this comment.
mode, here, should become a bint named save_memory or something.
|
What's the status of this PR? I'd be very much interested in a memory-efficient version of DBSCAN and this seems to do it. But on 0.19.1 I get (after replicating all the example code above): In [8]: db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-74361e839f84> in <module>()
----> 1 db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)
TypeError: __init__() got an unexpected keyword argument 'mode'
In [9]: |
|
If others think this is right way to go, I can continue this work. |
|
well we certainly have method or strategy or algorithm switches elsewhere
to provide efficiency tradeoffs
|
|
Let me sync with current branch and address above comments first. |
|
@jnothman I think I addressed your previous comments. Please help review this again. Thanks. |
9f61169 to
71dfac6
Compare
|
also cc @agramfort |
|
I'm persuaded by the opinion at
#1984 (comment)
that we're better off trying to merge OPTICS, which also provides a
memory-efficient DBSCAN, than making this change... Sorry for wasting your
effort...
|
|
Ok. Then I'm closing this. |
What does this implement/fix? Explain your changes.
Currently DBSCAN implementation computes distance matrix at once for nearest neighbors before performing DBSCAN algorithm. When facing large-scale data, the memory pressure would be huge and cause the process to be filled. For example, a sample program as following to process 1 million samples can't be run on a machine with 8G RAM.
This patch adds a new parameter
save_memorytoDBSCAN. Under this mode, we don't compute the distance matrix at once but query nearest neighbors online when performing DBSCAN algorithm. It can satisfy memory requirement for large-scale data.