
[MGR] Add memory efficient mode for DBSCAN#6813

Closed
viirya wants to merge 5 commits into scikit-learn:master from viirya:memory-efficient-dbscan

Conversation

@viirya
Contributor

@viirya viirya commented May 23, 2016

What does this implement/fix? Explain your changes.

Currently the DBSCAN implementation computes the full distance matrix for nearest neighbors before running the DBSCAN algorithm. When facing large-scale data, the memory pressure is huge and can cause the process to be killed. For example, a sample program like the following, processing 1 million samples, can't be run on a machine with 8 GB of RAM.

This patch adds a new parameter save_memory to DBSCAN. In this mode, we don't compute the distance matrix at once but instead query nearest neighbors on the fly while the DBSCAN algorithm runs. This keeps the memory requirements of large-scale data manageable.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=1000000, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, save_memory=True).fit(X)
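The on-the-fly querying idea behind this patch can be sketched in pure Python (an illustration only, not the actual Cython change; `dbscan_low_memory` is a hypothetical helper name). Instead of materializing every point's neighborhood up front, each eps-neighborhood is fetched from the tree index when the point is processed:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def dbscan_low_memory(X, eps, min_samples):
    """Toy DBSCAN that queries each eps-neighborhood on demand instead of
    precomputing all neighborhoods, so peak memory stays roughly O(n)."""
    nn = NearestNeighbors(radius=eps, algorithm="ball_tree").fit(X)
    n = X.shape[0]
    labels = np.full(n, -1, dtype=int)   # -1 = noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neigh = nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]
        if len(neigh) < min_samples:
            continue                      # not a core point
        labels[i] = cluster
        stack = list(neigh)
        while stack:                      # expand this cluster
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if visited[j]:
                continue
            visited[j] = True
            neigh_j = nn.radius_neighbors(X[j:j + 1], return_distance=False)[0]
            if len(neigh_j) >= min_samples:
                stack.extend(neigh_j)     # j is core: grow through it
        cluster += 1
    return labels


# Small demo: noise mask and cluster count should match scikit-learn's DBSCAN,
# since core points and core-connectivity are order-independent.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
labels = dbscan_low_memory(X, eps=0.3, min_samples=10)
ref = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_
```

The trade-off is repeated tree queries in exchange for never holding all neighborhood lists in memory at once, which is essentially what the save_memory mode does inside the Cython code.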

@jnothman
Member

Currently there are two approaches available to handle large datasets:

  • sample weight support allows you to collapse near-duplicate points
  • precomputed sparse matrix of neighbors allows the memory costs to be limited to those within the eps neighborhood, with the computation taken out of DBSCAN's control.

Is neither acceptable to you?
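The second approach mentioned above can be sketched as follows (a minimal illustration of the documented metric='precomputed' path; the data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

eps = 0.3
# Precompute only the within-eps distances as a sparse CSR graph; entries
# missing from the graph are treated by DBSCAN as "further than eps".
graph = NearestNeighbors(radius=eps).fit(X).radius_neighbors_graph(
    X, mode="distance")

db = DBSCAN(eps=eps, min_samples=10, metric="precomputed").fit(graph)
labels = db.labels_

# Reference: the same clustering computed directly on the points.
ref = DBSCAN(eps=eps, min_samples=10).fit(X).labels_
```

Memory is then bounded by the number of within-eps pairs rather than n², and the neighborhood computation happens outside DBSCAN itself.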

@viirya
Contributor Author

viirya commented May 23, 2016

sample weight support allows you to collapse near-duplicate points

This doesn't really resolve the issue; we would still hit it whenever the dataset's points aren't heavily duplicated, which is the common case.

precomputed sparse matrix of neighbors allows the memory costs to be limited to those within the eps neighborhood, with the computation taken out of DBSCAN's control.

For option 2, since we choose ball tree (or another tree-based nearest-neighbors algorithm), DBSCAN actually computes the sparse matrix of neighbors within eps for us. However, with the example program on 1 million samples, the memory footprint of that matrix still causes the program to fail. To make this option work for large-scale data, the allowed eps must, I think, be limited to a very small range. For such use cases, I think providing an option that significantly reduces memory pressure for large datasets is worthwhile.

@amueller
Member

@jnothman how explicit are the docs on the precomputed sparse matrix? I think some people are missing that possibility.

@jnothman
Member

jnothman commented Oct 13, 2016

docstring:

"
93
94 Sparse neighborhoods can be precomputed using
95 :func:`NearestNeighbors.radius_neighbors_graph
"

narrative:

"""
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g. with sparse matrices). This matrix will consume n^2 floats. A couple of mechanisms for getting around this are:

  • A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'.
  • The dataset can be compressed, either by removing exact duplicates if these occur in your data, or by using BIRCH. Then you only have a relatively small number of representatives for a large number of points. You can then provide a sample_weight when fitting DBSCAN.
    """

Member

@jnothman jnothman left a comment


Sorry for not giving this enough love for a long time, @viirya. I suspect it's the right way to go. Are you interested in continuing to work on it?

I've only reviewed the pyx so far.

eps,
min_samples,
sample_weight,
neigh,
Member


I think we should just pass in get_neighborhood() so this isn't littered with the parameters for query_nn.
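The suggestion can be illustrated in plain Python (`make_get_neighborhood` is a hypothetical name; the real change would live in the Cython code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def make_get_neighborhood(X, eps):
    """Bundle the query parameters into one callable, so the inner
    clustering loop only needs get_neighborhood(i) rather than eps,
    the fitted index, and the query options separately."""
    nn = NearestNeighbors(radius=eps).fit(X)

    def get_neighborhood(i):
        # Indices of all points within eps of point i (including i itself).
        return nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]

    return get_neighborhood


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
get_neighborhood = make_get_neighborhood(X, eps=0.5)
```

The closure keeps the clustering routine's signature stable even if the underlying query mechanism (precomputed matrix vs. on-the-fly tree queries) changes.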

min_samples,
sample_weight,
neigh,
mode):
Member


mode, here, should become a bint named save_memory or something.

@darribas

darribas commented May 6, 2018

What's the status of this PR? I'd be very much interested in a memory-efficient version of DBSCAN, and this seems to do it. But on 0.19.1 I get (after replicating all the example code above):

In [8]: db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-74361e839f84> in <module>()
----> 1 db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)

TypeError: __init__() got an unexpected keyword argument 'mode'

In [9]: 

@viirya
Contributor Author

viirya commented May 6, 2018

If others think this is the right way to go, I can continue this work.

@jnothman
Member

jnothman commented May 6, 2018 via email

@viirya
Contributor Author

viirya commented May 8, 2018

Let me sync with the current branch and address the above comments first.

@viirya viirya changed the title Add memory efficient mode for DBSCAN [MGR] Add memory efficient mode for DBSCAN May 20, 2018
@viirya
Contributor Author

viirya commented May 20, 2018

@jnothman I think I have addressed your previous comments. Please take another look. Thanks.

@viirya viirya force-pushed the memory-efficient-dbscan branch from 9f61169 to 71dfac6 Compare May 21, 2018 03:43
@viirya
Contributor Author

viirya commented May 21, 2018

also cc @agramfort

@jnothman
Member

jnothman commented May 21, 2018 via email

@viirya
Contributor Author

viirya commented May 21, 2018

Ok. Then I'm closing this.

@viirya viirya closed this May 21, 2018


4 participants