
[MGR] Add memory efficient mode for DBSCAN#6813

Closed
viirya wants to merge 5 commits into scikit-learn:master from viirya:memory-efficient-dbscan

Conversation

@viirya
Contributor

@viirya viirya commented May 23, 2016

What does this implement/fix? Explain your changes.

Currently the DBSCAN implementation computes the full distance matrix for nearest neighbors before running the DBSCAN algorithm. When facing large-scale data, the memory pressure is huge and can cause the process to be killed. For example, a sample program like the following, processing 1 million samples, can't be run on a machine with 8 GB of RAM.

This patch adds a new parameter save_memory to DBSCAN. In this mode, we don't compute the distance matrix at once but instead query nearest neighbors on the fly while the DBSCAN algorithm runs. This keeps the memory requirements of large-scale data manageable.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=1000000, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, save_memory=True).fit(X)
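The on-the-fly querying idea behind this patch can be sketched in pure Python (an illustration only, not the actual Cython change; `dbscan_low_memory` is a hypothetical helper name). Instead of materializing every point's neighborhood up front, each eps-neighborhood is fetched from the tree index when the point is processed:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def dbscan_low_memory(X, eps, min_samples):
    """Toy DBSCAN that queries each eps-neighborhood on demand instead of
    precomputing all neighborhoods, so peak memory stays roughly O(n)."""
    nn = NearestNeighbors(radius=eps, algorithm="ball_tree").fit(X)
    n = X.shape[0]
    labels = np.full(n, -1, dtype=int)   # -1 = noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neigh = nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]
        if len(neigh) < min_samples:
            continue                      # not a core point
        labels[i] = cluster
        stack = list(neigh)
        while stack:                      # expand this cluster
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if visited[j]:
                continue
            visited[j] = True
            neigh_j = nn.radius_neighbors(X[j:j + 1], return_distance=False)[0]
            if len(neigh_j) >= min_samples:
                stack.extend(neigh_j)     # j is core: grow through it
        cluster += 1
    return labels


# Small demo: noise mask and cluster count should match scikit-learn's DBSCAN,
# since core points and core-connectivity are order-independent.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
labels = dbscan_low_memory(X, eps=0.3, min_samples=10)
ref = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_
```

The trade-off is repeated tree queries in exchange for never holding all neighborhood lists in memory at once, which is essentially what the save_memory mode does inside the Cython code.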

@jnothman
Member

Currently there are two approaches available to handle large datasets:

  • sample weight support allows you to collapse near-duplicate points
  • precomputed sparse matrix of neighbors allows the memory costs to be limited to those within the eps neighborhood, with the computation taken out of DBSCAN's control.

Is neither acceptable to you?
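The second approach mentioned above can be sketched as follows (a minimal illustration of the documented metric='precomputed' path; the data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

eps = 0.3
# Precompute only the within-eps distances as a sparse CSR graph; entries
# missing from the graph are treated by DBSCAN as "further than eps".
graph = NearestNeighbors(radius=eps).fit(X).radius_neighbors_graph(
    X, mode="distance")

db = DBSCAN(eps=eps, min_samples=10, metric="precomputed").fit(graph)
labels = db.labels_

# Reference: the same clustering computed directly on the points.
ref = DBSCAN(eps=eps, min_samples=10).fit(X).labels_
```

Memory is then bounded by the number of within-eps pairs rather than n², and the neighborhood computation happens outside DBSCAN itself.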

@viirya
Contributor Author

viirya commented May 23, 2016

sample weight support allows you to collapse near-duplicate points

This doesn't really resolve the issue; we would still hit it whenever the dataset's points aren't heavily duplicated, which is the common case.

precomputed sparse matrix of neighbors allows the memory costs to be limited to those within the eps neighborhood, with the computation taken out of DBSCAN's control.

For option 2, since we choose ball tree (or another tree-based nearest-neighbors algorithm), DBSCAN actually computes the sparse matrix of neighbors within eps for us. However, with the example program on 1 million samples, the memory footprint of that matrix still causes the program to fail. To make this option work for large-scale data, the allowed eps must, I think, be limited to a very small range. For such use cases, I think providing an option that significantly reduces memory pressure for large datasets is worthwhile.

@amueller
Member

@jnothman how explicit are the docs on the precomputed sparse matrix? I think some people are missing that possibility.

@jnothman
Member

jnothman commented Oct 13, 2016

docstring:

"
93
94 Sparse neighborhoods can be precomputed using
95 :func:`NearestNeighbors.radius_neighbors_graph
"

narrative:

"""
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g. with sparse matrices). This matrix will consume n^2 floats. A couple of mechanisms for getting around this are:

  • A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'.
  • The dataset can be compressed, either by removing exact duplicates if these occur in your data, or by using BIRCH. Then you only have a relatively small number of representatives for a large number of points. You can then provide a sample_weight when fitting DBSCAN.
    """

Member

@jnothman jnothman left a comment


Sorry for not giving this enough love for a long time, @viirya. I suspect it's the right way to go. Are you interested in continuing to work on it?

I've only reviewed the pyx so far.

eps,
min_samples,
sample_weight,
neigh,
Member


I think we should just pass in get_neighborhood() so this isn't littered with the parameters for query_nn.
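The suggestion can be illustrated in plain Python (`make_get_neighborhood` is a hypothetical name; the real change would live in the Cython code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def make_get_neighborhood(X, eps):
    """Bundle the query parameters into one callable, so the inner
    clustering loop only needs get_neighborhood(i) rather than eps,
    the fitted index, and the query options separately."""
    nn = NearestNeighbors(radius=eps).fit(X)

    def get_neighborhood(i):
        # Indices of all points within eps of point i (including i itself).
        return nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]

    return get_neighborhood


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
get_neighborhood = make_get_neighborhood(X, eps=0.5)
```

The closure keeps the clustering routine's signature stable even if the underlying query mechanism (precomputed matrix vs. on-the-fly tree queries) changes.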

min_samples,
sample_weight,
neigh,
mode):
Member


mode, here, should become a bint named save_memory or something.

@darribas

darribas commented May 6, 2018

What's the status of this PR? I'd be very much interested in a memory-efficient version of DBSCAN, and this seems to do it. But on 0.19.1 I get (after replicating all the example code above):

In [8]: db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-74361e839f84> in <module>()
----> 1 db = DBSCAN(eps=1.3, algorithm="ball_tree", min_samples=10, mode="mem").fit(X)

TypeError: __init__() got an unexpected keyword argument 'mode'

In [9]: 

@viirya
Contributor Author

viirya commented May 6, 2018

If others think this is the right way to go, I can continue this work.

@jnothman
Member

jnothman commented May 6, 2018 via email

@viirya
Contributor Author

viirya commented May 8, 2018

Let me sync with the current branch and address the above comments first.

@viirya viirya changed the title Add memory efficient mode for DBSCAN [MGR] Add memory efficient mode for DBSCAN May 20, 2018
@viirya
Contributor Author

viirya commented May 20, 2018

@jnothman I think I have addressed your previous comments. Please take another look. Thanks.

@viirya viirya force-pushed the memory-efficient-dbscan branch from 9f61169 to 71dfac6 Compare May 21, 2018 03:43
@viirya
Contributor Author

viirya commented May 21, 2018

also cc @agramfort

@jnothman
Member

jnothman commented May 21, 2018 via email

@viirya
Contributor Author

viirya commented May 21, 2018

Ok. Then I'm closing this.

@viirya viirya closed this May 21, 2018


4 participants