[MRG+1] Add a set recording pushed elements in DBSCAN #6799
agramfort merged 3 commits into scikit-learn:master
Conversation
Is it possible to have a test? A gist to better understand the problem?
Hmm, as it is an internal variable, I think it might be difficult to test. Besides, I think this change should be straightforward.
Any performance drop?
sklearn/cluster/_dbscan_inner.pyx (outdated)

    cdef np.npy_intp i, label_num = 0, v
    cdef np.ndarray[np.npy_intp, ndim=1] neighb
    cdef vector[np.npy_intp] stack
    cdef cmap[np.npy_intp, np.npy_intp] push_map
Yea. Thanks. I will update this later.
@agramfort I actually think this is fixing an algorithm bug: we shouldn't need to visit the same point twice in this kind of search algorithm.
Benchmark with the following sample codes:

Without this patch: Time = 0.110082864761
With this patch: Time = 0.136781930923
@jnothman merge if you're happy. No strong feeling here.
@viirya can you confirm that this makes a substantial reduction to the memory usage in some reasonable cases? Can you give a sense of how much (by using
@jnothman Using the sample codes above, with 10000 sample data, I count the number of times
To clarify, in your initial description you say "the duplicate elements in this vector cause memory pressure". Is that because you actually had a problem where there was a memory shortage, which this patch fixes? It still seems surprising.
When you run the algorithm with more than 1 million sample data, with enough big
I guess I'd like to clarify whether the big distance matrix is really the

On 26 May 2016 at 13:39, Liang-Chi Hsieh notifications@github.com wrote:
ping @jnothman any more concerns about this?
LGTM. Usually we get two reviews, but @agramfort's was a +0, so I'll wait for another.
      v = neighb[i]
-     if labels[v] == -1:
+     if labels[v] == -1 and seen.count(v) == 0:
+         seen.insert(v)
To me, you can have a noise point that becomes a non-core point (in a cluster) afterwards. Is this still possible with this change?
I think this change would not change the previous result.

I agree.
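The concern above can be checked against a hypothetical pure-Python rendering of the patched inner loop (the function and variable names here are invented for this sketch and only loosely mirror the Cython file): a noise point that neighbors a core point is still absorbed into the cluster as a non-core member, because the new `seen` check only skips points already pushed for this cluster.

```python
# Hypothetical pure-Python analogue of the patched inner loop (not the
# actual Cython code). `seen` plays the role of the C++ std::set added by
# the patch: the condition `v not in seen` corresponds to
# `seen.count(v) == 0`, and `seen.add(v)` to `seen.insert(v)`.

def expand_cluster(start, is_core, neighborhoods, labels, label_num):
    stack = [start]
    seen = {start}                 # points already pushed for this cluster
    while stack:
        i = stack.pop()
        labels[i] = label_num      # noise (-1) points join the cluster here
        if not is_core[i]:
            continue               # non-core members do not expand further
        for v in neighborhoods[i]:
            # patched condition: labels[v] == -1 and seen.count(v) == 0
            if labels[v] == -1 and v not in seen:
                seen.add(v)
                stack.append(v)

# Point 3 is non-core (it would be noise on its own) but neighbors core
# point 2, so it still ends up in the cluster as a non-core point.
labels = [-1, -1, -1, -1]
is_core = [True, True, True, False]
neighborhoods = [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3], [2, 3]]
expand_cluster(0, is_core, neighborhoods, labels, label_num=0)
assert labels == [0, 0, 0, 0]
```

Since each point enters `seen` before it is pushed and is pushed at most once, the set changes only how often a point sits on the stack, not which points are reached.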
OK, fair enough. Please fix the docstring, and +1 for merge.
@jnothman @agramfort Thanks! I've updated the document.
thanks @viirya
What does this implement/fix? Explain your changes.
In DBSCAN we push the neighbors waiting to be visited onto a vector. When the data set is big and eps is large enough, there can be many neighbors, including many duplicates. In this case, the duplicate elements in this vector cause memory pressure. This patch adds a set that records pushed elements, so duplicate elements are not pushed again.
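To sketch the mechanism (with invented names, not the actual Cython code), consider the worst case where every point neighbors every other point, as happens when eps exceeds the data's diameter: without the set, each expansion step re-pushes all still-unlabeled points, so the stack accumulates O(n^2) entries; with the set, each point is pushed at most once.

```python
# Hypothetical sketch counting stack pushes in a dense-neighborhood worst
# case. With n mutually-neighboring core points, the loop without
# deduplication performs n*(n-1)/2 pushes (duplicates included), while the
# patched variant performs n - 1.

def count_pushes(n, dedup):
    neighborhoods = [list(range(n))] * n   # every point neighbors every point
    labels = [-1] * n
    stack, seen, pushes = [], {0}, 0
    i = 0
    while True:
        if labels[i] == -1:
            labels[i] = 0
            for v in neighborhoods[i]:
                # `dedup` toggles the patch's seen-set check on and off
                if labels[v] == -1 and (not dedup or v not in seen):
                    seen.add(v)
                    stack.append(v)
                    pushes += 1
        if not stack:
            break
        i = stack.pop()
    return pushes

assert count_pushes(100, dedup=False) == 100 * 99 // 2  # 4950 pushes
assert count_pushes(100, dedup=True) == 99              # each point once
```

This quadratic-versus-linear gap in stack growth is the "memory pressure" the description refers to; the benchmark numbers above suggest the set lookup costs a modest amount of extra time in exchange.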