Skip to content

[MRG+1] Add a set recording pushed elements in DBSCAN#6799

Merged
agramfort merged 3 commits intoscikit-learn:masterfrom
viirya:pushmap-for-dbscan
Jun 3, 2016
Merged

[MRG+1] Add a set recording pushed elements in DBSCAN#6799
agramfort merged 3 commits intoscikit-learn:masterfrom
viirya:pushmap-for-dbscan

Conversation

@viirya
Copy link
Copy Markdown
Contributor

@viirya viirya commented May 19, 2016

What does this implement/fix? Explain your changes.

We push the neighbors waiting for visiting into a vector in DBSCAN. When the data is big and eps is large enough, the neighbors could be many and duplicate. In this case, the duplicate elements in this vector cause memory pressure. This patch adds a set to record pushed elements and avoids duplicate elements to be pushed.

@viirya viirya changed the title Add a push map recording pushed elements in DBSCAN [MRG] Add a push map recording pushed elements in DBSCAN May 21, 2016
@agramfort
Copy link
Copy Markdown
Member

it is possible to have a test ? gist to better understand the pb?

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented May 24, 2016

hmm, as it is an internal variable, I think it might be difficult to test it? Besides, I think this change should be straightforward.

@agramfort
Copy link
Copy Markdown
Member

agramfort commented May 24, 2016 via email

cdef np.npy_intp i, label_num = 0, v
cdef np.ndarray[np.npy_intp, ndim=1] neighb
cdef vector[np.npy_intp] stack
cdef cmap[np.npy_intp, np.npy_intp] push_map
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be using set

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea. Thanks. I will update this later.

@jnothman
Copy link
Copy Markdown
Member

@agramfort I actually think this is fixing an algorithm bug: we shouldn't need to visit the same point twice.

In this kind of search algorithm, push_map would better be a set named seen.

@viirya viirya changed the title [MRG] Add a push map recording pushed elements in DBSCAN [MRG] Add a set recording pushed elements in DBSCAN May 24, 2016
@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented May 24, 2016

Benchmark with the following sample codes:

import time

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler


##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=10000, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
t1 = time.time()
db = DBSCAN(eps=0.1, algorithm="ball_tree", min_samples=10).fit(X)
t2 = time.time()
print "Time = %s" % (t2 - t1)

Without this patch: Time = 0.110082864761

With this patch: Time = 0.136781930923

@agramfort
Copy link
Copy Markdown
Member

@jnothman merge if you're happy. No strong feeling here.

@jnothman
Copy link
Copy Markdown
Member

@viirya can you confirm that this makes a substantial reduction to the memory usage in some reasonable cases? Can you give a sense of how much (by using memory_profiler, or counting the number of times seen.count() == 1)?

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented May 25, 2016

@jnothman Using the sample codes above, with 10000 sample data, I count the number of times seen.count(v) == 1, the number is 548248.

@jnothman
Copy link
Copy Markdown
Member

To clarify, in your initial description you say "the duplicate elements in this vector cause memory pressure". Is that because you actually had a problem where there was a memory shortage, which this patch fixes? It still seems surprising.

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented May 26, 2016

When you run the algorithm with more than 1 million sample data, with enough big eps, the vector will soon have many duplicate elements. It consumes much memory to cause the program killed if your memory is already occupied by the huge distance matrix or other variables. I am not sure if I answer your question. But it is the motivation of this patch.

@jnothman
Copy link
Copy Markdown
Member

I guess I'd like to clarify whether the big distance matrix is really the
issue there, for which this is small beans (costing a small increase in
runtime) relative to #6813.

On 26 May 2016 at 13:39, Liang-Chi Hsieh notifications@github.com wrote:

When you run the algorithm with more than 1 million sample data, with
enough big eps, the vector will soon have many duplicate elements. It
consumes much memory to cause the program killed if your memory is already
occupied by the huge distance matrix or other variables. I am not sure if I
answer your question. But it is the motivation of this patch.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#6799 (comment)

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented May 26, 2016

oh. Actually, this is also helpful even #6813 is applied. I applied #6813 but it can still fail under some cases due to the duplicate elements in the vector.

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented Jun 1, 2016

ping @jnothman any more concerns about this?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jun 1, 2016

LGTM. Usually we get two reviews, but @agramfort's was a +0, so I'll wait for another.

@jnothman jnothman changed the title [MRG] Add a set recording pushed elements in DBSCAN [MRG+1] Add a set recording pushed elements in DBSCAN Jun 1, 2016
v = neighb[i]
if labels[v] == -1:
if labels[v] == -1 and seen.count(v) == 0:
seen.insert(v)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me you can have a noise point that becomes a non-core point (in a cluster) afterwards. Is this still possible with this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change would not change previous result.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change would not change previous result.

I agree.

@agramfort
Copy link
Copy Markdown
Member

ok fair enough

please fix the docstring and +1 for merge

@viirya
Copy link
Copy Markdown
Contributor Author

viirya commented Jun 3, 2016

@jnothman @agramfort Thanks! I've updated the document.

@agramfort agramfort merged commit 95c4172 into scikit-learn:master Jun 3, 2016
@agramfort
Copy link
Copy Markdown
Member

thanks @viirya

olologin pushed a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016
)

* Add a push map recording pushed elements.

* Use set instead of map.

* Documented the p parameter.
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
)

* Add a push map recording pushed elements.

* Use set instead of map.

* Documented the p parameter.
ogrisel pushed a commit that referenced this pull request Jun 15, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants