[MRG+1] Add a set recording pushed elements in DBSCAN #6799
agramfort merged 3 commits into scikit-learn:master
Conversation
Is it possible to have a test? A gist to better understand the problem?
Hmm, as it is an internal variable, I think it might be difficult to test. Besides, I think this change should be straightforward.
Any performance drop?
sklearn/cluster/_dbscan_inner.pyx (outdated)

    cdef np.npy_intp i, label_num = 0, v
    cdef np.ndarray[np.npy_intp, ndim=1] neighb
    cdef vector[np.npy_intp] stack
    cdef cmap[np.npy_intp, np.npy_intp] push_map
Yea. Thanks. I will update this later.
@agramfort I actually think this is fixing an algorithm bug: we shouldn't need to visit the same point twice in this kind of search algorithm.
Benchmark with the following sample codes:

Without this patch: Time = 0.110082864761
With this patch: Time = 0.136781930923
@jnothman merge if you're happy. No strong feeling here.
@viirya can you confirm that this makes a substantial reduction to the memory usage in some reasonable cases? Can you give a sense of how much (by using
@jnothman Using the sample codes above, with 10000 sample data, I count the number of times
To clarify, in your initial description you say "the duplicate elements in this vector cause memory pressure". Is that because you actually had a problem where there was a memory shortage, which this patch fixes? It still seems surprising.
When you run the algorithm with more than 1 million sample data, with enough big
I guess I'd like to clarify whether the big distance matrix is really the

On 26 May 2016 at 13:39, Liang-Chi Hsieh notifications@github.com wrote:
ping @jnothman any more concerns about this?
LGTM. Usually we get two reviews, but @agramfort's was a +0, so I'll wait for another.
      v = neighb[i]
-     if labels[v] == -1:
+     if labels[v] == -1 and seen.count(v) == 0:
+         seen.insert(v)
To me, you can have a noise point that becomes a non-core point (in a cluster) afterwards. Is this still possible with this change?
I think this change would not change the previous result.

I agree.
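The concern above can be checked against a hypothetical pure-Python rendering of the patched inner loop (the function and variable names here are invented for this sketch and only loosely mirror the Cython file): a noise point that neighbors a core point is still absorbed into the cluster as a non-core member, because the new `seen` check only skips points already pushed for this cluster.

```python
# Hypothetical pure-Python analogue of the patched inner loop (not the
# actual Cython code). `seen` plays the role of the C++ std::set added by
# the patch: the condition `v not in seen` corresponds to
# `seen.count(v) == 0`, and `seen.add(v)` to `seen.insert(v)`.

def expand_cluster(start, is_core, neighborhoods, labels, label_num):
    stack = [start]
    seen = {start}                 # points already pushed for this cluster
    while stack:
        i = stack.pop()
        labels[i] = label_num      # noise (-1) points join the cluster here
        if not is_core[i]:
            continue               # non-core members do not expand further
        for v in neighborhoods[i]:
            # patched condition: labels[v] == -1 and seen.count(v) == 0
            if labels[v] == -1 and v not in seen:
                seen.add(v)
                stack.append(v)

# Point 3 is non-core (it would be noise on its own) but neighbors core
# point 2, so it still ends up in the cluster as a non-core point.
labels = [-1, -1, -1, -1]
is_core = [True, True, True, False]
neighborhoods = [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3], [2, 3]]
expand_cluster(0, is_core, neighborhoods, labels, label_num=0)
assert labels == [0, 0, 0, 0]
```

Since each point enters `seen` before it is pushed and is pushed at most once, the set changes only how often a point sits on the stack, not which points are reached.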
OK, fair enough. Please fix the docstring, and +1 for merge.
@jnothman @agramfort Thanks! I've updated the document.
thanks @viirya
What does this implement/fix? Explain your changes.
In DBSCAN we push the neighbors waiting to be visited onto a vector. When the data set is big and eps is large enough, there can be many neighbors, including many duplicates. In this case, the duplicate elements in this vector cause memory pressure. This patch adds a set that records pushed elements, so duplicate elements are not pushed again.
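To sketch the mechanism (with invented names, not the actual Cython code), consider the worst case where every point neighbors every other point, as happens when eps exceeds the data's diameter: without the set, each expansion step re-pushes all still-unlabeled points, so the stack accumulates O(n^2) entries; with the set, each point is pushed at most once.

```python
# Hypothetical sketch counting stack pushes in a dense-neighborhood worst
# case. With n mutually-neighboring core points, the loop without
# deduplication performs n*(n-1)/2 pushes (duplicates included), while the
# patched variant performs n - 1.

def count_pushes(n, dedup):
    neighborhoods = [list(range(n))] * n   # every point neighbors every point
    labels = [-1] * n
    stack, seen, pushes = [], {0}, 0
    i = 0
    while True:
        if labels[i] == -1:
            labels[i] = 0
            for v in neighborhoods[i]:
                # `dedup` toggles the patch's seen-set check on and off
                if labels[v] == -1 and (not dedup or v not in seen):
                    seen.add(v)
                    stack.append(v)
                    pushes += 1
        if not stack:
            break
        i = stack.pop()
    return pushes

assert count_pushes(100, dedup=False) == 100 * 99 // 2  # 4950 pushes
assert count_pushes(100, dedup=True) == 99              # each point once
```

This quadratic-versus-linear gap in stack growth is the "memory pressure" the description refers to; the benchmark numbers above suggest the set lookup costs a modest amount of extra time in exchange.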