[MRG+1] Clustering algorithm - BIRCH #3802
Conversation
|
Support for new metrics is low priority: the idea of sums of squares as sufficient statistics is designed for the Euclidean norm. Support for sparse matrices should be straightforward; I don't see why it would not work. |
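A minimal sketch of why sums of squares act as sufficient statistics under the Euclidean norm (illustrative code, not the PR's implementation): a subcluster's centroid and radius can be recovered from just the point count, the linear sum, and the squared sum, so raw points never need to be stored.

```python
import numpy as np

def centroid_and_radius(n, linear_sum, squared_sum):
    """Recover centroid and radius from BIRCH-style sufficient statistics.

    radius^2 = mean squared distance to centroid = SS/n - ||c||^2,
    an identity that only holds for the Euclidean norm.
    """
    centroid = linear_sum / n
    sq_radius = squared_sum / n - np.dot(centroid, centroid)
    return centroid, np.sqrt(max(sq_radius, 0.0))

points = np.array([[0.0, 0.0], [2.0, 0.0]])
n = len(points)
ls = points.sum(axis=0)        # linear sum of the points
ss = (points ** 2).sum()       # sum of squared norms
c, r = centroid_and_radius(n, ls, ss)
# centroid is [1, 0]; each point is at distance 1 from it, so radius is 1.0
```

Merging two subclusters is then just adding their (n, LS, SS) triples, which is what makes the tree updates cheap.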
sklearn/cluster/birch.py
Minor style remark: the name should be new_subcluster1 :).
|
Overall, I am impressed that you managed to put this together that fast. Making it fast will probably require porting part of the code to Cython. |
|
Thanks. It is always nice to be appreciated. |
|
As a general comment, before going low level to speed things up, I think it would help to have a high-level reflection on the structure of the code:

There are two classes. Can these two classes be collapsed into one? Fewer levels of indirection make it easier to profile and optimize, and also make for simpler Cython code.

What is the best way to store the data structures? We have here lists of objects, with attributes that store the interesting information. Can we have lists / arrays of that information directly? (Unlikely, but worth thinking about.)

To think about that, I would make a diagram with all the data structures laid out, and what they contribute to the algorithm. |
|
|
@jnothman @agramfort Thanks for all your comments. I shall give detailed response tomorrow. |
|
I am responding to @jnothman's major comment (and in a way to @GaelVaroquaux's too). Yes, I did not include the global clustering step, in which the leaf sub-clusters are grouped back (according to an arbitrary clustering algorithm). In that way
However, I am not sure how we are to accommodate this in the public API. Or do we just allow the user to give an option to set |
|
And I think, if we need to just EDIT: |
|
@jnothman I have added the global clustering step. IRL, @agramfort and I discussed the global clustering step; right now we have hardcoded it and are allowing the user to provide |
I would like it to stay as it is because of the |
|
Sounds good. I'll look through this later. Is it worth adding the option that |
|
We had thought about it, but then how do you account for the parameters of the arbitrary_clusterer? |
👍 good idea. |
You give an instance, not a class. |
I see, so I guess we are in the |
No. This is good design (I meant a "not a class", by the way, in my |
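The "pass an instance, not a class" design settles the parameter question raised above: the global clusterer's parameters travel with the instance itself. A sketch of that usage, assuming the API that eventually landed in scikit-learn, where `n_clusters` accepts either an int or an estimator instance (data here is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

# Two well-separated blobs of toy data.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

# A configured *instance* is passed in, so its parameters
# (n_clusters=2, linkage, etc.) need no extra plumbing in Birch.
brc = Birch(threshold=0.5,
            n_clusters=AgglomerativeClustering(n_clusters=2))
labels = brc.fit_predict(X)
```

Passing a class instead would force Birch to forward constructor arguments, reinventing what the estimator API already provides.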
|
Yes of course, I was asking if the next step would be to profile the code, after I make that change. |
|
We need some work to speed it up. This example takes 1.24s with |
|
Though it is much faster than AgglomerativeClustering with the default parameters (which takes 6s). @GaelVaroquaux Is this expected? |
|
Yes. AgglomerativeClustering is quadratic in sample size (unless you use |
|
Yes, the idea of setting the threshold to zero initially, and then, when a user-set memory limit is hit, building a new tree by fitting the subclusters read from the leaves as new samples, is interesting. I'm not sure how useful or practical this would be, but it might be worth a go later on.
Do you mean initially, or when memory runs out? If you mean initially, then we come back to Gael's comments about taking a small sample of the data and finding the pairwise distances under a quantile percent. If you mean when the memory runs out, there are a number of heuristics on page 7 of the paper that set a new threshold based on the data seen so far, the radii of the subclusters in the tree, and so on.
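A toy sketch of the rebuild idea being discussed (not the PR's code; thresholds and names are illustrative, and this simplification drops the per-subcluster counts that the paper's rebuild heuristics keep): fit a tree with a tight threshold, then refit its leaf centroids as samples under a looser one.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = rng.randn(500, 2)

# Tight threshold: many small leaf subclusters (memory-hungry).
fine = Birch(threshold=0.1, n_clusters=None).fit(X)
leaf_centroids = fine.subcluster_centers_

# Rebuild: treat the leaf centroids as new samples under a looser
# threshold, yielding a smaller tree without revisiting the raw data.
coarse = Birch(threshold=0.5, n_clusters=None).fit(leaf_centroids)
```

A faithful rebuild would feed each centroid in with its (n, LS, SS) weight rather than as a unit-weight sample.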
All right, thanks! |
sklearn/cluster/birch.py
|
Once those minor things are addressed, I expect this will have my +1. |
|
@jnothman I have addressed all your comments in the last commit. I have also added your name to the authors file, for your help in reviewing this. |
1. Made radius a property.
2. Added a test for compute_label.
3. Minor changes to the doc and split_subcluster.
4. Renamed n_cluster to clusterer.
|
It was a problem with |
|
@jnothman I can haz merge? |
|
I think so! Well done! |
|
Congrats ! |
|
@jnothman @agramfort Thanks for bearing with my impatience. Now it is just the simple matter of rewriting it in Cython! |
Lol. No, the tricky part is writing as little of it as possible in Python.
|
Hurray! |
Fixes #2690
The design is similar to the Java code written here https://code.google.com/p/jbirch/
I am pretty sure it works (if the Java implementation is correct, ofc), since I get the same clusters in both cases. I opened this as a Proof of Concept.
This example has been modified: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html

When threshold is set to 3.0: (figure)

When threshold is set to 1.0: (figure)

TODO: A LOT

dont_test ;) Awaiting some initial feedback!
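The two figures contrast the tree at a loose and a tight threshold. A sketch of that comparison (illustrative data, not the modified example itself): a smaller threshold admits fewer points per subcluster, so the tree ends up with more, finer subclusters.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(42)
X = rng.randn(300, 2) * 2  # toy blob, std ~2 per axis

# Loose threshold: few, fat subclusters.
coarse_birch = Birch(threshold=3.0, n_clusters=None).fit(X)
# Tight threshold: many, small subclusters.
fine_birch = Birch(threshold=1.0, n_clusters=None).fit(X)

n_coarse = coarse_birch.subcluster_centers_.shape[0]
n_fine = fine_birch.subcluster_centers_.shape[0]
```

`n_clusters=None` skips the global step, so the counts reflect the raw leaf subclusters that the figures visualise.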