[MRG+1] Clustering algorithm - BIRCH by MechCoder · Pull Request #3802 · scikit-learn/scikit-learn

MechCoder · 2014-10-24T16:42:20Z

The design is similar to the Java code written here https://code.google.com/p/jbirch/
I am pretty much sure it works (If the JAVA implementation is correct, ofc), since I get the same clusters for both cases. I opened this as a Proof of Concept.

This example has been modified, http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html

When threshold is set to 3.0

When threshold is set to 1.0

TODO: A LOT

Make it as fast as possible.
Support for sparse matrices.
Extensive testing.
Make the common tests pass, for now I have added it to dont_test ;)
Narrative documentation

Awating some initial feedback!

agramfort · 2014-10-24T20:14:26Z

support for new metrics is low priority. The idea of sum of squares as sufficient stats is designed for euclidean norm.

support for sparse matrices should be straightforward. I don't see why it would not work.

GaelVaroquaux · 2014-10-24T20:45:42Z

sklearn/cluster/birch.py

Minor style remark: the name should be new_subcluster1 :).

GaelVaroquaux · 2014-10-24T21:08:20Z

OVerall, I am impressed that you managed to put this together that fast.
This is great, and this is absolutely going in the right direction.
Congratulation!

Making it fast will probably require porting part of the code to Cython.
To probably calls for profiling before starting this endeavour.

MechCoder · 2014-10-24T22:09:53Z

Thanks. It is always nice to be appreciated.

GaelVaroquaux · 2014-10-24T22:24:00Z

As a general comment, before going low level to speed things up, I think that it would help having a high-level reflection on the structure of the code:

There are two classes. Can these 2 classes be collapsed into one? Less levels of indirection make it easier to profile and optimize. It also makes simpler Cython code.

What is the best way to store the data structures? We have here lists of objects, that have attributes that store the interesting information. Can we have lists / arrays of these interesting info? (unlikely, but worth thinking).

To think about that, I would make a diagram with all the data-structures layed out, and what they contribute to the algorithm.

jnothman · 2014-10-25T14:21:20Z

Yes, partial_fit (or partial_fit_predict?) support would be wonderful, but I'm not sure that the current solution of storing labels for every training instance makes sense in a large-scale clustering approach. A fit-predict paradigm could make more sense, or could be provided as an option (and this means you don't need to track sample_indices_, but have an additional O(nlogn) predict time.
The paper describes a global clustering step in which an arbitrary clustering algorithm is applied to the identified subclusters. It seems this is not presently included. Is it possible to provide a meta-estimator or some other kind of data reduction transformation that allows subcluster centroids to then be clustered and those global-cluster labels returned from predict for the subcluster nearest each sample? (Note also that BIRCH is coming very close to some of the instance reduction techniques that have been mooted on the ML.)
Does adding __slots__ to every CFNode/CFSubcluster improve its profile?
I have thought about how to represent this, and it is not really very numpy friendly, in that I don't see a clean way for arrays to be used to represent the node space. Without tracking sample_indices_ (as above), it may be possible to have CFNode.subcluters_ as an array and hence vectorize the update step.

MechCoder · 2014-10-26T11:03:14Z

@jnothman @agramfort Thanks for all your comments. I shall give detailed response tomorrow.

MechCoder · 2014-10-27T13:12:55Z

I am responding to @jnothman 's major comment (and in a way to @GaelVaroquaux 's too) .

Yes, I did not include the global clustering step, in which the leaf sub-clusters are grouped back (according to an arbitrary clustering algorithm). In that way

In the fit method we can build the tree (and the global clusters) and in the predict method we can map the samples to the respective closest global clusters.
Do away with the tracking of sample_indices_ and collapse CFSubluster to 3 separate arrays
It also gives more sense to a partial_fit method.

However I am not sure how are we to accommodate this in the public API. Or do we just allow the user to give an option to set n_clustersand hardcode the final global clustering ourselves?

MechCoder · 2014-10-27T13:48:16Z

And I think, if we need to just euclidean metrics for now (according to @agramfort 's comment) , we can also do away with computing ss

EDIT:
It should not be too much work in any case, I can work that out once we come to a conclusion on how to handle the global clustering step.

MechCoder · 2014-10-27T22:19:01Z

@jnothman I have added the global clustering step, IRL me and @agramfort discussed about the global clustering step, right now we have hardcoded it and allowing the user to provide n_clusters. If n_clusters is None then we skip the global clustering step.
I have also put a partial_fit method to avoid the rebuilding of the CF Tree and inherited from Cluster Mixin for the fit_predict method.

MechCoder · 2014-10-27T22:22:30Z

it may be possible to have CFNode.subcluters_ as an array

I would like it to stay as it is because of the child_ param. IMO, it is now more readable.

jnothman · 2014-10-27T23:01:03Z

Sounds good. I'll look through this later. Is it worth adding the option that n_clusters=arbitrary_clusterer?

MechCoder · 2014-10-27T23:03:23Z

We had thought about it, but then how do you account for the parameters of the arbitrary_clusterer?

GaelVaroquaux · 2014-10-27T23:14:39Z

Sounds good. I'll look through this later. Is it worth adding the option that
n_clusters=arbitrary_clusterer?

👍 good idea.

GaelVaroquaux · 2014-10-27T23:14:55Z

We had thought about it, but then how do you account for the parameters of the
arbitrary_clusterer?

You give an instance, not an object.

MechCoder · 2014-10-27T23:16:24Z

You give an instance, not an object.

I see, so I guess we are in the make it fast stage? ;)

GaelVaroquaux · 2014-10-27T23:24:05Z

I see, so I guess we are in the make it fast stage? ;)

No. This is good design (I meant a "not a class", by the way, in my
previous post).

MechCoder · 2014-10-27T23:28:10Z

Yes of course, I was asking if the next step would be to profile the code, after I make that change.

MechCoder · 2014-10-28T12:15:48Z

We need some work to speed it up, This example takes 1.24s with n_clusters=3 as compared to 0.02s with MiniBatchKMeans .

MechCoder · 2014-10-28T12:23:29Z

Though it is much faster than AgglomerativeClustering with the default parameters (which takes 6s) @GaelVaroquaux Is this expected?

agramfort · 2014-10-28T14:40:59Z

yes. AgglomerativeClustering is quadratic in sample size (unless you use
connectivity)

MechCoder · 2014-12-08T12:08:00Z

@jnothman

Yes, the fact of setting threshold to zero initially, and then building a new tree based on the subclusters read from the leaves, and fitting a new tree using these subclusters as new samples (when a memory limit is set by the user), is interesting but I'm not sure how useful or practical this would be, but yes it might be worth a go later on.

Is there any literature on automatically setting the threshold?

Do you mean initially or when memory runs out. If you mean initially then we come back to Gael's comments about taking a small sample of the data, find the pairwise distances under a quantile percent. If you mean when the memory runs out, there are a number of heuristics in page 7 of the paper that sets a new threshold, based on seen data, radius of the subclusters in the tree and so on,

Sorry, it's a fairly full week

All right, thanks!

jnothman · 2014-12-09T03:54:13Z

sklearn/cluster/birch.py

I think you mean init_sq_norm_

jnothman · 2014-12-09T04:33:06Z

Once those minor things are addressed, I expect this will have my +1.

MechCoder · 2014-12-09T12:56:49Z

@jnothman I have addressed all your comments in the last commit, I have also added your name in the author's file, for your help in reviewing this

1. Made radius a property. 2. Added test for compute_label. 3. Minor changes to doc and split_subcluster. 4. n_cluster -> clusterer.

MechCoder · 2014-12-09T15:39:14Z

It was a problem with random_state, I moved the test to common_tests

MechCoder · 2014-12-11T20:00:15Z

@jnothman I can haz merge?

jnothman · 2014-12-11T22:07:14Z

I think so! Well done!

agramfort · 2014-12-11T22:11:26Z

Congrats !

MechCoder · 2014-12-11T22:12:45Z

@jnothman @agramfort Thanks for bearing with my impatience. Now is just the simple matter of rewriting it in Cython!

jnothman · 2014-12-11T22:48:27Z

Now is just the simple matter of rewriting it in Cython!

Lol. No, the tricky part is writing as little of it as possible in Python
while keeping it clean and fast.

On 12 December 2014 at 09:11, Alexandre Gramfort notifications@github.com
wrote:

Congrats !

—
Reply to this email directly or view it on GitHub
#3802 (comment)
.

GaelVaroquaux · 2014-12-12T05:56:19Z

Congrats !

Hurray!

GaelVaroquaux reviewed Oct 24, 2014
View reviewed changes

sklearn/cluster/birch.py Outdated

Copy link
Copy Markdown

Member

GaelVaroquaux Oct 24, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style remark: the name should be new_subcluster1 :).

MechCoder force-pushed the birch branch from e1b9b8d to d051331 Compare October 27, 2014 22:09

MechCoder force-pushed the birch branch from d051331 to df32469 Compare October 27, 2014 22:42

MechCoder changed the title ~~[WIP] Clustering algorithm - BIRCH (Proof of Concept)~~ [WIP] Clustering algorithm - BIRCH Oct 27, 2014

MechCoder mentioned this pull request Oct 28, 2014

Prevent repeated checking and typecasting in check_pairwise #3807

Closed

jnothman reviewed Dec 9, 2014
View reviewed changes

sklearn/cluster/birch.py Outdated

Copy link
Copy Markdown

Member

jnothman Dec 9, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean init_sq_norm_

MechCoder added 2 commits December 9, 2014 14:48

Made the following changes

b9474d0

1. Made radius a property. 2. Added test for compute_label. 3. Minor changes to doc and split_subcluster. 4. n_cluster -> clusterer.

MAINT: Moved compute_label test to common tests

f86622d

MechCoder mentioned this pull request Dec 12, 2014

[MRG] ENH: Use check_X_y in pairwise_distances_argmin #3962

Merged

Uh oh!

Conversation

MechCoder commented Oct 24, 2014

Uh oh!

agramfort commented Oct 24, 2014

Uh oh!

GaelVaroquaux Oct 24, 2014

Choose a reason for hiding this comment

Uh oh!

GaelVaroquaux commented Oct 24, 2014

Uh oh!

MechCoder commented Oct 24, 2014

Uh oh!

GaelVaroquaux commented Oct 24, 2014

Uh oh!

jnothman commented Oct 25, 2014

Uh oh!

MechCoder commented Oct 26, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

jnothman commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

GaelVaroquaux commented Oct 27, 2014

Uh oh!

GaelVaroquaux commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

GaelVaroquaux commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 27, 2014

Uh oh!

MechCoder commented Oct 28, 2014

Uh oh!

MechCoder commented Oct 28, 2014

Uh oh!

agramfort commented Oct 28, 2014

Uh oh!

MechCoder commented Dec 8, 2014

Uh oh!

jnothman Dec 9, 2014

Choose a reason for hiding this comment

Uh oh!

jnothman commented Dec 9, 2014

Uh oh!

MechCoder commented Dec 9, 2014

Uh oh!

MechCoder commented Dec 9, 2014

Uh oh!

MechCoder commented Dec 11, 2014

Uh oh!

jnothman commented Dec 11, 2014

Uh oh!

agramfort commented Dec 11, 2014

Uh oh!

MechCoder commented Dec 11, 2014

Uh oh!

jnothman commented Dec 11, 2014

Uh oh!

GaelVaroquaux commented Dec 12, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants