[MRG] Add an example of inductive clustering #10852
jnothman merged 6 commits into scikit-learn:master
Conversation
ping @jnothman
qinhanmin2014 left a comment:
Please try to make Circle CI green first (you might refer to the Circle CI log and other examples).
    """
    ==============================================
    Inductive Clustering
    ==============================================
A blank line is missing here, I guess?

Yep, that might be it.
Thanks for this example.
I just have some nitpicks, especially concerning the bad plot rendering.
Once the nitpicks are addressed, I am +1 to merge this.
        self.classifier_.fit(X, y)
        return self

    @if_delegate_has_method(delegate='classifier')
Should be with an underscore: `@if_delegate_has_method(delegate='classifier_')`.
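For context on why the underscore matters: `classifier_` (with the trailing underscore) is only created during `fit`, and `if_delegate_has_method` exposes the delegated method only once that fitted attribute exists. Here is a plain-Python sketch of that idea, with a toy majority-vote classifier standing in for the real decorator machinery; all names below are illustrative, not from the PR:

```python
import copy


class MajorityClassifier:
    """Toy stand-in for a real classifier: always predicts the majority label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)


class InductiveClusterer:
    def __init__(self, classifier):
        self.classifier = classifier  # unfitted template: no underscore

    def fit(self, X, y):
        # trailing underscore marks an attribute created during fit
        self.classifier_ = copy.deepcopy(self.classifier).fit(X, y)
        return self

    def predict(self, X):
        # delegates to the *fitted* `classifier_`; before fit() is called
        # this attribute does not exist, which is exactly the condition
        # the decorator is meant to check
        return self.classifier_.predict(X)


model = InductiveClusterer(MajorityClassifier())
print(hasattr(model, "classifier_"))  # False before fit
model.fit([[0], [1], [2]], [0, 0, 1])
print(model.predict([[5], [6]]))      # [0, 0]
```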
    def predict(self, X):
        return self.classifier_.predict(X)

    @if_delegate_has_method(delegate='classifier')
        return self.classifier_.decision_function(X)


    def plot_scatter(X, color, alpha=0.5):
That was on purpose. Shouldn't we have two blank lines between two functions, as per PEP 8?
    plt.subplot(133)
    plot_scatter(X, cluster_labels)
    plot_scatter(X_new, probable_clusters)
    plt.title("Inductive inference on cluster membership \n"
The title is too long for the plot.
Please make sure the final plot is readable with the chosen figure size.
I shortened the titles.
    clusterer = AgglomerativeClustering(n_clusters=3)
    cluster_labels = clusterer.fit_predict(X)

    plt.subplot(131)
Please add `plt.figure(figsize=(12, 4))` before this line to specify a figure size (and make sure the size is good).
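A minimal sketch of that suggestion, using a non-interactive backend so it runs headless; the panel titles are placeholders, not the example's actual titles:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for this sketch only
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))  # wide enough for three side-by-side panels
for i in range(1, 4):
    plt.subplot(1, 3, i)
    plt.title("Panel %d" % i)  # placeholder titles

print(plt.gcf().get_size_inches())
```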
    # Declare the inductive learning model that it will be used to
    # predict cluster membership for unknown instances
    classifier = RandomForestClassifier(random_state=RANDOM_STATE)
    inductiveLearner = InductiveClusterer(clusterer, classifier).fit(X)
Please use underscores, not camelCase, for local variables.
    # Generate new samples and plot them along with the original dataset
    X_new, y_new = make_blobs(n_samples=10,
Hmm... Should we be drawing samples from a completely different distribution, rather than drawing a test set from the same generation procedure (or even real-world data)?
I don't have a strong opinion about that. I think the intention of the example is clearly conveyed. If you want something more specific, I am all ears.
    Clustering is expensive, especially when our dataset contains millions of
    datapoints. Recomputing the clusters everytime we receive some new data
    is thus in many cases, intractable. With more data, there is also the
    possibility of degrading the previous clustering.
I think there is less of an issue with degrading than with identifying the clusters across two clusterings.

For that reason and others, this kind of technique is interesting regardless of the size of the dataset. An algorithm like agglomerative clustering or DBSCAN makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below.
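To make the decision-boundary suggestion concrete, here is a sketch of the usual meshgrid recipe: evaluate the classifier on a dense grid and reshape the labels so they can be shaded with `contourf`. The nearest-centroid `predict`, the centroid values, and the grid range below are made up for illustration, standing in for a real fitted classifier:

```python
import numpy as np

# hypothetical stand-in for a fitted classifier: nearest of two centroids
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])


def predict(points):
    # label each point by the index of its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


# evaluate the classifier over a dense grid spanning the data range
xx, yy = np.meshgrid(np.linspace(-2.0, 6.0, 200),
                     np.linspace(-2.0, 6.0, 200))
Z = predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# shading the regions is then a single call, e.g.:
# plt.contourf(xx, yy, Z, alpha=0.3)
print(Z.shape)  # (200, 200)
```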
> For that reason and others, this kind of technique is interesting regardless of the size of the dataset.

@jnothman I agree. I kept the docstring from the original PR.

> An algorithm like agglomerative clustering or dbscan makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below

I agree again. I have plotted the decision regions in the third plot. What do you think?
The decision regions are helpful. Please also update the description to better match real use cases.
Broadly, I like this example; I just wish it were clearer on the inferential value of such an approach, not merely its application to new data.
I've decided this is a nice example of both the technique and meta-estimator design, and would like to merge it.
@jnothman great!
Thanks @chkoar! Nice to clear out some cobwebs.
This reverts commit 534090c.
Reference Issues/PRs
Resolves #4587. Continues #6478.
What does this implement/fix? Explain your changes.
This PR adds an example showing how to implement and perform inductive inference on cluster membership, using a classifier trained on cluster labels.
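For readers landing here, the pattern the example implements can be sketched roughly as below. This is a simplified reconstruction, not the merged example itself: it omits the `if_delegate_has_method` decorator and input validation, and the dataset parameters and variable names are illustrative.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier


class InductiveClusterer(BaseEstimator):
    """Fit a clusterer, then train a classifier on its labels so that
    cluster membership can be predicted for unseen samples."""

    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        labels = self.clusterer_.fit_predict(X)  # unsupervised step
        self.classifier_.fit(X, labels)          # inductive step
        return self

    def predict(self, X):
        return self.classifier_.predict(X)


X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
model = InductiveClusterer(AgglomerativeClustering(n_clusters=3),
                           RandomForestClassifier(random_state=42)).fit(X)

# new samples never seen by the clusterer still get a cluster assignment
X_new, _ = make_blobs(n_samples=10, centers=3, random_state=7)
print(model.predict(X_new))  # ten labels, each in {0, 1, 2}
```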