[MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines by amueller · Pull Request #4064 · scikit-learn/scikit-learn

amueller · 2015-01-08T02:19:34Z

Also fixes the embarrassing bug where clustering algorithms can't be used in pipelines.

jnothman · 2015-01-08T02:27:46Z

Okay, I guess that's one option; the alternative is to not pass y if it's None, as is done in GridSearchCV for fit.

amueller · 2015-01-08T02:31:09Z

I am just now a bit confused. Pipeline always passes y=None to fit. Does that mean many clustering algorithms don't work with pipeline?!
Currently everything but the fit in some clustering algorithms and the score in KMeans has an optional y=None.

amueller · 2015-01-08T02:32:03Z

we should do it consistently, but I guess it is to late for that now?

amueller · 2015-01-08T02:32:27Z

I actually prefer the GridSearchCV approach, but it looks like we went the other way until now.

amueller · 2015-01-08T02:33:29Z

import numpy as np,sklearn
from sklearn.pipeline import Pipeline
from sklearn.cluster import MeanShift
X=np.random.rand(100,2)
p=Pipeline([('clf',MeanShift())])
p.fit(X)

breaks and makes me sad.

amueller · 2015-01-08T02:35:18Z

For the pipeline, it could be that previous steps in the pipeline use y so that it is not None, though. So you could do LDA, and then do clustering.

jnothman · 2015-01-08T03:13:29Z

Don't be sad. As Gaël likes to say: it's not version 1 yet, stuff's allowed
to break and change :)

I can't say I have the answer here, yet, but I think you're thinking about
it the right way. Do we really require all estimators to take 2 positional
args to fit, then?

On 8 January 2015 at 13:35, Andreas Mueller notifications@github.com
wrote:

For the pipeline, it could be that previous steps in the pipeline use y
so that it is not None, though. So you could do LDA, and then do
clustering.

—
Reply to this email directly or view it on GitHub
#4064 (comment)
.

amueller · 2015-01-08T03:15:43Z

Yeah, I think this is the only way to make pipeline work as expected. Testing y is None is not good as in my example. The other option would be to use introspection or try but that is not the approach we have taken so far.

amueller · 2015-01-20T20:43:39Z

reviews on this one would be nice as it makes testing somewhat simpler.

jnothman · 2015-01-21T00:22:23Z

sklearn/utils/estimator_checks.py

You need to add a warnings.simplefilter('always') here so that the registry doesn't catch some warnings and create heisenbugs... Better, decorate the whole test with @ignore_warnings

that is not particular to this test, though, right? and I'm not sure the whole tests should be decorated. These are more for the deprecation warnings.

I think it should be reasonable to ignore all sklearn warnings, at least, in common tests. Perhaps we should change ignore_warnings to only apply to sklearn modules.

Interestingly enough, if I put @ignore_warnings on the test (not the check function) not test is run :-/

I thought about adding a test for partial_fit, but I'm not sure about it. The pipelines need to look quite a bit different to support partial_fit, and there is no pressing reason currently to enforce taking a y in the API....

amueller · 2015-01-26T19:11:17Z

This one will simplify future testing such as #4162 and #4136 by the way....

jnothman · 2015-01-27T23:21:28Z

Firstly, doc/developers/index.rst needs fixing. Secondly, I think this change should encompass partial_fit: RBM, LSHForest, and IPCA are current offenders.

amueller · 2015-01-28T01:47:58Z

What is the benefit in partial_fit apart from making the tests easier to write?

jnothman · 2015-01-28T02:16:05Z

What is the benefit in partial_fit apart from making the tests easier to write?

Consistency with fit, for one; the polymorphic design principle that Gaël expressed elsewhere; the ability to remove an if-else clause in sklearn.learning_curve; and anticipation of meta-estimators employing partial_fit in the future or in external projects where the developer may otherwise forget the unsupervised case...

amueller · 2015-01-28T15:43:38Z

Fair enough.
Where would you put this in the developers doc? The current y argument of Transformers is not documented anywhere as far as I can see :-/

amueller · 2015-01-28T15:53:37Z

Actually, there was a comment about fit in there, something that was not tested / enforced. I added comments about transform and score but I'm not sure that is the best place.

jnothman · 2015-01-29T03:05:59Z

Oh, I didn't look at it closely enough. It actually gives an invocation
spec, not a def signature. But maybe it should:
http://scikit-learn.org/dev/developers/index.html#different-objects

On 29 January 2015 at 10:32, Andreas Mueller notifications@github.com
wrote:

Where?

—
Reply to this email directly or view it on GitHub
#4064 (comment)
.

jnothman · 2015-01-29T03:06:19Z

Obscurely, it shows 1-ary score as an example.

On 29 January 2015 at 14:05, Joel Nothman joel.nothman@sydney.edu.au
wrote:

Oh, I didn't look at it closely enough. It actually gives an invocation
spec, not a def signature. But maybe it should:
http://scikit-learn.org/dev/developers/index.html#different-objects

On 29 January 2015 at 10:32, Andreas Mueller notifications@github.com
wrote:

Where?

—
Reply to this email directly or view it on GitHub
#4064 (comment)
.

jnothman · 2015-01-29T03:07:22Z

doc/developers/index.rst

Indeed, this doc was already here. I guess I only looked at a glance at what I thought was an issue, but not really.

jnothman · 2015-01-29T03:09:04Z

Great. This LGTM apart from your test failure, which I take it will be fixed before the next reviewer gives a +1 ;)

amueller · 2015-01-29T03:33:26Z

Thanks for the review. Damn, I thought I just fixed travis. Will look at it again.

amueller · 2015-01-29T03:44:25Z

Fixed.

amueller · 2015-02-03T16:28:14Z

Any other reviews? @ogrisel maybe?

ogrisel · 2015-02-05T14:40:03Z

As the motivation of this change was to make it possible to call the clustering models in pipelines I would like to integrate a non-regression "integration" test in the spirit of #4064 (comment) in test_common. Note: clf is a bad name for such as step, model or estimator would be a better name. Or even, just use make_pipeline for that test.

ogrisel · 2015-02-05T14:40:23Z

Other than that +1 on my side.

amueller · 2015-02-05T23:27:43Z

Hum, I totally wanted to do that integration test with pipelines, seems I forgot about it. Not sure what should be tested, probably "everything". This PR has tests for the "API definition' of Pipeline, not doing the integration.
I'll try that tomorrow.

ogrisel · 2015-02-06T11:32:33Z

I was thinking of a bunch of sanity checks such as:

for est in all_classifiers():
   p = make_pipeline(est()).fit(X, y)
   e = est().fit(X, y)
   assert_equal(e.score(X, y), p.score(X, y))
   assert_array_equal(e.predict(X), p.predict(X))

and the similar clustering and regressors and maybe others as long as they implement fit and score.

amueller · 2015-02-06T11:42:36Z

I thought about which functions we want to test, I guess that this issue only exists for fit and score so your tests should be fine (plus a if hasattr(est, "score"). I thought about whether we want to test decision_function, predict_proba and predict, too.

amueller · 2015-02-06T11:46:34Z

~~at least transform should also be checked~~ not enough coffee apparently, transform never accepts y

coveralls · 2015-02-06T12:22:11Z

Coverage increased (+0.01%) to 95.01% when pulling a105327 on amueller:pipeline_scoring_y_none into a168a6c on scikit-learn:master.

amueller · 2015-02-06T12:23:06Z

added integration test, rebased. ~~And now I'm not convinced that this is the right fix any more :-/ Maybe not passing y if it is None would be the better fix? And removing the y=None everywhere?~~ Edit: this is the right fix, just not enough coffee.

I am a bit confused.
I just did
`> git grep "transform(self, X, y=None)" | grep -v "fit_transform"``
which yields

sklearn/cluster/birch.py:    def transform(self, X, y=None):
sklearn/cluster/k_means_.py:    def transform(self, X, y=None):
sklearn/decomposition/base.py:    def transform(self, X, y=None):
sklearn/decomposition/base.py:    def inverse_transform(self, X, y=None):
sklearn/decomposition/dict_learning.py:    def transform(self, X, y=None):
sklearn/decomposition/pca.py:    def transform(self, X, y=None):
sklearn/decomposition/pca.py:    def inverse_transform(self, X, y=None):
sklearn/feature_extraction/dict_vectorizer.py:    def transform(self, X, y=None):
sklearn/feature_extraction/text.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/neighbors/approximate.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/random_projection.py:    def transform(self, X, y=None):
sklearn/tests/test_pipeline.py:    def transform(self, X, y=None):

That y is never used anywhere and not required by any API, is it?

…f clustering algorithms!

amueller · 2015-02-06T12:27:47Z

Anyhow, this PR is good to go, I think. I removed the requirement on transform (which is not a requirement at all) from the developer docs.

coveralls · 2015-02-06T12:36:47Z

Coverage increased (+0.01%) to 95.01% when pulling cd63acc on amueller:pipeline_scoring_y_none into a168a6c on scikit-learn:master.

ogrisel · 2015-02-06T18:22:37Z

Great, merging this PR. Thanks again @amueller !

[MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines

amueller · 2015-02-07T07:59:30Z

Thanks for the reviews @ogrisel and @jnothman :)

jnothman · 2015-02-07T10:43:31Z

Edit: this is the right fix, just not enough coffee.

:)

Rest assured this fix is in accordance with what's been in the docs for some time...

amueller mentioned this pull request Jan 8, 2015

Pipeline score broken for unsupervised algorithms #4063

Closed

amueller changed the title ~~[MRG] Pipeline scoring y none~~ [MRG] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines Jan 9, 2015

amueller force-pushed the pipeline_scoring_y_none branch 2 times, most recently from 9f2ec74 to f89c72f Compare January 15, 2015 16:02

amueller added the Bug label Jan 16, 2015

amueller added this to the 0.16 milestone Jan 16, 2015

amueller force-pushed the pipeline_scoring_y_none branch from f89c72f to d55d272 Compare January 20, 2015 20:43

jnothman reviewed Jan 21, 2015
View reviewed changes

amueller force-pushed the pipeline_scoring_y_none branch 3 times, most recently from 27ddff9 to f02335a Compare January 22, 2015 21:17

amueller mentioned this pull request Jan 26, 2015

[WIP] TST Add test to check if estimators reset model when fit is called #4162

Closed

This was referenced Jan 27, 2015

Allow for Transformers on y #4143

Closed

[WIP] Adding tests for estimators implementing partial_fit and a few other related fixes / enhancements #3907

Closed

jnothman reviewed Jan 29, 2015
View reviewed changes

jnothman changed the title ~~[MRG] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines~~ [MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines Jan 29, 2015

amueller force-pushed the pipeline_scoring_y_none branch from e37f56a to fa184fe Compare January 29, 2015 03:43

ghost mentioned this pull request Jan 30, 2015

Signature of MDS fit #4132

Closed

amueller force-pushed the pipeline_scoring_y_none branch from cfee0a1 to a105327 Compare February 6, 2015 12:17

Add test that score takes y, fix KMeans, FIX pipeline compatibility o…

cd63acc

…f clustering algorithms!

amueller force-pushed the pipeline_scoring_y_none branch from a105327 to cd63acc Compare February 6, 2015 12:26

ogrisel added a commit that referenced this pull request Feb 6, 2015

Merge pull request #4064 from amueller/pipeline_scoring_y_none

620301b

[MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines

ogrisel merged commit 620301b into scikit-learn:master Feb 6, 2015

amueller mentioned this pull request Feb 7, 2015

[MRG+1] More robust input validation, more testing. #4136

Merged

amueller deleted the pipeline_scoring_y_none branch May 19, 2017 20:46

Uh oh!

Conversation

amueller commented Jan 8, 2015

Uh oh!

jnothman commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

jnothman commented Jan 8, 2015

Uh oh!

amueller commented Jan 8, 2015

Uh oh!

amueller commented Jan 20, 2015

Uh oh!

jnothman Jan 21, 2015

Choose a reason for hiding this comment

Uh oh!

amueller Jan 21, 2015

Choose a reason for hiding this comment

Uh oh!

jnothman Jan 21, 2015

Choose a reason for hiding this comment

Uh oh!

amueller Jan 21, 2015

Choose a reason for hiding this comment

Uh oh!

amueller Jan 21, 2015

Choose a reason for hiding this comment

Uh oh!

amueller commented Jan 26, 2015

Uh oh!

jnothman commented Jan 27, 2015

Uh oh!

amueller commented Jan 28, 2015

Uh oh!

jnothman commented Jan 28, 2015

Uh oh!

amueller commented Jan 28, 2015

Uh oh!

amueller commented Jan 28, 2015

Uh oh!

jnothman commented Jan 29, 2015

Uh oh!

jnothman commented Jan 29, 2015

Uh oh!

jnothman Jan 29, 2015

Choose a reason for hiding this comment

Uh oh!

jnothman commented Jan 29, 2015

Uh oh!

amueller commented Jan 29, 2015

Uh oh!

amueller commented Jan 29, 2015

Uh oh!

amueller commented Feb 3, 2015

Uh oh!

ogrisel commented Feb 5, 2015

Uh oh!

ogrisel commented Feb 5, 2015

Uh oh!

amueller commented Feb 5, 2015

Uh oh!

ogrisel commented Feb 6, 2015

Uh oh!

amueller commented Feb 6, 2015

Uh oh!

amueller commented Feb 6, 2015

Uh oh!

coveralls commented Feb 6, 2015

Uh oh!

amueller commented Feb 6, 2015

Uh oh!