MRG remove unreachable code from grid_search, test unsupervised setting #1210
amueller wants to merge 2 commits into scikit-learn:master
Conversation
LGTM
sklearn/grid_search.py
Outdated
I am not sure why you removed these lines. They seem useful to me.
Hm... you mean the case where the estimator doesn't have a score function but a loss function was specified?
That is true...
"Unfortunately" all estimators have a score function... I'll have another look. I'm not sure how it is possible to use the two lines below...
Maybe I should have waited for #1198... the current GridSearchCV contains some magic...
Ok, so actually, I should remove even more.
As all estimators define score, this code doesn't make any sense.
How would you use it?
It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.
On 10/14/2012 08:25 PM, Olivier Grisel wrote:
In sklearn/grid_search.py:
@@ -449,9 +442,5 @@ def fit(self, X, y):
    def score(self, X, y=None):
        if hasattr(self.best_estimator_, 'score'):
            return self.best_estimator_.score(X, y)
        if self.score_func is None:
            raise ValueError("No score function explicitly defined, "
                             "and the estimator doesn't provide one %s"
                             % self.best_estimator_)

It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.

But this is only the case if you didn't inherit from ClassifierMixin or RegressorMixin. I thought that was the minimum requirement for the API.
If this is not the case, then I can add some dummy classes to test this case.
The question is: do we really want to support it.
The question is: do we really want to support it.
On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
"unfortunately" all estimators have a score-function...
Awesome! I am surprised, but happily surprised. Even the clustering ones?
But I guess that our code should also work with estimators that we did
not design. These might not have a score method.
On Sun, Oct 14, 2012 at 12:25:50PM -0700, Olivier Grisel wrote:
It used to be the case that score was not part of the required API: if you just
implement your own fit / predict estimator you might have expected to be able
to use the grid search tools on it.
In my mind, it still is the case.
On Sun, Oct 14, 2012 at 12:44:38PM -0700, Andreas Mueller wrote:
But this is only the case if you didn't inherit from ClassifierMixin or RegressorMixin. I thought that was the minimum requirement for the API.
In my mind, it is very important not to require inheritance to use scikit-learn. The reason being that it should be possible to be scikit-learn compliant without having a dependency on it.
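That point can be illustrated with a duck-typed estimator that imports nothing from scikit-learn. This is a hypothetical sketch (the `MajorityClassifier` class and its behavior are invented), assuming the conventions grid search relies on are `fit`/`predict` plus `get_params`/`set_params` for cloning with new hyper-parameters:

```python
# Hypothetical duck-typed estimator: no scikit-learn import, no mixin
# inheritance. It follows the conventions by duck typing alone:
# fit / predict, and get_params / set_params so a searcher can clone it
# with different hyper-parameter settings.

class MajorityClassifier:
    """Predicts the most frequent label seen in fit (made-up example)."""

    def __init__(self, tie_break=0):
        self.tie_break = tie_break

    def get_params(self, deep=True):
        # Expose constructor parameters, mirroring the sklearn convention.
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        # Fall back to tie_break when y is empty; ties pick smallest label.
        self.majority_ = (max(sorted(counts), key=counts.get)
                          if counts else self.tie_break)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]
```

An object like this has no `score` method, which is exactly the kind of third-party estimator the fallback in `GridSearchCV.score` was written for.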
On 10/15/2012 06:32 AM, Gael Varoquaux wrote:
In sklearn/grid_search.py:
@@ -449,9 +442,5 @@ def fit(self, X, y):
    def score(self, X, y=None):
        if hasattr(self.best_estimator_, 'score'):
            return self.best_estimator_.score(X, y)
        if self.score_func is None:
            raise ValueError("No score function explicitly defined, "
                             "and the estimator doesn't provide one %s"
                             % self.best_estimator_)

On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
"unfortunately" all estimators have a score-function...

Awesome! I am surprised, but happily surprised. Even the clustering ones? But I guess that our code should also work with estimators that we did not design. These might not have a score method.

I can't find these comments anywhere on github... so now an unordered email reply.
Most clustering algorithms can't have a score, because they don't support predict.
Apart from the comment on the removed lines, this looks good to me.
Ok, so one more comment on the unreachable code: if we want to support that, I can add a dummy estimator that has these properties and test it. Somewhat related: I think it would be good to have a separate class to do model selection for unsupervised models, as I think people usually wouldn't use cross-validation there (most unsupervised models in sklearn don't have a […]). Wdyt @GaelVaroquaux @ogrisel?
I have no strong opinion on supporting estimators without a score function. For the unsupervised case I agree this should better be dealt with using dedicated classes. For instance clustering could be evaluated with the stability selection / consensus index method, but this requires a special kind of CV iterator with overlapping folds, and we should not render the existing GS class more complex to support this.
I don't think it makes sense to use the inertia (the default score method for KMeans) as a way to select the number of clusters. Inertia is to be minimized for an a priori fixed number of clusters: if the number of clusters increases, inertia will always decrease, hence the best model will always be n_clusters=5 in this case, whatever the data.
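The monotonicity claim can be demonstrated without scikit-learn at all. The sketch below uses made-up 1-D data and nested sets of hand-picked centers (each larger set contains the smaller one), which guarantees inertia cannot increase as centers are added, the same effect a fitted KMeans exhibits as n_clusters grows:

```python
# Hedged sketch: why raw inertia cannot select n_clusters.
# Inertia = sum over points of squared distance to the nearest center.
# Adding a center can only shrink (or keep) each point's nearest-center
# distance, so with nested center sets inertia is non-increasing in k.
# Data and centers below are invented for illustration.

def inertia(points, centers):
    return sum(min((p - c) ** 2 for c in centers) for p in points)

points = [0.0, 0.9, 1.1, 4.0, 4.2, 8.0]

# Nested center sets: each larger set extends the smaller one.
centers_by_k = {
    1: [2.0],
    2: [2.0, 6.0],
    3: [2.0, 6.0, 8.0],
}

values = [inertia(points, centers_by_k[k]) for k in (1, 2, 3)]
# Inertia only goes down as k grows, whatever the data.
assert values[0] > values[1] > values[2]
```

So a grid search maximizing (negated) inertia over n_clusters will always pick the largest candidate, which is why this test can only be a smoke test.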
I know. See the discussion with @larsmans that github doesn't show here.
This is just a smoke test.
I could remove the case of unsupervised grid-search altogether.
I don't see a better way of testing it.
I just noticed that estimators have to define […]
Sorry for the confusion. I thought it would be easy to merge some parts of #1198, but apparently that is not the case.
On Sun, Oct 14, 2012 at 01:06:58PM -0700, Andreas Mueller wrote:
Yes!
I'd rather not remove this feature, as it creates heavier requirements on […]
I agree with you that for a different model selection scheme, a separate […]
I think the API requirements for doing grid-search are quite strong already and no-one should attempt to use it without inheriting from the sklearn estimators. It relies heavily on a working […]. In my mind, to be able to grid-search, you inherit. How else?
About requiring a score function: in #1198, the requirement is to have a […]. IIRC, it was your idea to handle the score function in the estimator. I could make it such that the standard scores, like […]
Based on my experience with clustering, I would have done the […]
This significantly increases test-coverage.
Motivated by comments on my randomized search PR.
We should try to get 100% test coverage here. This is really the core of sklearn.