MRG remove unreachable code from grid_search, test unsupervised setting #1210
amueller wants to merge 2 commits into scikit-learn:master
Conversation
LGTM
sklearn/grid_search.py
Outdated
I am not sure why you removed these lines. They seem useful to me.
Hm... you mean the case where the estimator doesn't have a score function but a loss function was specified?
That is true...
"Unfortunately" all estimators have a score function... I'll have another look. I'm not sure how it is possible to use the two lines below...
Maybe I should have waited for #1198... the current GridSearchCV contains some magic...
Ok, so actually, I should remove even more.
As all estimators define score, this code doesn't make any sense.
How would you use it?
It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.
On 10/14/2012 08:25 PM, Olivier Grisel wrote:
In sklearn/grid_search.py:
@@ -449,9 +442,5 @@ def fit(self, X, y):
    def score(self, X, y=None):
        if hasattr(self.best_estimator_, 'score'):
            return self.best_estimator_.score(X, y)
        if self.score_func is None:
            raise ValueError("No score function explicitly defined, "
                             "and the estimator doesn't provide one %s"
                             % self.best_estimator_)

It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.

But this is only the case if you didn't inherit from ClassifierMixin or RegressorMixin. I thought that was the minimum requirement for the API.
If this is not the case, then I can add some dummy classes to test this case.
The question is: do we really want to support it.
The question is: do we really want to support it.
On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
"unfortunately" all estimators have a score-function...
Awesome! I am surprised, but happily surprised. Even the clustering ones?
But I guess that our code should also work with estimators that we did
not design. These might not have a score method.
On Sun, Oct 14, 2012 at 12:25:50PM -0700, Olivier Grisel wrote:
It used to be the case that score was not part of the required API: if you just
implement your own fit / predict estimator you might have expected to be able
to use the grid search tools on it.
In my mind, it still is the case.
On Sun, Oct 14, 2012 at 12:44:38PM -0700, Andreas Mueller wrote:
But this is only the case if you didn't inherit from ClassifierMixin or RegressorMixin. I thought that was the minimum requirement for the API.
In my mind, it is very important not to require inheritance to use scikit-learn. The reason being that it should be possible to be scikit-learn compliant without having a dependency on it.
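That point can be illustrated with a duck-typed estimator that imports nothing from scikit-learn. This is a hypothetical sketch (the `MajorityClassifier` class and its behavior are invented), assuming the conventions grid search relies on are `fit`/`predict` plus `get_params`/`set_params` for cloning with new hyper-parameters:

```python
# Hypothetical duck-typed estimator: no scikit-learn import, no mixin
# inheritance. It follows the conventions by duck typing alone:
# fit / predict, and get_params / set_params so a searcher can clone it
# with different hyper-parameter settings.

class MajorityClassifier:
    """Predicts the most frequent label seen in fit (made-up example)."""

    def __init__(self, tie_break=0):
        self.tie_break = tie_break

    def get_params(self, deep=True):
        # Expose constructor parameters, mirroring the sklearn convention.
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        # Fall back to tie_break when y is empty; ties pick smallest label.
        self.majority_ = (max(sorted(counts), key=counts.get)
                          if counts else self.tie_break)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]
```

An object like this has no `score` method, which is exactly the kind of third-party estimator the fallback in `GridSearchCV.score` was written for.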
On 10/15/2012 06:32 AM, Gael Varoquaux wrote:
In sklearn/grid_search.py:
@@ -449,9 +442,5 @@ def fit(self, X, y):
    def score(self, X, y=None):
        if hasattr(self.best_estimator_, 'score'):
            return self.best_estimator_.score(X, y)
        if self.score_func is None:
            raise ValueError("No score function explicitly defined, "
                             "and the estimator doesn't provide one %s"
                             % self.best_estimator_)

On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
"unfortunately" all estimators have a score-function...

Awesome! I am surprised, but happily surprised. Even the clustering ones? But I guess that our code should also work with estimators that we did not design. These might not have a score method.

I can't find these comments anywhere on github... so now an unordered email reply.
Most clustering algorithms can't have a score, because they don't support predict.
Apart from the comment on the removed lines, this looks good to me.
Ok, so one more comment on the unreachable code: if we want to support that, I can add a dummy estimator that has these properties and test it. Somewhat related: I think it would be good to have a separate class to do model selection for unsupervised models, as I think people usually wouldn't use cross-validation there (most unsupervised models in sklearn don't have a […]). Wdyt @GaelVaroquaux @ogrisel?
I have no strong opinion on supporting estimators without a score function. For the unsupervised case I agree this should better be dealt with using dedicated classes. For instance clustering could be evaluated with the stability selection / consensus index method, but this requires a special kind of CV iterator with overlapping folds, and we should not render the existing GS class more complex to support this.
I don't think it makes sense to use the inertia (the default score method for KMeans) as a way to select the number of clusters. Inertia is to be minimized for an a priori fixed number of clusters: if the number of clusters increases, inertia will always decrease, hence the best model will always be n_clusters=5 in this case, whatever the data.
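The monotonicity claim can be demonstrated without scikit-learn at all. The sketch below uses made-up 1-D data and nested sets of hand-picked centers (each larger set contains the smaller one), which guarantees inertia cannot increase as centers are added, the same effect a fitted KMeans exhibits as n_clusters grows:

```python
# Hedged sketch: why raw inertia cannot select n_clusters.
# Inertia = sum over points of squared distance to the nearest center.
# Adding a center can only shrink (or keep) each point's nearest-center
# distance, so with nested center sets inertia is non-increasing in k.
# Data and centers below are invented for illustration.

def inertia(points, centers):
    return sum(min((p - c) ** 2 for c in centers) for p in points)

points = [0.0, 0.9, 1.1, 4.0, 4.2, 8.0]

# Nested center sets: each larger set extends the smaller one.
centers_by_k = {
    1: [2.0],
    2: [2.0, 6.0],
    3: [2.0, 6.0, 8.0],
}

values = [inertia(points, centers_by_k[k]) for k in (1, 2, 3)]
# Inertia only goes down as k grows, whatever the data.
assert values[0] > values[1] > values[2]
```

So a grid search maximizing (negated) inertia over n_clusters will always pick the largest candidate, which is why this test can only be a smoke test.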
I know. See the discussion with @larsmans that github doesn't show here.
This is just a smoke test.
I could remove the case of unsupervised grid-search altogether.
I don't see a better way of testing it.
I just noticed that estimators have to define […]
Sorry for the confusion. I thought it would be easy to merge some parts of #1198, but apparently that is not the case.
On Sun, Oct 14, 2012 at 01:06:58PM -0700, Andreas Mueller wrote:
Yes!
I'd rather not remove this feature, as it creates heavier requirements on […]
I agree with you that for a different model selection scheme, a separate […]
I think the API requirements for doing grid-search are quite strong already and no-one should attempt to use it without inheriting from the sklearn estimators. It relies heavily on a working […]. In my mind, to be able to grid-search, you inherit. How else?
About requiring a score function: in #1198, the requirement is to have a […]. IIRC, it was your idea to handle the score function in the estimator. I could make it such that the standard scores, like […]
Based on my experience with clustering, I would have done the […]
This significantly increases test-coverage.
Motivated by comments on my randomized search PR.
We should try to get 100% test coverage here. This is really the core of sklearn.