Test and doc for n_features_in_ for sklearn.calibration #19555
ogrisel merged 15 commits into scikit-learn:main
Conversation
thomasjpfan
left a comment
CalibrationClassifier.predict_proba calls check_array but then delegates the responsibility for checking n_features_in_ to the calibrated classifier, which triggers the error message. This behavior makes sense for meta-estimators that pass all the features through to their inner estimator.
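The delegation pattern described above can be illustrated with a minimal, hypothetical sketch (these toy classes are not scikit-learn's actual implementation): the meta-estimator performs no feature-count check of its own and lets the fitted inner estimator raise the error.

```python
import numpy as np

class InnerClassifier:
    """Toy stand-in for a fitted classifier that records and
    enforces n_features_in_ (illustrative, not the real API)."""

    def fit(self, X, y):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1]
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but this classifier was "
                f"fitted with {self.n_features_in_} features.")
        # uniform probabilities, just to have a well-formed output
        n_classes = len(self.classes_)
        return np.full((X.shape[0], n_classes), 1.0 / n_classes)

class MetaClassifier:
    """Meta-estimator that delegates the n_features_in_ check to its
    inner estimator instead of re-validating X itself."""

    def __init__(self, inner):
        self.inner = inner

    def fit(self, X, y):
        self.inner.fit(X, y)
        return self

    def predict_proba(self, X):
        # no feature-count check here: the inner estimator raises
        # if the width of X is inconsistent with fit time
        return self.inner.predict_proba(X)
```

The error message the user sees therefore comes from the inner estimator, even though the call went through the meta-estimator.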
Should we add …
I think so. Is this a retrospective 0.24?
The code is simpler this way but I wonder if it's right. If the nested base estimator does not have …
Actually I found a problem with non-array inputs (e.g. list of str): those are accepted at fit time but rejected at prediction time because the test was incomplete. This problem was previously reported in #8710 (comment). I will push a small refactoring fixing this simultaneously.
On top of fixing the extended version of …
thomasjpfan
left a comment
Most of my concerns have to do with calling _validate_data in the meta-estimator. This PR essentially turns off most of the validation in check_array when calling _validate_data so that n_features_in_ can be set by the meta-estimator. I am concerned about copying when X is a list of strings, which may be an edge case.
I have a proposal on having the meta-estimator call _check_n_features directly with an array-like in one of the comments.
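The proposal above amounts to counting features on the raw array-like instead of running full validation. A hedged sketch of that idea (hypothetical helper, mirroring the intent of sklearn's private _check_n_features rather than its actual code):

```python
def check_n_features(estimator, X, reset):
    """Hypothetical sketch: record or verify the feature count of any
    2D array-like without full check_array validation, and so without
    converting or copying X."""
    if hasattr(X, "shape") and len(X.shape) == 2:
        n_features = X.shape[1]
    else:
        n_features = len(X[0])  # width of the first sample of a list-like
    if reset:
        # fit time: remember the number of features
        estimator.n_features_in_ = n_features
    elif getattr(estimator, "n_features_in_", None) is not None:
        # predict time: only check if the attribute was ever set
        if n_features != estimator.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, but the estimator was "
                f"fitted expecting {estimator.n_features_in_} features.")

class Holder:
    """Bare object standing in for an estimator, for the demo only."""
    pass

est = Holder()
check_n_features(est, [[1, 2, 3], [4, 5, 6]], reset=True)  # fit time
assert est.n_features_in_ == 3
```

Because the helper only inspects lengths, a list of strings is never copied or converted.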
Once #19633 is merged, I will update this PR accordingly.
force-pushed from 57742b6 to 24d8e90
    X, y = text_data
    clf = text_data_pipeline
    X, y = dict_data
    clf = dict_data_pipeline
@thomasjpfan note that while working on this, I discovered that _num_features(X) == 2 while DictVectorizer will probably never set n_features_in_ because it is flexible enough to accept a variable number of dict entries.
This is not a problem for this PR because I made CalibratedClassifierCV check _num_features(X) only if base_estimator defines the n_features_in_ attribute, but I wonder if it reveals that our _num_features utility is trying to be too smart.
What do you suggest as the alternative?
I opened #19740 to have _num_features error for a collection of dicts.
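The surprising behaviour with dicts can be sketched with a simplified stand-in for the private _num_features helper (hypothetical code that mimics the reported behaviour, not sklearn's implementation):

```python
import numpy as np

def num_features(X):
    """Simplified stand-in for a _num_features-style utility: report
    the width of the first sample. For a list of dicts this counts the
    keys of the first dict, even though DictVectorizer happily accepts
    a variable number of entries per sample."""
    if hasattr(X, "shape") and len(X.shape) == 2:
        return X.shape[1]
    return len(X[0])

# a list of two-key dicts is reported as having 2 features ...
assert num_features([{"a": 1, "b": 2}, {"a": 3, "b": 4}]) == 2
# ... even though a later sample could legally carry extra keys
assert num_features([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]) == 2
```

This is why checking _num_features only when the base estimator actually defines n_features_in_ sidesteps the issue for DictVectorizer-based pipelines.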
    X, y = self._validate_data(
        X, y, accept_sparse=['csc', 'csr', 'coo'],
        force_all_finite=False, allow_nd=True
    )
For consistency I would prefer to never validate the data in the meta-estimator and to use _safe_indexing in _fit_classifier_calibrator_pair instead. I am not sure if this is right or not.
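The suggestion boils down to selecting fold rows without touching the container. A hedged sketch of a _safe_indexing-style selector (hypothetical helper, not sklearn's private implementation):

```python
import numpy as np

def safe_indexing(X, indices):
    """Hypothetical sketch: pick rows from numpy arrays or plain Python
    lists without converting (or validating) the container, so that
    extracting a CV fold in the meta-estimator never alters the data."""
    if hasattr(X, "shape"):
        # numpy-style fancy indexing for arrays
        return X[np.asarray(indices)]
    # fall back to item-by-item selection for lists (e.g. list of str)
    return [X[i] for i in indices]
```

With such a helper, each classifier/calibrator pair can be fitted on its train/test split while all input validation stays inside the base estimator.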
@adrinjalali @lorentzenchr @thomasjpfan I think this is ready for a second review pass.
adrinjalali
left a comment
LGTM overall. You can ignore most comments if you don't agree with them :)
sklearn/calibration.py (outdated)

    @@ -257,7 +265,19 @@ def fit(self, X, y, sample_weight=None):
        else:
            check_is_fitted(self.base_estimator)
I wonder why this if/else is needed; it also checks the actual type, which is not what some of us like to do (looking at you @ogrisel :P)
Good remark, let me check.
I need to pass the classes_ attribute explicitly, otherwise it fails, but I find the code cleaner this way.
    def check_complex_data(name, estimator_orig):
        rng = np.random.RandomState(42)
is this change related to this PR? (I'm happy to keep it here anyway)
Yes, otherwise it would fail in this PR: validation is now delegated to the underlying estimator, which happens after the CV split of CalibratedClassifierCV. This test was failing because the default StratifiedKFold split choked on the unique values in y, which made it impossible to have balanced "classes" in the validation folds.
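The failure mode can be reproduced in isolation (illustrative data, not the actual common test): StratifiedKFold refuses to split when every y value is unique, because each "class" then has fewer members than n_splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((4, 1))
y = np.array([0, 1, 2, 3])  # all-unique labels: 1 member per "class"

try:
    # the default CV of CalibratedClassifierCV stratifies on y
    list(StratifiedKFold(n_splits=2).split(X, y))
    stratification_failed = False
except ValueError:
    # cannot place a member of each class in every fold
    stratification_failed = True
assert stratification_failed
```

Adjusting the common-test data so that y has repeated class labels avoids tripping over the stratified split.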
    Number of features seen during :term:`fit`. Only defined if the
    underlying base_estimator exposes such an attribute when fit.

    .. versionadded:: 0.24
Suggested change:

    - .. versionadded:: 0.24
    + .. versionadded:: 1.0
n_features_in_ was set in 0.24 (but we did not document it)
Since it was there, I think it's more accurate to document that it was introduced in 0.24, even if not documented.
This can be considered a fix for a bad documentation if you wish.
@thomasjpfan I addressed or answered the remaining remarks, let me know what you think.
thomasjpfan
left a comment
Minor comment, otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Towards #19333.
Fixes: #8710.
The work was already done. I just had to enable the tests and update the docstring.