Test and doc for n_features_in_ for sklearn.calibration #19555
ogrisel merged 15 commits into scikit-learn:main
Conversation
thomasjpfan
left a comment
CalibrationClassifier.predict_proba calls check_array but then delegates the responsibility for checking n_features_in_ to the calibrated classifier, which triggers the error message. This behavior makes sense for meta-estimators that pass all the features through to their inner estimator.
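The delegation pattern described above can be illustrated with a minimal, hypothetical sketch (these toy classes are not scikit-learn's actual implementation): the meta-estimator performs no feature-count check of its own and lets the fitted inner estimator raise the error.

```python
import numpy as np

class InnerClassifier:
    """Toy stand-in for a fitted classifier that records and
    enforces n_features_in_ (illustrative, not the real API)."""

    def fit(self, X, y):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1]
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but this classifier was "
                f"fitted with {self.n_features_in_} features.")
        # uniform probabilities, just to have a well-formed output
        n_classes = len(self.classes_)
        return np.full((X.shape[0], n_classes), 1.0 / n_classes)

class MetaClassifier:
    """Meta-estimator that delegates the n_features_in_ check to its
    inner estimator instead of re-validating X itself."""

    def __init__(self, inner):
        self.inner = inner

    def fit(self, X, y):
        self.inner.fit(X, y)
        return self

    def predict_proba(self, X):
        # no feature-count check here: the inner estimator raises
        # if the width of X is inconsistent with fit time
        return self.inner.predict_proba(X)
```

The error message the user sees therefore comes from the inner estimator, even though the call went through the meta-estimator.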
Should we add …
I think so. Is this a retrospective 0.24?
The code is simpler this way but I wonder if it's right. If the nested base estimator does not have …
Actually I found a problem with non-array inputs (e.g. list of str): those are accepted at fit time but rejected at prediction time because the test was incomplete. This problem was previously reported in #8710 (comment). I will push a small refactoring fixing this simultaneously.
On top of fixing the extended version of …
thomasjpfan
left a comment
Most of my concerns have to do with calling _validate_data in the meta-estimator. This PR essentially turns off most of the validation in check_array when calling _validate_data so that n_features_in_ can be set by the meta-estimator. I am concerned about copying when X is a list of strings, which may be an edge case.
I have a proposal on having the meta-estimator call _check_n_features directly with an array-like in one of the comments.
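The proposal above amounts to counting features on the raw array-like instead of running full validation. A hedged sketch of that idea (hypothetical helper, mirroring the intent of sklearn's private _check_n_features rather than its actual code):

```python
def check_n_features(estimator, X, reset):
    """Hypothetical sketch: record or verify the feature count of any
    2D array-like without full check_array validation, and so without
    converting or copying X."""
    if hasattr(X, "shape") and len(X.shape) == 2:
        n_features = X.shape[1]
    else:
        n_features = len(X[0])  # width of the first sample of a list-like
    if reset:
        # fit time: remember the number of features
        estimator.n_features_in_ = n_features
    elif getattr(estimator, "n_features_in_", None) is not None:
        # predict time: only check if the attribute was ever set
        if n_features != estimator.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, but the estimator was "
                f"fitted expecting {estimator.n_features_in_} features.")

class Holder:
    """Bare object standing in for an estimator, for the demo only."""
    pass

est = Holder()
check_n_features(est, [[1, 2, 3], [4, 5, 6]], reset=True)  # fit time
assert est.n_features_in_ == 3
```

Because the helper only inspects lengths, a list of strings is never copied or converted.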
Once #19633 is merged, I will update this PR accordingly.
force-pushed from 57742b6 to 24d8e90
    X, y = text_data
    clf = text_data_pipeline
    X, y = dict_data
    clf = dict_data_pipeline
@thomasjpfan note that while working on this, I discovered that _num_features(X) == 2 while DictVectorizer will probably never set n_features_in_ because it is flexible enough to accept a variable number of dict entries.
This is not a problem for this PR because I made CalibratedClassifierCV check _num_features(X) only if base_estimator defines the n_features_in_ attribute, but I wonder if it reveals that our _num_features utility is trying to be too smart.
What do you suggest as the alternative?
I opened #19740 to have _num_features error for a collection of dicts.
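The surprising behaviour with dicts can be sketched with a simplified stand-in for the private _num_features helper (hypothetical code that mimics the reported behaviour, not sklearn's implementation):

```python
import numpy as np

def num_features(X):
    """Simplified stand-in for a _num_features-style utility: report
    the width of the first sample. For a list of dicts this counts the
    keys of the first dict, even though DictVectorizer happily accepts
    a variable number of entries per sample."""
    if hasattr(X, "shape") and len(X.shape) == 2:
        return X.shape[1]
    return len(X[0])

# a list of two-key dicts is reported as having 2 features ...
assert num_features([{"a": 1, "b": 2}, {"a": 3, "b": 4}]) == 2
# ... even though a later sample could legally carry extra keys
assert num_features([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]) == 2
```

This is why checking _num_features only when the base estimator actually defines n_features_in_ sidesteps the issue for DictVectorizer-based pipelines.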
    X, y = self._validate_data(
        X, y, accept_sparse=['csc', 'csr', 'coo'],
        force_all_finite=False, allow_nd=True
    )
For consistency I would prefer to never validate the data in the meta-estimator and to use _safe_indexing in _fit_classifier_calibrator_pair instead. I am not sure if this is right or not.
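The suggestion boils down to selecting fold rows without touching the container. A hedged sketch of a _safe_indexing-style selector (hypothetical helper, not sklearn's private implementation):

```python
import numpy as np

def safe_indexing(X, indices):
    """Hypothetical sketch: pick rows from numpy arrays or plain Python
    lists without converting (or validating) the container, so that
    extracting a CV fold in the meta-estimator never alters the data."""
    if hasattr(X, "shape"):
        # numpy-style fancy indexing for arrays
        return X[np.asarray(indices)]
    # fall back to item-by-item selection for lists (e.g. list of str)
    return [X[i] for i in indices]
```

With such a helper, each classifier/calibrator pair can be fitted on its train/test split while all input validation stays inside the base estimator.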
@adrinjalali @lorentzenchr @thomasjpfan I think this is ready for a second review pass.
adrinjalali
left a comment
LGTM overall. You can ignore most comments if you don't agree with them :)
sklearn/calibration.py (outdated)

    @@ -257,7 +265,19 @@ def fit(self, X, y, sample_weight=None):
        else:
            check_is_fitted(self.base_estimator)
I wonder why this if/else is needed; it also checks the actual type, which is not what some of us like to do (looking at you @ogrisel :P)
Good remark, let me check.
I need to pass the classes_ attribute explicitly, otherwise it fails, but I find the code cleaner this way.
    def check_complex_data(name, estimator_orig):
        rng = np.random.RandomState(42)
is this change related to this PR? (I'm happy to keep it here anyway)
Yes, otherwise it would fail in this PR: validation is now delegated to the underlying estimator, which happens after the CV split of CalibratedClassifierCV. This test was failing because the default StratifiedKFold split choked on the unique values in y, which made it impossible to have balanced "classes" in the validation folds.
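The failure mode can be reproduced in isolation (illustrative data, not the actual common test): StratifiedKFold refuses to split when every y value is unique, because each "class" then has fewer members than n_splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((4, 1))
y = np.array([0, 1, 2, 3])  # all-unique labels: 1 member per "class"

try:
    # the default CV of CalibratedClassifierCV stratifies on y
    list(StratifiedKFold(n_splits=2).split(X, y))
    stratification_failed = False
except ValueError:
    # cannot place a member of each class in every fold
    stratification_failed = True
assert stratification_failed
```

Adjusting the common-test data so that y has repeated class labels avoids tripping over the stratified split.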
    Number of features seen during :term:`fit`. Only defined if the
    underlying base_estimator exposes such an attribute when fit.

    .. versionadded:: 0.24
Suggested change:

    - .. versionadded:: 0.24
    + .. versionadded:: 1.0
n_features_in_ was set in 0.24 (but we did not document it)
Since it was there, I think it's more accurate to document that it was introduced in 0.24, even if not documented.
This can be considered a fix for a bad documentation if you wish.
@thomasjpfan I addressed or answered the remaining remarks, let me know what you think.
thomasjpfan
left a comment
Minor comment, otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Towards #19333.
Fixes: #8710.
The work was already done. I just had to enable the tests and update the docstring.