ENH Adds _num_features for array-likes #19633
Conversation
CC @ogrisel This PR implements the idea from #19555 (comment)
ogrisel
left a comment
Thanks for taking care of this @thomasjpfan. Here are some comments. Let me know what you think.
It would be great to improve the tests to check that the extended exception messages work as expected.
sklearn/utils/validation.py
if len(X.shape) <= 1:
    raise TypeError(message)
if isinstance(X.shape[1], numbers.Integral):
    return X.shape[1]
Shall we raise ValueError in the else branch of this condition? I am not even sure how we could meaningfully trigger this in tests or in a legit usage scenario.
Maybe we should always return X.shape[1] without the if isinstance(X.shape[1], numbers.Integral) condition.
I believe this code was inspired by the _num_samples case for .shape[0], where the first dimension is not always known: with dask dataframes you can get a delayed first dimension:

>>> df.shape
(Delayed('int-1d60df59-017f-49c3-80e5-f313b06e4c1d'), 120)

Furthermore, in this case _num_samples would fall back to calling len(df), which would naturally trigger the delayed computation.
For the _num_features case, we would fall back to doing len(df[0]). In the case of a dask dataframe, df[0] would be a Series holding the first column, and therefore len(df[0]) would return the number of samples instead of the number of features.
Finally, I don't see a valid common case where this would happen for the second dimension. I think we can remove this defensive coding condition.
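The pitfall described above can be reproduced without dask using a small mock (the class names and the 120-row example are made up for illustration): the first shape entry is a non-integral delayed stand-in, while column access returns a full column, so len(df[0]) reports samples rather than features.

```python
import numbers

class _MockDelayed:
    """Stand-in for a dask ``Delayed`` scalar: notably, not an int."""

class _MockDaskFrame:
    """Minimal mock of a dask DataFrame with 120 rows and 2 columns."""
    shape = (_MockDelayed(), 2)

    def __getitem__(self, col):
        # Selecting a column yields a Series-like object of length n_samples.
        return list(range(120))

df = _MockDaskFrame()
print(isinstance(df.shape[0], numbers.Integral))  # False: first dim is delayed
print(isinstance(df.shape[1], numbers.Integral))  # True: plain int
print(len(df[0]))  # 120: the number of samples, not the number of features
```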
Just for dask dataframes, should we get the first row with the following?

if hasattr(X, 'iloc'):
    first_sample = X.iloc[0, :]
else:
    first_sample = X[0]

(Rename to first_sample to be a little more clear.)
I don't think that's necessary: I see no reason why X.shape[1] would be Delayed in a dask dataframe. Only the first dimension lazily depends on the result of index-based partitioning, for instance after a groupby operation.
I applied my own suggestions to move this PR forward. I will fix the CI if they broke anything... which they did :)
ogrisel
left a comment
LGTM.
@thomasjpfan @NicolasHug let me know if you agree with the changes I pushed. In particular: #19633 (comment)
I updated the error message to display the type when the first sample is a string or bytes.
lorentzenchr
left a comment
Mostly questions for my own better understanding.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@ogrisel As there have been changes since your approval: are we good to merge?
ogrisel
left a comment
LGTM with the latest changes. Let's merge.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Reference Issues/PRs
Related to #19555
What does this implement/fix? Explain your changes.
This PR adds a _num_features helper that is similar to _num_samples but for features:

_num_features does not actually check that every single row has the same number of elements. The validation would either be done beforehand, like by _validate_data, or validation is delegated to another estimator.
_validate_data will always try to get the number of features based on X without needing ensure_2d. In other words, _check_n_features will validate when it can. _num_features itself does not validate, but I am assuming the validation would be done with check_array at some point.

Any other comments?
This PR enables #19555 to call _check_n_features when cv="prefit" to make sure that the prefitted estimator and the CalibrationClassiferCV are compatible.

In the end, I am thinking about feature names. I am working toward a simple method or function such as _check_features(X) that any estimator can call, that would do the "correct" thing when it comes to n_features_in_ and feature_names_in_.