ENH Adds _num_features for array-likes #19633
Conversation
CC @ogrisel This PR implements the idea from #19555 (comment)
ogrisel
left a comment
Thanks for taking care of this @thomasjpfan. Here are some comments. Let me know what you think.
It would be great to improve the tests to check that the extended exception messages work as expected.
sklearn/utils/validation.py
if len(X.shape) <= 1:
    raise TypeError(message)
if isinstance(X.shape[1], numbers.Integral):
    return X.shape[1]
Shall we raise ValueError in the else branch of this condition? I am not even sure how we could meaningfully trigger this in tests or in a legit usage scenario.
Maybe we should always return X.shape[1] without the if isinstance(X.shape[1], numbers.Integral) condition.
I believe this code was inspired by the _num_samples case for .shape[0], where the first dimension is not always known: with dask dataframes you can get a delayed first dimension:

>>> df.shape
(Delayed('int-1d60df59-017f-49c3-80e5-f313b06e4c1d'), 120)

Furthermore, in this case _num_samples would fall back to calling len(df), which would naturally trigger the delayed computation.
For the _num_features case, we would fall back to doing len(df[0]). In the case of a dask dataframe, df[0] would be a Series holding the first column, and therefore len(df[0]) would return the number of samples instead of the number of features.
Finally, I don't see a valid common case where this would happen for the second dimension. I think we can remove this defensive coding condition.
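The pitfall described above can be reproduced without dask using a small mock (the class names and the 120-row example are made up for illustration): the first shape entry is a non-integral delayed stand-in, while column access returns a full column, so len(df[0]) reports samples rather than features.

```python
import numbers

class _MockDelayed:
    """Stand-in for a dask ``Delayed`` scalar: notably, not an int."""

class _MockDaskFrame:
    """Minimal mock of a dask DataFrame with 120 rows and 2 columns."""
    shape = (_MockDelayed(), 2)

    def __getitem__(self, col):
        # Selecting a column yields a Series-like object of length n_samples.
        return list(range(120))

df = _MockDaskFrame()
print(isinstance(df.shape[0], numbers.Integral))  # False: first dim is delayed
print(isinstance(df.shape[1], numbers.Integral))  # True: plain int
print(len(df[0]))  # 120: the number of samples, not the number of features
```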
Just for dask dataframes, should we get the first row with the following?

if hasattr(X, 'iloc'):
    first_sample = X.iloc[0, :]
else:
    first_sample = X[0]

(Rename to first_sample to be a little more clear.)
I don't think that's necessary: I see no reason why X.shape[1] would be Delayed in a dask dataframe. Only the first dimension lazily depends on the result of index-based partitioning, for instance after a groupby operation.
I applied my own suggestions to move this PR forward. I will fix the CI if they broke anything... which they did :)
ogrisel
left a comment
LGTM.
@thomasjpfan @NicolasHug let me know if you agree with the changes I pushed. In particular: #19633 (comment)
I updated the error message to display the type when the first sample is a string or bytes.
lorentzenchr
left a comment
Mostly questions for my own better understanding.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@ogrisel As there have been changes since your approval: are we good to merge?
ogrisel
left a comment
LGTM with the latest changes. Let's merge.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Reference Issues/PRs
Related to #19555
What does this implement/fix? Explain your changes.
This PR adds a _num_features helper that is similar to _num_samples but for features:

_num_features does not actually check that every single row has the same number of elements. The validation would either be done beforehand, like by _validate_data, or validation is delegated to another estimator.
_validate_data will always try to get the number of features based on X without needing ensure_2d. In other words, _check_n_features will validate when it can. _num_features itself does not validate, but I am assuming the validation would be done with check_array at some point.

Any other comments?
This PR enables #19555 to call _check_n_features when cv="prefit" to make sure that the prefitted estimator and the CalibrationClassiferCV are compatible.

In the end, I am thinking about feature names. I am working toward a simple method or function such as _check_features(X) that any estimator can call, that would do the "correct" thing when it comes to n_features_in_ and feature_names_in_.