ENH Adds array_out="pandas" to transformers in preprocessing module#20100
ENH Adds array_out="pandas" to transformers in preprocessing module#20100thomasjpfan wants to merge 3 commits intoscikit-learn:mainfrom
Conversation
adrinjalali
left a comment
There was a problem hiding this comment.
should get_feature_names or get_feature_names_out be deprecated and become private?
Overall it looks quite good to me.
| X : array-like | ||
| The input samples. | ||
| reset : bool, default=True | ||
| If True, the `n_feature_names_` attribute is set to the feature |
There was a problem hiding this comment.
| If True, the `n_feature_names_` attribute is set to the feature | |
| If True, the `feature_names_in_` attribute is set to the feature |
| returned. If "default", an array-like without feature names is | ||
| returned. |
There was a problem hiding this comment.
it's probably worth explaining why this is not a numpy array, and sometimes a sparse array.
| if y is None: | ||
| # fit method of arity 1 (unsupervised transformation) | ||
| return self.fit(X, **fit_params).transform(X) | ||
| fitted = self.fit(X, **fit_params) |
There was a problem hiding this comment.
Yes that also confused me. I think it's always self, right? Or at least should be.
| skipped_attributes = {'x_scores_', # For PLS, TODO remove in 1.1 | ||
| 'y_scores_'} # For PLS, TODO remove in 1.1 | ||
| 'y_scores_', # For PLS, TODO remove in 1.1 | ||
| 'feature_names_in_'} # Ignore for now |
There was a problem hiding this comment.
are we gonna fix this in this PR? We don't have to, but if we don't should create a separate issue for a sprint or something.
| return get_feature_names_out_callable | ||
|
|
||
|
|
||
| def _make_array_out(X_out, index, get_feature_names_out, *, |
There was a problem hiding this comment.
| def _make_array_out(X_out, index, get_feature_names_out, *, | |
| def _make_array_out(X_out, *, index, get_feature_names_out, |
There was a problem hiding this comment.
also, wouldn't it make more sense for this function to accept the feature_names_out instead of get_feature names_out?
There was a problem hiding this comment.
I guess not, because of the way it's used in the decorator?
|
|
||
| def _make_array_out(X_out, index, get_feature_names_out, *, | ||
| array_out="default"): | ||
| """Construct array container based on global configuration. |
There was a problem hiding this comment.
| """Construct array container based on global configuration. | |
| """Construct array container based on the value of `array_out`. |
I think?
| returned. If "default", an array-like without feature names is | ||
| returned. |
There was a problem hiding this comment.
| returned. If "default", an array-like without feature names is | |
| returned. | |
| returned. If "default", `X_out` is returned unmodified. |
There was a problem hiding this comment.
or should we actually make sure the output is not a dataframe?
| if array_out == "default": | ||
| return X_out | ||
|
|
||
| estimator, X_orig = args[0], args[1] |
There was a problem hiding this comment.
silly question, what would these values be if the user calls est.transform(array_out='dataframe', X=X)?
There was a problem hiding this comment.
tuple unpacking error? args is only est in that case, right?
|
From the dev meeting: We discussed how we want to see how the API of this PR compares to using a I'll come up this other implementation within a week and write up how the API compares to this PR when used with meta-estimators. |
amueller
left a comment
There was a problem hiding this comment.
I was vaguely remembering having an implementation with a constructor parameter, but I think it was actually a constructor parameter that gets the input feature names, not whether to produce pandas dataframes lol... there have been sooo many iterations of this now. This looks good though.
| - |Feature| :class:`preprocessing.OrdinalEncoder` supports passing through | ||
| missing values by default. :pr:`19069` by `Thomas Fan`_. | ||
|
|
||
| - |Feature| Transformers in the :mod:`sklearn.preprocessing` have a `array_out` |
There was a problem hiding this comment.
is array_out a good name? that seems to imply... arrays. output_format? or output?
There was a problem hiding this comment.
I'd be happy with either output or output_format
| if y is None: | ||
| # fit method of arity 1 (unsupervised transformation) | ||
| return self.fit(X, **fit_params).transform(X) | ||
| fitted = self.fit(X, **fit_params) |
There was a problem hiding this comment.
Yes that also confused me. I think it's always self, right? Or at least should be.
| return fitted.transform(X) | ||
|
|
||
| # array_out != "default" | ||
| transform_params = inspect.signature(fitted.transform).parameters |
There was a problem hiding this comment.
Can we not just pass it, or is the error message not good in that case?
There was a problem hiding this comment.
do you mean "can we just pass it"? If we don't pass it when it doesn't exist, it's a silent bug, and I think just passing it gives an error message which can be confusing to many people.
| if array_out == "default": | ||
| return X_out | ||
|
|
||
| estimator, X_orig = args[0], args[1] |
There was a problem hiding this comment.
tuple unpacking error? args is only est in that case, right?
There is still a use case for
poly = PolynomialFeatures().fit(X_df)
# Uses `feature_names_in_` by default as input
poly.get_feature_names()
# without `get_feature_names` a user would need to pass in some dummy data
# to get the feature names:
poly.transform(X_df).columns
# Another API would be to store `feature_names_out_`, which can be a property
# so we do not need to store another array of strings:
poly.feature_names_out_I prefer a |
|
I don't have a strong opinion between |
|
This is superseded by #23734, right? Can we close then? |
|
Yes this has been superseded by #23734 and we can close. |
Reference Issues/PRs
Toward #5523
Alternative to #16772
What does this implement/fix? Explain your changes.
Adds
array_out="pandas"to the transformers in preprocessing module. Here are the decision decisions I made:transformuses the index of the input.feature_names_in_is a sequence of names and does not need to be a numpy array.pd.DataFrame(X)uses integer column names andDictVectorizer.get_feature_namescan output numerical feature names.Any other comments?
Binarizernow have state because ofn_features_in_and nowfeature_names_in_. ShouldBinarizer.transform(df_test, array_out="pandas")use the column names ofdf_testifBinarizer.fitis not called?