ENH Makes ColumnTransformer more flexible by only checking for non-dropped columns#19263
Conversation
|
So think about this pipeline: clf = make_pipeline(ColumnTransformer(...., remaining=SequenceAnalyzer()), SGDClassifier())Assume all those remaining columns form a time series or a sort of a sequence which is understood by If we want But if I'm the only one who doesn't like this to be in |
lorentzenchr
left a comment
There was a problem hiding this comment.
I've reviewed the test cases so far. Are there more edge cases we might want to test?
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
| "data given during fit." | ||
| ) | ||
| Xs = self._fit_transform(X, None, _transform_one, fitted=True) | ||
| # ndarray was used for training |
There was a problem hiding this comment.
At this stage, the if-else is not enough not know that we trained/fitted on an ndarray. I think the error for the invalid combination "fit df and transform ndarray" will be thrown in self._fit_transform further below.
There was a problem hiding this comment.
I updated the comment to:
ndarray was used for fitting or transforming, thus we only check that n_features is consistent
In one of my later commits, this PR started to allow for fitting on dataframe and transforming on numpy arrays and vice vesa. This was to ensure backward compatibility. The new feature enabled by this PR is only active if fitting and transform are being done on dataframes.
lorentzenchr
left a comment
There was a problem hiding this comment.
LGTM.
It remains to adapt the docstring and maybe the user guide. When exactly is the order important, when is it not?
(And maybe also correct the docstring sentence "A callable is passed the input data X and can return any of the above.":eyes:)
| and the dataframe only has string column names, then transforming a dataframe | ||
| will use the column names to select the columns:: |
There was a problem hiding this comment.
| and the dataframe only has string column names, then transforming a dataframe | |
| will use the column names to select the columns:: | |
| and the dataframe has only string column names, then transforming a dataframe | |
| will use the column names to select the columns, no matter in which order they | |
| are:: |
The dataframe in transform also needs string column names, right?
There was a problem hiding this comment.
This PR doesn't check for the dataframe in transform to only string columns. The only requirement was for the dataframe in transform to have all the (str) column names that were required by fit.
I lean slightly toward the current behavior, but I am open to requiring all strings column in transform as well.
|
Thanks for proposing this PR! |
|
@thomasjpfan Could you resolve merge conflicts? Then we could label it as waiting for reviewers😏 |
ogrisel
left a comment
There was a problem hiding this comment.
LGTM, just tiny sugggestions for further improvement. Feel free to ignore if you believe they are not needed/helpful.
|
@thomasjpfan Thanks for taking care of this issue and making it happen for the upcoming 1.0 release. |
Reference Issues/PRs
Fixes #14251
What does this implement/fix? Explain your changes.
This PR enables
transformto only require non-dropped columns to exist in the input,X, regardless of the order. Also, dropped columns are not required intransform.