ENH Makes ColumnTransformer more flexible by only checking for non-dropped columns by thomasjpfan · Pull Request #19263 · scikit-learn/scikit-learn

thomasjpfan · 2021-01-24T22:02:05Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR makes transform require only the non-dropped columns to be present in the input X, in any order. Columns that were dropped during fit no longer need to be present at transform time.

adrinjalali · 2021-01-25T21:12:47Z

So think about this pipeline:

clf = make_pipeline(ColumnTransformer(..., remainder=SequenceAnalyzer()), SGDClassifier())

Assume all those remaining columns form a time series or a sort of a sequence which is understood by SequenceAnalyzer, and SequenceAnalyzer can handle different lengths of those series (in the form of columns). Should ColumnTransformer then accept different lengths? This could happen if the upstream script returns a DataFrame which has the columns supporting maximum length of the series, which could differ between fit and transform here.

If we want ColumnTransformer to be very flexible, a user could argue this should also be supported I guess. (to be clear, I would rather not handle any of these in ColumnTransformer :D )

But if I'm the only one who doesn't like this to be in ColumnTransformer, I'm happy for it to be merged :)

doc/whats_new/v1.0.rst

lorentzenchr

I've reviewed the test cases so far. Are there more edge cases we might want to test?

sklearn/compose/tests/test_column_transformer.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

lorentzenchr

Another round.

sklearn/compose/_column_transformer.py

lorentzenchr · 2021-01-30T12:05:41Z

sklearn/compose/_column_transformer.py

-                "data given during fit."
-            )
-        Xs = self._fit_transform(X, None, _transform_one, fitted=True)
+            # ndarray was used for training


At this stage, the if-else is not enough to know that we trained/fitted on an ndarray. I think the error for the invalid combination "fit df and transform ndarray" will be raised in self._fit_transform further below.

I updated the comment to:

ndarray was used for fitting or transforming, thus we only check that n_features is consistent

In one of my later commits, this PR started to allow fitting on a dataframe and transforming on a numpy array, and vice versa. This was to ensure backward compatibility. The new feature enabled by this PR is only active if both fit and transform are done on dataframes.
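The branching described in this reply can be sketched roughly as follows. This is a stand-alone illustration, not the code in `_column_transformer.py`; the function `check_transform_input`, its parameters, and its return values are hypothetical names chosen for the sketch.

```python
def check_transform_input(fit_columns, non_dropped, n_features_in, X_columns, n_features):
    """Hypothetical sketch of the input check at transform time.

    fit_columns: column names seen in fit, or None if fit used an ndarray.
    non_dropped: names of fit columns used by a non-dropped transformer.
    n_features_in: number of columns seen in fit.
    X_columns: column names of the transform input, or None for an ndarray.
    n_features: number of columns in the transform input.
    """
    if fit_columns is not None and X_columns is not None:
        # Dataframe in both fit and transform: only non-dropped names must exist,
        # and their order does not matter.
        missing = set(non_dropped) - set(X_columns)
        if missing:
            raise ValueError(f"columns are missing: {sorted(missing)}")
        return "by_name"
    # An ndarray was used for fitting or transforming, so only the number
    # of features is checked for consistency.
    if n_features != n_features_in:
        raise ValueError(
            f"X has {n_features} features, but {n_features_in} were expected"
        )
    return "by_position"
```

For example, `check_transform_input(["a", "b", "c"], ["a", "b"], 3, ["b", "a"], 2)` takes the name-based branch, while passing `X_columns=None` falls back to the feature-count check.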

sklearn/compose/_column_transformer.py

lorentzenchr

LGTM.
It remains to adapt the docstring and maybe the user guide. When exactly is the order important, when is it not?
(And maybe also correct the docstring sentence "A callable is passed the input data X and can return any of the above.":eyes:)

lorentzenchr · 2021-01-31T19:06:36Z

doc/modules/compose.rst

+and the dataframe only has string column names, then transforming a dataframe
+will use the column names to select the columns::


Suggested change

-and the dataframe only has string column names, then transforming a dataframe
-will use the column names to select the columns::
+and the dataframe has only string column names, then transforming a dataframe
+will use the column names to select the columns, no matter in which order they
+are::

The dataframe in transform also needs string column names, right?

This PR doesn't check that the dataframe in transform has only string column names. The only requirement is that the dataframe in transform has all the (str) column names that were required by fit.

I lean slightly toward the current behavior, but I am open to requiring all-string column names in transform as well.
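The asymmetry discussed here (an all-strings condition on the fit dataframe, but only a "required names are present" condition on the transform dataframe) might be sketched like this. Both helpers are hypothetical names for illustration and are not functions in scikit-learn.

```python
def all_string_columns(columns):
    """Hypothetical check on the fit dataframe: every column name is a str."""
    return len(columns) > 0 and all(isinstance(name, str) for name in columns)


def missing_required(required_names, transform_columns):
    """Hypothetical check on the transform dataframe.

    Only the names required by fit must be present; other columns, including
    ones with non-string names, are ignored.
    """
    return sorted(set(required_names) - set(transform_columns))


# Fit dataframe with only string names: name-based selection applies.
fit_ok = all_string_columns(["a", "b", "c"])

# Transform dataframe with an extra integer-named column is still accepted,
# because every required name is present.
still_fine = missing_required(["a", "b"], ["b", 0, "a"])
```

Requiring all-string column names in transform as well would amount to also calling the first helper on the transform columns.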

rth · 2021-02-01T20:54:44Z

Thanks for proposing this PR!

lorentzenchr · 2021-03-27T08:10:59Z

@thomasjpfan Could you resolve merge conflicts? Then we could label it as waiting for reviewers😏

ogrisel

LGTM, just tiny suggestions for further improvement. Feel free to ignore if you believe they are not needed/helpful.

doc/modules/compose.rst

sklearn/compose/_column_transformer.py

lorentzenchr · 2021-04-29T19:47:56Z

@thomasjpfan Thanks for taking care of this issue and making it happen for the upcoming 1.0 release.

thomasjpfan added 3 commits January 24, 2021 16:29

ENH Makes ColumnTransformer more flexible

541d99b

Merge remote-tracking branch 'upstream/main' into column_order_column…

fd3ba6b

…_transformer

ENH Reenable previous behavior

b17e472

github-actions bot added the module:compose label Jan 24, 2021

thomasjpfan marked this pull request as draft January 24, 2021 23:52

FIX Only use string columns when all columns are strings

5bfc6fd

thomasjpfan marked this pull request as ready for review January 25, 2021 00:33

DOC Adds whats_new

3a88113

thomasjpfan marked this pull request as draft January 25, 2021 01:32

FIX Handle single string selection

279502b

thomasjpfan marked this pull request as ready for review January 25, 2021 01:51

thomasjpfan added 2 commits January 24, 2021 20:55

FIX Better handling of singular selection

1496d8a

Merge remote-tracking branch 'upstream/main' into column_order_column…

7ea4fb8

…_transformer

CLN Makes diff nicer to look at

ab282b1

jnothman reviewed Jan 26, 2021

doc/whats_new/v1.0.rst

DOC Apply suggestion

1501f3b

lorentzenchr reviewed Jan 28, 2021

sklearn/compose/tests/test_column_transformer.py

sklearn/compose/tests/test_column_transformer.py

sklearn/compose/tests/test_column_transformer.py

Apply suggestions from code review

3bad1c2

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

lorentzenchr reviewed Jan 30, 2021

thomasjpfan added 5 commits January 30, 2021 13:16

Merge remote-tracking branch 'upstream/main' into column_order_column…

d79fcb7

…_transformer

CLN Addresses comments

68e598b

DOC Clarify comment

5344f28

DOC Clarify comment

00eede8

CLN Better names and cleaner logic

b3265af

lorentzenchr approved these changes Jan 31, 2021

thomasjpfan added 2 commits January 31, 2021 13:22

Merge remote-tracking branch 'upstream/main' into column_order_column…

773a401

…_transformer

DOC Adds paragraph in user guide

355ffa7

lorentzenchr reviewed Jan 31, 2021

Merge remote-tracking branch 'upstream/main' into column_order_column…

794852e

…_transformer

ENH Rename private attributes

92fd0b2

lorentzenchr added this to the 1.0 milestone Feb 22, 2021

Merge branch 'main' into column_order_column_transformer

48f3dbd

rth self-requested a review March 27, 2021 10:09

rth and others added 2 commits March 27, 2021 11:14

Minor merge conflict fixes

3295093

Merge remote-tracking branch 'upstream/main' into column_order_column…

e7f7b0c

…_transformer

ogrisel approved these changes Apr 27, 2021

doc/modules/compose.rst

sklearn/compose/_column_transformer.py

thomasjpfan added 4 commits April 29, 2021 14:11

CLN Updates column name

e764239

CLN Uses columns instead

186382b

Merge remote-tracking branch 'upstream/main' into column_order_column…

76bf128

…_transformer

CLN Uses columns instead

f812aeb

lorentzenchr merged commit 9c3b402 into scikit-learn:main Apr 29, 2021

lorentzenchr mentioned this pull request Aug 26, 2021

ENH Adds feature_names_in_ to ColumnTransformer #20839

Merged


6 participants