ColumnTransformer generalization to work on empty lists #12084
ColumnTransformer generalization to work on empty lists #12084rth merged 9 commits intoscikit-learn:masterfrom
Conversation
rth
left a comment
There was a problem hiding this comment.
Thanks for making this PR!
Could you please change the title to a more meaningful description of what this PR does? That would help PR triage.
|
|
||
| # test case that ensures that the column transformer does also work when | ||
| # a given transformer doesn't have any columns to work on | ||
| ct = ColumnTransformer([('trans1', Trans(), [0, 1]), |
There was a problem hiding this comment.
Maybe it could be a separate test? Also it might be worth checking that
ColumnTransformer([('trans1', Trans(), [])])is an identity operation?
There was a problem hiding this comment.
Is this what you mean?
ct = ColumnTransformer([('trans', Trans(), [])],
remainder='passthrough')
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2
jnothman
left a comment
There was a problem hiding this comment.
Please update documentation (of transformers_ at least).
I've not looked at tests.
| next(transformers) | ||
| trans = 'passthrough' | ||
| elif hasattr(column, '__len__') and len(column) == 0: | ||
| trans = old |
jnothman
left a comment
There was a problem hiding this comment.
I'd like this merged before 0.20rc2...
| next(transformers) | ||
| trans = 'passthrough' | ||
| elif hasattr(column, '__len__') and len(column) == 0: | ||
| trans = old |
| def test_column_transformer_empty_columns(): | ||
| # test case that ensures that the column transformer does also work when | ||
| # a given transformer doesn't have any columns to work on | ||
| X_array = np.array([[0, 1, 2], [2, 4, 6]]).T |
There was a problem hiding this comment.
please also test with a DataFrame if pandas is available
There was a problem hiding this comment.
This one is in the fn test_column_transformer_dataframe (needs to be skipped if no pandas)
|
Sorry for my confusion about |
1 similar comment
|
Sorry for my confusion about |
| remainder='passthrough') | ||
| assert_array_equal(ct.fit_transform(X_array), X_res_both) | ||
| assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both) | ||
| assert len(ct.transformers_) == 2 # including remainder |
| assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both) | ||
| assert len(ct.transformers_) == 2 # including remainder | ||
|
|
||
| fixture = np.array([[],[],[]]) |
| remainder='drop') | ||
| assert_array_equal(ct.fit_transform(X_array), fixture) | ||
| assert_array_equal(ct.fit(X_array).transform(X_array), fixture) | ||
| assert len(ct.transformers_) == 2 # including remainder |
|
Travis CI is failing.. |
|
I (presumably) fixed the pep8 problems. Should this issue be resolved first / as well? |
jnothman
left a comment
There was a problem hiding this comment.
Please check more explicitly the values in transformers_ and the presence of 'empty'
| assert_array_equal(ct.fit_transform(X_df), X_res_both) | ||
| assert_array_equal(ct.fit(X_df).transform(X_df), X_res_both) | ||
| assert len(ct.transformers_) == 2 | ||
| assert ct.transformers_[-1][0] != 'remainder' |
There was a problem hiding this comment.
Shouldn't you check that this is 'empty'?
| estimator, 'drop', or 'passthrough'. If there are remaining columns, | ||
| the final element is a tuple of the form: | ||
| estimator, 'drop', or 'passthrough'. Note that the column list is | ||
| allowed to be empty. In that case the transformers will not be fitted. |
There was a problem hiding this comment.
Out of date. "Where there were no columns selected, this string 'empty' will stand in place of the transformer."
|
@jnotham I am still not fully convinced of the 'empty' string, see #12071 (comment) |
|
I find 'empty' more explicit, while an estimator without its fitted attributes might be confusing. But I don't mind very much. |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
In addition to an empty list, I think we should also check for an all-False boolean array
|
Okay, let's do the unfitted transformer. Shrug. |
|
@janvanrijn do you have time to update this today? If not, I can also push some updates (just asking, as we want to get this in as fast as possible, it is one of the remaining blockers for the release) |
|
I am extremely busy with some other task atm, but i can make some time for this. If you have more time, and as you guys have a better overview of what exactly should happen, feel free to alter/update/change/copy from/remove/bypass this pr. |
|
@janvanrijn OK, I pushed some updates to this PR. |
|
@jnothman after updating this PR to also check for all-False boolean arrays, I realized that also slices can be empty .. But to fully properly determine of they are empty, you need the X data, which I actually just removed from |
|
I think we should determine this only during fit. I'm not sure how you would consistently determine this during How about we release and do this in 0.20.1? There's always a .1 any way... |
Well, in most cases this will not be a problem in practice. Because if the structure of your data does not change, this determination of empty selection will be consistent, and if the structure of your data does change between fit and transform, your transform will be fucked up anyway.
If you want to release 0.20 the coming hours, I would propose to merge this as is (it already fixes the main use case, just not yet empty slices), and then we can expand the fix for 0.20.1 |
|
I am able to run the random bot now without problems. Thanks for merging, this is really great |
|
Opened a follow-up issue here: #12162 |
* tag '0.20.0': (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* releases: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* dfsg: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
Reference Issues/PRs
Fixes #12071
What does this implement/fix? Explain your changes.
Column Transformer can handle 'empty' columns.
Any other comments?
main code by @jorisvandenbossche
test cases by me