[MRG] ENH: Adds inverse_transform to ColumnTransformer #11639
thomasjpfan wants to merge 55 commits into scikit-learn:main from
Conversation
```python
for name, trans, cols in self.transformers:
    col_indices = _get_column_indices(X, cols)
    if not all_indexes.isdisjoint(set(col_indices)):
        self._invertible = (False,
```
This doesn't appear to be covered by tests.
`test_column_transformer_inverse_transform_with_overlaping_slices` should cover this; I added the ValueError message to the assertion.
```python
if not Xs:
    # All transformers are None
    return np.zeros((X.shape[0], 0))
```
I added test_column_transformer_inverse_transform_all_transformers_drop to cover this.
```python
inverse_Xs[:, indices] = inverse_X

if self._X_is_sparse:
    return sparse.csr_matrix(inverse_Xs)
```
I added an assert to test_column_transformer_sparse_array to cover this.
```
Returns
-------
Xt : array-like, shape = [n_samples, n_features]
```
Note that it is a pandas DataFrame when ..
```python
try:
    import pandas as pd
    return pd.DataFrame(inverse_Xs, columns=self._X_columns)
except ImportError:
```
This makes no sense. Either we promise pandas or we don't. Changing the return type on the basis of not having a dependency can't happen.
```python
input_indices = []
for name, trans, cols in self.transformers:
    col_indices = _get_column_indices(X, cols)
    if not all_indexes.isdisjoint(set(col_indices)):
```
I'm probably missing something. I can't see where you update all_indexes to be non-empty. If I'm right, add tests
Good catch! It turns out `pytest.raises(..., match=...)` was needed to properly test the exception.
```python
input_indices.append(col_indices)

# check remainder
remainder_indices = self._remainder[2]
```
We should make _remainder a namedtuple so that this is more legible
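A minimal sketch of the namedtuple suggestion, assuming illustrative field names (the PR itself stores the remainder as a plain tuple whose third element holds the column indices):

```python
from collections import namedtuple

# Hypothetical field names for illustration only; not the PR's actual code.
Remainder = namedtuple("Remainder", ["name", "transformer", "indices"])

remainder = Remainder(name="remainder", transformer="drop", indices=[3, 4])

# remainder.indices reads more legibly than remainder[2],
# while tuple indexing still works for backward compatibility.
print(remainder.indices)  # [3, 4]
print(remainder[2])       # [3, 4]
```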
```python
input_indices.append(remainder_indices)

self._input_indices = input_indices
self._X_features = X.shape[1]
```
Perhaps name this _n_features_in
jorisvandenbossche left a comment
Thanks for working on it!
Added a few comments (didn't look at the tests yet)
```python
col_indices_set = set(col_indices)
if not all_indexes.isdisjoint(col_indices_set):
    self._invert_error = ("Unable to invert: transformers "
                          "contain overlaping columns")
```
overlaping -> overlapping
I also think the "Unable to invert" part is already included in the message in inverse_transform?
```python
        "contain overlaping columns")
    return
if trans == 'drop':
    self._invert_error = "'{}' drops columns".format(name)
```
I would add something explicitly saying that dropping columns is not supported
```python
    Private function to calcuate indicies for inverse_transform
    """
    # checks for overlap
    all_indexes = set()
```
minor nit, but maybe also use 'indices' instead of 'indexes' since all other variables do that?
```python
self._input_indices = input_indices
self._n_features_in = X.shape[1]
self._X_columns = X.columns if hasattr(X, 'columns') else None
```
maybe getattr(X, 'columns', None) ?
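For illustration, `getattr` with a default collapses the `hasattr` ternary into a single call. `FakeFrame` below is a stand-in for a pandas DataFrame, used only to keep the example self-contained:

```python
import numpy as np

class FakeFrame:
    # Stand-in for a pandas DataFrame, which exposes a .columns attribute.
    columns = ["a", "b"]

def get_columns(X):
    # Equivalent to: X.columns if hasattr(X, 'columns') else None
    return getattr(X, "columns", None)

print(get_columns(FakeFrame()))       # ['a', 'b']
print(get_columns(np.zeros((1, 2))))  # None (ndarrays have no .columns)
```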
```python
self._output_indices = []
cur_index = 0

Xs = self._fit_transform(X[0:1], None, _transform_one, fitted=True)
```
instead of transforming here again, we could also save the dimensions of the outputs in self.fit_transform itself?
or actually, you can also pass `Xs` if we want to keep the code here?
That is a great idea! Thank you!
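The suggestion above — recording each transformer's output width during fit_transform instead of re-transforming a slice of X — could be sketched like this (a simplified illustration, not the PR's actual code):

```python
import numpy as np

def record_output_indices(Xs):
    """Map each transformer's output block to a slice of the stacked result."""
    output_indices = []
    cur = 0
    for X in Xs:
        n_cols = X.shape[1]
        output_indices.append(slice(cur, cur + n_cols))
        cur += n_cols
    return output_indices

# Two transformer outputs: 3 columns and 1 column.
slices = record_output_indices([np.zeros((2, 3)), np.zeros((2, 1))])
print(slices)  # [slice(0, 3, None), slice(3, 4, None)]
```

With these slices saved on the estimator, inverse_transform can route each output block back to its transformer without calling transform again.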
```python
trans = FunctionTransformer(
    validate=False, accept_sparse=True, check_inverse=False)

inv_transformers.append((name, trans, sub, get_weight(name)))
```
it seems name is not used below, so not needed to pass it?
```python
    inverse_Xs = sparse.lil_matrix((Xs[0].shape[0],
                                    self._n_features_in))
else:
    inverse_Xs = np.zeros((Xs[0].shape[0], self._n_features_in))
```

```python
else:
    inverse_Xs[:, indices] = inverse_X.toarray()
else:
    if inverse_X.ndim == 1:
```
Are you referring to `inverse_Xs[:, indices] = ...` or `inverse_X.ndim == 1`?
`inverse_Xs[:, indices] = ...` runs when `inverse_X` is sparse and `X` is not sparse. `test_column_transformer_sparse_array` was updated to test this. `inverse_X.ndim == 1` runs when `inverse_X` is not sparse and only has one dimension. `test_column_transformer_sparse_stacking` tests for this use case.
I meant the second about inverse_x being 1-dimensional. I would think that any valid sklearn transformer should always return 2D output? (but maybe I am overlooking something). And so I would expect input for inverse_transform to have the same constraint.
preprocessing.LabelEncoder's transform and inverse_transform return 1-D arrays.
LabelEncoder shouldn't be used on X
This PR was updated to remove the check for 1D outputs.
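For context on the 1-D case discussed above, a 1-D transformer output can be aligned with a 2-D assignment by promoting it to a column first. This is illustrative only, since the PR ultimately removed the 1-D check:

```python
import numpy as np

inverse_Xs = np.zeros((3, 2))
inverse_X = np.array([1.0, 2.0, 3.0])  # a 1-D output such as LabelEncoder's

if inverse_X.ndim == 1:
    # reshape(-1, 1) turns shape (3,) into a single column of shape (3, 1)
    inverse_X = inverse_X.reshape(-1, 1)

# Assign the column block back into the reconstructed input.
inverse_Xs[:, [0]] = inverse_X
print(inverse_Xs[:, 0])  # [1. 2. 3.]
```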
Force-pushed 350652b to 81e3e68
Force-pushed b6db9ca to 360d09b
This pull request introduces 1 alert when merging 9d0cd00 into 7166cd5 - view on LGTM.com

new alerts:

Comment posted by LGTM.com
jnothman left a comment
I wonder if we could come up with a helpful example of this. I'm not sure it's necessary. Noting the available functionality in the user guide might be worthwhile.
Only a partial review.
```python
else:
    inverse_Xs[:, indices] = inverse_X.toarray()
else:
    if inverse_X.ndim == 1:
```

LabelEncoder shouldn't be used on X
sklearn/pipeline.py (Outdated)

```python
def _inverse_transform_one(transformer, X, weight, **fit_params):
    weight = weight or 1
```
I'd rather make 1 if weight is None else weight more explicitly... bool(weight) is not a good way to check for None
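The pitfall being flagged is that `weight or 1` treats any falsy weight, including a legitimate zero, as missing. A quick comparison (illustrative):

```python
def weight_or(weight):
    # Bool-based fallback: any falsy weight (including 0) becomes 1.
    return weight or 1

def weight_is_none(weight):
    # Explicit None check: only a truly missing weight becomes 1.
    return 1 if weight is None else weight

print(weight_or(0))          # 1  -- a zero weight is silently replaced
print(weight_is_none(0))     # 0
print(weight_is_none(None))  # 1
```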
A motivating example would be nice. I will see if there is a place to add it in

This modifies whats_new/v0.20.rst and should be updated

I have a use case internally where I want to use the preprocessor to process

@glemaitre thank you for the idea! Let's see if there is a way to integrate it into one of our examples 🤔

do you want to merge with master for another round of reviews?

@glemaitre What kind of insights do you get when you have the quantiles in the original scale?
```python
def _calculate_inverse_indices(self, X, Xs):
    """
    Private function to calcuate indices for inverse_transform
```

Minor typo:

```diff
- Private function to calcuate indices for inverse_transform
+ Private function to calculate indices for inverse_transform
```
Ouch. I don't remember what my use case was now. I should have directly given what I was programming at that time. It could have been linked to some neuroscience stuff, but I am unsure now.

Is this merge-able?

@Jude188 Not right now; do you have an example where this feature would be useful to you?

@thomasjpfan I've got a case where I'm using sklearn for only data preprocessing and not using an estimator. This is because the model that I have doesn't really have a … I'm sure I did a terrible job of explaining that!
Hello, any news on this PR? I'm working on interpretability and started adding an inverse_transform for ColumnTransformer myself. I stumbled upon this PR, and it would be extremely helpful to me.
Reference Issues/PRs
Fixes #11463
What does this implement/fix? Explain your changes.
- `inverse_transform` with overlap or `drop` will raise a `ValueError`
- `_calculate_inverse_indices` is used to connect indices from the output space back to the input space.
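The overlap rule described above can be sketched with set logic — a simplified illustration under the PR's stated semantics, not its exact implementation:

```python
def check_no_overlap(column_groups):
    """Raise ValueError if two transformers claim the same input column."""
    seen = set()
    for indices in column_groups:
        group = set(indices)
        if not seen.isdisjoint(group):
            raise ValueError("Unable to invert: transformers "
                             "contain overlapping columns")
        seen |= group

check_no_overlap([[0, 1], [2, 3]])      # disjoint groups: no error
try:
    check_no_overlap([[0, 1], [1, 2]])  # column 1 is claimed twice
except ValueError as exc:
    print(exc)
```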