[MRG] ENH Remove ignored_features in KBinsDiscretizer#11467
[MRG] ENH Remove ignored_features in KBinsDiscretizer#11467TomDLT merged 9 commits intoscikit-learn:discretefrom qinhanmin2014:remove-ignored-features
Conversation
|
What is the benefit of using _transform_selected? |
|
I'd be fine with storing the fitted encoder to inverse_transform. But that's an enhancement to consider later |
No apparent benefits (maybe just avoid some duplicate code) is the same as |
|
The use the latter. Much simpler to read and maintain.
We can then, if we wish, revert all the changes to _transform_selected
currently in discrete (but they're also fairly harmless, and
_transform_selected might be removed in v0.22)
|
|
Also, it may be unnecessary to keep _transform as a separate function
…On 11 July 2018 at 11:04, Joel Nothman ***@***.***> wrote:
The use the latter. Much simpler to read and maintain.
We can then, if we wish, revert all the changes to _transform_selected
currently in discrete (but they're also fairly harmless, and
_transform_selected might be removed in v0.22)
|
qinhanmin2014
left a comment
There was a problem hiding this comment.
ping @jnothman ready for review
| Xt = _transform_selected(X, self._transform, self.dtype, | ||
| self.transformed_features_, copy=True, | ||
| retain_order=True) | ||
| X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
We use X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES) in _transform_selected.
Here, I remove accept_sparse because we don't support sparse input here. I remove copy because I don't find it useful. I can't remove dtype because the result will change.
There was a problem hiding this comment.
check_array is already called in _validate_X_post_fit.
Also, copy=True is necessary since X is modified inplace.
This should probably be tested.
There was a problem hiding this comment.
I also realize that self.dtype is not used anymore.
We can probably remove it completely.
|
|
||
| np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf]) | ||
|
|
||
| You can combine ``KBinsDiscretizer`` with ``ColumnTransformer`` if you |
There was a problem hiding this comment.
Probably best to use a :class: reference here
|
ping @TomDLT for a second review if you have time :) |
TomDLT
left a comment
There was a problem hiding this comment.
This is much cleaner, thanks!
Just a few changes necessary
| Xt = _transform_selected(X, self._transform, self.dtype, | ||
| self.transformed_features_, copy=True, | ||
| retain_order=True) | ||
| X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
check_array is already called in _validate_X_post_fit.
Also, copy=True is necessary since X is modified inplace.
This should probably be tested.
| Xt = _transform_selected(X, self._transform, self.dtype, | ||
| self.transformed_features_, copy=True, | ||
| retain_order=True) | ||
| X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
I also realize that self.dtype is not used anymore.
We can probably remove it completely.
|
Thanks @TomDLT for the review. (I'm waiting for CI, not expecting to get a response so quickly :))
I choose to remove _validate_X_post_fit. Firstly, this can avoid duplicate check_array in transform. Secondly, _validate_X_post_fit block us from supporting inverse_transform for encoders other than ordinal.
Thanks, updated with a test. (Seems that there's not such test in the common test? not 100% sure though. )
Thanks, removed. I also let KBinsDiscretizer go through the common test :) |
|
|
||
| Xt = self._validate_X_post_fit(Xt) | ||
| trans = self.transformed_features_ | ||
| Xt = check_array(Xt, dtype='numeric') |
There was a problem hiding this comment.
We should probably use FLOAT_DTYPES here, since we modify Xt inplace and we may want to put float in it.
|
|
||
| bin_edges = self.bin_edges_[trans] | ||
| for jj in range(X.shape[1]): | ||
| Xt = X.copy() |
There was a problem hiding this comment.
Why no using the copy parameter of check_array?
It would avoid a double copy if X is not a float array.
This applies also for inverse_transform.
|
Thanks @TomDLT. Comments addressed. I need to have a more thorough understanding of our check_array :) |
Now we have
ColumnTransformer, so we don't need to supportignored_featuresinKBinsDiscretizer(as we've done inOneHotEncoder). See:#9342 (comment)
#9342 (comment)
I think this solution will work, but we still have things to consider:
(1) Is it good to use
_transform_selected? I think it's acceptable, since with defaultselected="all", it will simply validate the input withcheck_arrayand return directly.(2) Should we support
inverse_transformfor encoders other thanordinal? To support this, one way is to store the fitted OntHotEncoder (or simply build a new one and setcategories_, so that it can supportinverse_transformwithout fitting). The other way is to borrow some code from inverse_transform of OntHotEncoder.