[MRG] ENH Remove ignored_features in KBinsDiscretizer by qinhanmin2014 · Pull Request #11467 · scikit-learn/scikit-learn

qinhanmin2014 · 2018-07-10T13:44:18Z

Now we have ColumnTransformer, so we don't need to support ignored_features in KBinsDiscretizer (as we've done in OneHotEncoder). See:
#9342 (comment)
#9342 (comment)
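For readers who want to see the replacement pattern, here is a minimal sketch of binning only some columns through ColumnTransformer instead of ignored_features. The data, the step name "binned", and the choice of n_bins=2 with strategy="uniform" are assumptions made for illustration, not something taken from this PR.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

# Toy data: bin columns 0 and 1, leave column 2 alone.
X = np.array([[-2.0, 1.0, 10.0],
              [-1.0, 2.0, 20.0],
              [0.0, 3.0, 30.0],
              [1.0, 4.0, 40.0]])

ct = ColumnTransformer(
    [("binned",
      KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform"),
      [0, 1])],
    remainder="passthrough",  # untouched columns are appended after the binned ones
)
Xt = ct.fit_transform(X)
```

Note that the output column order is "transformed columns first, then the remainder", so this is not a drop-in match for the old retain_order behaviour.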

I think this solution will work, but we still have things to consider:
(1) Is it good to use _transform_selected? I think it's acceptable, since with default selected="all", it will simply validate the input with check_array and return directly.
(2) Should we support inverse_transform for encoders other than ordinal? To support this, one way is to store the fitted OneHotEncoder (or simply build a new one and set categories_, so that it can support inverse_transform without fitting). The other way is to borrow some code from the inverse_transform of OneHotEncoder.
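A rough sketch of the "store the fitted encoder" idea from point (2), done outside the estimator so it does not depend on any private code. The two-step layout (an ordinal KBinsDiscretizer followed by a separately fitted OneHotEncoder) and all variable names are assumptions for illustration; this is not how the PR implements anything.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

X = np.array([[-2.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 4.0]])

# Step 1: ordinal bin codes.
ordinal = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform").fit(X)
codes = ordinal.transform(X)

# Step 2: keep a fitted OneHotEncoder around so one-hot output can be undone.
encoder = OneHotEncoder(
    categories=[np.arange(n, dtype=float) for n in ordinal.n_bins_]
).fit(codes)
onehot = encoder.transform(codes)

# Inverse: one-hot -> ordinal codes -> approximate original values.
recovered_codes = encoder.inverse_transform(onehot)
X_approx = ordinal.inverse_transform(recovered_codes)
```

The last step is lossy by construction: the discretizer can only map a bin code back to a representative value for that bin, not to the original input.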

qinhanmin2014 · 2018-07-10T14:40:04Z

ping @jnothman @TomDLT ready for review :)

jnothman · 2018-07-10T23:17:32Z

What is the benefit of using _transform_selected?

jnothman · 2018-07-10T23:18:44Z

I'd be fine with storing the fitted encoder to inverse_transform. But that's an enhancement to consider later

qinhanmin2014 · 2018-07-11T00:45:36Z

> What is the benefit of using _transform_selected?

No apparent benefits (maybe just avoid some duplicate code)

Xt = _transform_selected(X, self._transform, self.dtype, copy=True, retain_order=True)

is the same as

X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES)
Xt = self._transform(X)
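To make the second form concrete, here is a small sketch of what that check_array call does to the input dtype. The arrays are made up; accept_sparse is left out because only dense input is shown.

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

X_int = np.array([[1, 2], [3, 4]])
X_f32 = X_int.astype(np.float32)

# Integer input is not in FLOAT_DTYPES, so it is converted to the first
# entry of the tuple (float64).
checked_int = check_array(X_int, copy=True, dtype=FLOAT_DTYPES)

# float32 is already one of FLOAT_DTYPES, so its dtype is preserved.
checked_f32 = check_array(X_f32, copy=True, dtype=FLOAT_DTYPES)
```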

jnothman · 2018-07-11T01:04:38Z

Then use the latter. Much simpler to read and maintain. We can then, if we wish, revert all the changes to _transform_selected currently in discrete (but they're also fairly harmless, and _transform_selected might be removed in v0.22)

jnothman · 2018-07-11T01:04:58Z

Also, it may be unnecessary to keep _transform as a separate function

qinhanmin2014

ping @jnothman ready for review

qinhanmin2014 · 2018-07-11T03:37:16Z

sklearn/preprocessing/_discretization.py

-        Xt = _transform_selected(X, self._transform, self.dtype,
-                                 self.transformed_features_, copy=True,
-                                 retain_order=True)
+        X = check_array(X, dtype=FLOAT_DTYPES)


We use X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES) in _transform_selected.
Here, I remove accept_sparse because we don't support sparse input here. I remove copy because I don't find it useful. I can't remove dtype because the result will change.

jnothman

LGTM. Thank you

jnothman · 2018-07-11T03:57:41Z

sklearn/preprocessing/_discretization.py


      np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])

+    You can combine ``KBinsDiscretizer`` with ``ColumnTransformer`` if you


Probably best to use a :class: reference here

qinhanmin2014 · 2018-07-11T07:26:21Z

ping @TomDLT for a second review if you have time :)

TomDLT

This is much cleaner, thanks!
Just a few changes necessary

TomDLT · 2018-07-11T09:20:02Z

sklearn/preprocessing/_discretization.py

-        Xt = _transform_selected(X, self._transform, self.dtype,
-                                 self.transformed_features_, copy=True,
-                                 retain_order=True)
+        X = check_array(X, dtype=FLOAT_DTYPES)


check_array is already called in _validate_X_post_fit.

Also, copy=True is necessary since X is modified inplace.
This should probably be tested.

TomDLT · 2018-07-11T09:21:20Z

sklearn/preprocessing/_discretization.py

-        Xt = _transform_selected(X, self._transform, self.dtype,
-                                 self.transformed_features_, copy=True,
-                                 retain_order=True)
+        X = check_array(X, dtype=FLOAT_DTYPES)


I also realize that self.dtype is not used anymore.
We can probably remove it completely.

TomDLT

LGTM

qinhanmin2014 · 2018-07-11T12:40:28Z

Thanks @TomDLT for the review. (I'm waiting for CI, not expecting to get a response so quickly :))

> check_array is already called in _validate_X_post_fit

I chose to remove _validate_X_post_fit. Firstly, this avoids a duplicate check_array in transform. Secondly, _validate_X_post_fit blocks us from supporting inverse_transform for encoders other than ordinal.

> Also, copy=True is necessary since X is modified inplace. This should probably be tested.

Thanks, updated with a test. (It seems there is no such test in the common tests? Not 100% sure though.)

> I also realize that self.dtype is not used anymore.

Thanks, removed.

I also let KBinsDiscretizer go through the common test :)
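The in-place concern raised above can be checked with a test along these lines. This is only a sketch: the function name and data are hypothetical and this is not the test that was added in the PR.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer


def check_input_not_modified():
    """Hypothetical check: transform/inverse_transform must not alter their input."""
    X = np.array([[-2.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 4.0]])
    X_before = X.copy()

    est = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform").fit(X)

    Xt = est.transform(X)
    np.testing.assert_array_equal(X, X_before)  # transform left X alone

    Xt_before = Xt.copy()
    Xinv = est.inverse_transform(Xt)
    np.testing.assert_array_equal(Xt, Xt_before)  # inverse_transform left Xt alone
    return Xt, Xinv


Xt, Xinv = check_input_not_modified()
```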

qinhanmin2014 · 2018-07-11T14:33:17Z

CIs are green. ping @jnothman for a final check or @TomDLT if you're confident enough to merge directly. Thanks :)

TomDLT · 2018-07-11T14:48:14Z

sklearn/preprocessing/_discretization.py


-        Xt = self._validate_X_post_fit(Xt)
-        trans = self.transformed_features_
+        Xt = check_array(Xt, dtype='numeric')


We should probably use FLOAT_DTYPES here, since we modify Xt inplace and we may want to put float in it.
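A small sketch of the failure mode behind this comment, using made-up integer bin codes: with dtype='numeric' an integer array stays integer, so a float written into it in place is silently truncated, whereas FLOAT_DTYPES converts it to float first.

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

codes = np.array([[0, 1], [1, 0]])  # integer bin codes

Xt_numeric = check_array(codes, copy=True, dtype="numeric")   # stays integer
Xt_float = check_array(codes, copy=True, dtype=FLOAT_DTYPES)  # becomes float64

# Writing a float (for example a bin centre) in place:
Xt_numeric[0, 0] = 0.5  # truncated, because the array holds integers
Xt_float[0, 0] = 0.5    # stored as given
```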

TomDLT · 2018-07-11T14:50:20Z

sklearn/preprocessing/_discretization.py


-        bin_edges = self.bin_edges_[trans]
-        for jj in range(X.shape[1]):
+        Xt = X.copy()


Why not use the copy parameter of check_array?
It would avoid a double copy if X is not a float array.

This also applies to inverse_transform.
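As an illustration of the suggestion, the sketch below lets check_array do both the validation and the copy in one call, so no separate X.copy() is needed. The arrays are invented for the example.

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

X_float = np.array([[0.0, 1.0], [2.0, 3.0]])
X_int = np.array([[0, 1], [2, 3]])

# One call validates, converts if needed, and returns an array that is
# safe to modify in place.
Xt_float = check_array(X_float, copy=True, dtype=FLOAT_DTYPES)
Xt_int = check_array(X_int, copy=True, dtype=FLOAT_DTYPES)

# In-place writes on the results do not reach the caller's arrays.
Xt_float[0, 0] = -1.0
Xt_int[0, 0] = -1.0
```

For the integer input, the dtype conversion already produces a new array, which is the case where an additional explicit X.copy() would copy the data a second time.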

qinhanmin2014 · 2018-07-11T15:35:17Z

Thanks @TomDLT. Comments addressed. I need to have a more thorough understanding of our check_array :)

qinhanmin2014 added 2 commits July 10, 2018 21:23

remove ignored features

9f9836d

flake8

f65cd17

scikit-learn deleted a comment from sklearn-lgtm Jul 10, 2018

qinhanmin2014 added 2 commits July 11, 2018 10:36

address comment

4ca583e

flake8

78bdb44

qinhanmin2014 commented Jul 11, 2018

jnothman approved these changes Jul 11, 2018

address comment

e7efee9

qinhanmin2014 added this to the 0.20 milestone Jul 11, 2018

TomDLT suggested changes Jul 11, 2018

qinhanmin2014 added 3 commits July 11, 2018 19:52

address comment

746a1f3

address comment

03945e7

flake8 sorry for the noise

ab2acdd

TomDLT approved these changes Jul 11, 2018

TomDLT reviewed Jul 11, 2018

address comment

b4c7cbe

TomDLT approved these changes Jul 11, 2018

TomDLT merged commit e4089d1 into scikit-learn:discrete Jul 11, 2018

qinhanmin2014 deleted the remove-ignored-features branch July 12, 2018 01:07

qinhanmin2014 mentioned this pull request Jul 12, 2018

KBinsDiscretizer : Support inverse_transform for encode other than ordinal #11489

Closed

