[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 #11042
Conversation
glemaitre
left a comment
There are 2 calls to _transform_selected which use the default dtype. Check whether any of those tests run into trouble.
sklearn/preprocessing/data.py
Outdated
- def _transform_selected(X, transform, selected="all", copy=True):
+ def _transform_selected(X, transform, dtype=np.float64, selected="all", copy=True):
I don't think that there is a reason to have a default dtype
sklearn/preprocessing/data.py
Outdated
  transform : callable
      A callable transform(X) -> X_transformed

+ dtype : number type, default=np.float
number type -> dtype, ...
Could you also change this parameter in the OneHotEncoder docstring
sklearn/preprocessing/data.py
Outdated
@@ -1872,9 +1875,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
      X_not_sel = X[:, ind[not_sel]]
Instead of changing the dtype below, I think that you only need to call astype(dtype) on X_not_sel.
The concatenation will then be done with arrays of the same type. You can add a small comment above the line:
"The columns of X which are not transformed need to be cast to the desired dtype before concatenation. Otherwise, the stacking will cast to the higher-precision dtype."
Feel free to shorten the comment
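For illustration, the upcasting behaviour that the suggested comment describes can be reproduced with plain NumPy (the shapes and dtypes here are arbitrary examples, not the encoder's actual data):

```python
import numpy as np

# Stacking an int8 block with a float64 block promotes the result
# to the higher-precision dtype, float64.
X_sel = np.ones((2, 2), dtype=np.int8)         # transformed columns
X_not_sel = np.ones((2, 2), dtype=np.float64)  # not-transformed columns
assert np.hstack([X_sel, X_not_sel]).dtype == np.float64

# Casting X_not_sel to the desired dtype first keeps the result there.
X_not_sel = X_not_sel.astype(np.int8)
assert np.hstack([X_sel, X_not_sel]).dtype == np.int8
```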
assert_equal(interact.powers_.shape, (interact.n_output_features_,
                                      interact.n_input_features_))
Please revert all the spacing changes. Even if they fix PEP8 issues, we tend not to modify unrelated parts of the code base, since that can create merge conflicts in other PRs. You can revert the other spacing changes below as well.
  _check_one_hot(X, X2, cat, 5)

+ def test_one_hot_encoder_mixed_input_given_type():
Could you use pytest.mark.parametrize to make a single test covering the different dtypes? Also use a bare assert instead of assert_equal. Basically something like this:
@pytest.mark.parametrize(
    "output_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "input_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "sparse",
    [True, False]
)
def test_one_hot_encoder_preserve_type(input_dtype, output_dtype, sparse):
    X = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=input_dtype)
    transformer = OneHotEncoder(categorical_features=[0, 1],
                                dtype=output_dtype, sparse=sparse)
    X_trans = transformer.fit_transform(X)
    assert X_trans.dtype == output_dtype
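The property this test checks can be illustrated without scikit-learn at all: a sparse matrix built with data of an explicit dtype keeps that dtype. This standalone SciPy sketch is only an illustration of the expected behaviour, not the encoder's code:

```python
import numpy as np
from scipy import sparse

# Build a 2x4 one-hot-style sparse matrix from int32 data:
# rows 0 and 1 have their single "hot" entry in columns 0 and 3.
rows = np.array([0, 1])
cols = np.array([0, 3])
data = np.ones(2, dtype=np.int32)
X_trans = sparse.csr_matrix((data, (rows, cols)), shape=(2, 4))
assert X_trans.dtype == np.int32  # dtype of `data` is preserved
```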
@DanielMorales9 Could you address the comments?

@glemaitre sure

I've added the requested changes. Sorry for the delay. I am happy to contribute 😄

The CI is failing, can you check?

@DanielMorales9 I made the change regarding PEP8. I am not sure that the error regarding the kmeans was related. This is strange.

@jnothman Could you have a look?
+ @pytest.mark.parametrize("output_dtype", [np.int32, np.float32, np.float64])
+ @pytest.mark.parametrize("input_dtype", [np.int32, np.float32, np.float64])
+ @pytest.mark.parametrize("sparse", [True, False])
+ def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
+                                                 sparse):
+     X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]], dtype=input_dtype)

  # Test that one hot encoder raises error for unknown features
I'm tired, but it's not clear to me how this is distinct from above.
Uhm, I did not see that, but it has an unnecessary test.
The only test required was: #11042 (comment)
lgtm.
jnothman
left a comment
I think this is good, but it might break someone's pipeline. Please add a what's new.
If that might be the case, then maybe it is not worth doing? They already have to switch to the new OneHotEncoder behaviour (assuming my PR gets merged, where dtype is already honoured when not using the legacy code), which will change this behaviour anyhow. That said, I don't care too much, and it's fine for me to merge this (it will give merge conflicts with my other PR, but the diff doesn't look that large, so that should be OK).
I think I'd rather merge than not.

Fine for me.
Please add an entry to the change log at
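For reference, a hedged sketch of what such a changelog entry could look like; the exact file, section heading, and wording used in scikit-learn's what's new are assumptions here, not taken from this PR:

```rst
:mod:`sklearn.preprocessing`

- :class:`preprocessing.OneHotEncoder` now outputs a sparse matrix with the
  requested ``dtype`` instead of silently upcasting to ``np.float64``.
  :issue:`11042` by :user:`DanielMorales9`.
```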
Thanks @DanielMorales9 !!!
Reference Issues/PRs
Original discussion at #11034
What does this implement/fix? Explain your changes.