[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 #11042
Conversation
glemaitre
left a comment
There are 2 calls to _transform_selected which use the default dtype. Check whether any of those tests run into trouble.
sklearn/preprocessing/data.py
Outdated
- def _transform_selected(X, transform, selected="all", copy=True):
+ def _transform_selected(X, transform, dtype=np.float64, selected="all", copy=True):
I don't think that there is a reason to have a default dtype
sklearn/preprocessing/data.py
Outdated
  transform : callable
      A callable transform(X) -> X_transformed

+ dtype : number type, default=np.float
number type -> dtype, ...
Could you also change this parameter in the OneHotEncoder docstring
sklearn/preprocessing/data.py
Outdated
@@ -1872,9 +1875,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
      X_not_sel = X[:, ind[not_sel]]
Instead of changing the dtype below, I think that you only need to call astype(dtype) on X_not_sel.
The concatenation will then be done with arrays of the same type. You can add a small comment above the line:
"The columns of X which are not transformed need to be cast to the desired dtype before concatenation. Otherwise, the stacking will cast to the higher-precision dtype."
Feel free to shorten the comment
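For illustration, the upcasting behaviour that the suggested comment describes can be reproduced with plain NumPy (the shapes and dtypes here are arbitrary examples, not the encoder's actual data):

```python
import numpy as np

# Stacking an int8 block with a float64 block promotes the result
# to the higher-precision dtype, float64.
X_sel = np.ones((2, 2), dtype=np.int8)         # transformed columns
X_not_sel = np.ones((2, 2), dtype=np.float64)  # not-transformed columns
assert np.hstack([X_sel, X_not_sel]).dtype == np.float64

# Casting X_not_sel to the desired dtype first keeps the result there.
X_not_sel = X_not_sel.astype(np.int8)
assert np.hstack([X_sel, X_not_sel]).dtype == np.int8
```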
assert_equal(interact.powers_.shape, (interact.n_output_features_,
                                      interact.n_input_features_))
Please revert all the spacing changes. Even if they fix PEP8 issues, we tend not to modify unrelated parts of the code base, since that can create merge conflicts in other PRs. You can revert the other spacing changes below as well.
  _check_one_hot(X, X2, cat, 5)

+ def test_one_hot_encoder_mixed_input_given_type():
Could you use pytest.mark.parametrize to make a single test covering the different dtypes? Also use a bare assert instead of assert_equal. Basically something like this:
@pytest.mark.parametrize(
    "output_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "input_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "sparse",
    [True, False]
)
def test_one_hot_encoder_preserve_type(input_dtype, output_dtype, sparse):
    X = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=input_dtype)
    transformer = OneHotEncoder(categorical_features=[0, 1],
                                dtype=output_dtype, sparse=sparse)
    X_trans = transformer.fit_transform(X)
    assert X_trans.dtype == output_dtype
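The property this test checks can be illustrated without scikit-learn at all: a sparse matrix built with data of an explicit dtype keeps that dtype. This standalone SciPy sketch is only an illustration of the expected behaviour, not the encoder's code:

```python
import numpy as np
from scipy import sparse

# Build a 2x4 one-hot-style sparse matrix from int32 data:
# rows 0 and 1 have their single "hot" entry in columns 0 and 3.
rows = np.array([0, 1])
cols = np.array([0, 3])
data = np.ones(2, dtype=np.int32)
X_trans = sparse.csr_matrix((data, (rows, cols)), shape=(2, 4))
assert X_trans.dtype == np.int32  # dtype of `data` is preserved
```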
@DanielMorales9 Could you address the comments?

@glemaitre sure

I've added the requested changes. Sorry for the delay. I am happy to contribute 😄

The CI is failing, can you check?

@DanielMorales9 I made the change regarding PEP8. I am not sure that the error regarding the kmeans was related. This is strange.

@jnothman Could you have a look?
+ @pytest.mark.parametrize("output_dtype", [np.int32, np.float32, np.float64])
+ @pytest.mark.parametrize("input_dtype", [np.int32, np.float32, np.float64])
+ @pytest.mark.parametrize("sparse", [True, False])
+ def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
+                                                 sparse):
+     X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]], dtype=input_dtype)

  # Test that one hot encoder raises error for unknown features
I'm tired, but it's not clear to me how this is distinct from above.
Uhm, I did not see that, but it has an unnecessary test.
The only test required was: #11042 (comment)
lgtm.
jnothman
left a comment
I think this is good, but it might break someone's pipeline. Please add a what's new.
If that might be the case, then maybe it is not worth doing? They already have to switch to the new OneHotEncoder behaviour (assuming my PR gets merged, where dtype is already honoured when not using the legacy code), which will change this behaviour anyhow. That said, I don't care too much, and it's fine for me to merge this (it will give merge conflicts with my other PR, but the diff doesn't look that large, so that should be OK).
I think I'd rather merge than not.

Fine for me.
Please add an entry to the change log at
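For reference, a hedged sketch of what such a changelog entry could look like; the exact file, section heading, and wording used in scikit-learn's what's new are assumptions here, not taken from this PR:

```rst
:mod:`sklearn.preprocessing`

- :class:`preprocessing.OneHotEncoder` now outputs a sparse matrix with the
  requested ``dtype`` instead of silently upcasting to ``np.float64``.
  :issue:`11042` by :user:`DanielMorales9`.
```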
Thanks @DanielMorales9 !!!
Reference Issues/PRs
Original discussion at #11034
What does this implement/fix? Explain your changes.