[MRG+1] discrete branch: add encoding option to KBinsDiscretizer #9647
jnothman merged 20 commits into scikit-learn:discrete from qinhanmin2014:my-feature-1
Conversation
|
Thanks for this. Tests are failing, though |
|
Oh you mentioned that. A strange failure. Let me check what that branch is doing... |
|
So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly. |
|
@jnothman Thanks. Kindly inform me when you finish. :) |
|
Try pushing an empty commit |
|
@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :) |
|
@jnothman Finally CIs are green. |
| and return a sparse matrix. | ||
| onehot-dense: | ||
| encode the transformed result with one-hot encoding | ||
| and return a dense matrix. |
| encode the transformed result with one-hot encoding | ||
| and return a dense matrix. | ||
| ordinal: | ||
| do not encode the transformed result. |
Return the bin identifier encoded as an integer value.
|
@ogrisel Thanks. Comments addressed. |
lesteve
left a comment
Superficial review without looking into the heart of the code.
|
|
||
| valid_encode = ['onehot', 'onehot-dense', 'ordinal'] | ||
| if self.encode not in valid_encode: | ||
| raise ValueError('Invalid encode value. ' |
Add the value provided by the user in the error message, i.e. something like this:
"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)
| retain_order=True) | ||
|
|
||
| # only one-hot encode discretized features | ||
| mask = np.array([True] * X.shape[1]) |
Probably more readable to use np.repeat(True, X.shape[1])
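For readers following this thread, the three idioms discussed for building the all-True mask are equivalent; a quick standalone sketch (illustration only, not the PR's actual code):

```python
import numpy as np

n_features = 4

# The three idioms that came up in review; all produce the same
# all-True boolean mask. np.ones with dtype=bool is generally the
# fastest and the most common idiom.
mask_list = np.array([True] * n_features)
mask_repeat = np.repeat(True, n_features)
mask_ones = np.ones(n_features, dtype=bool)

assert (mask_list == mask_repeat).all()
assert (mask_repeat == mask_ones).all()
```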
| # don't support inverse_transform | ||
| if self.encode != 'ordinal': | ||
| raise ValueError("inverse_transform only support " | ||
| "encode='ordinal'.") |
Add the value of self.encode in the error message, e.g. . "Got {} instead"
| Column indices of ignored features. (Example: Categorical features.) | ||
| If ``None``, all features will be discretized. | ||
|
|
||
| encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal') |
I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.
@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.
From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections
When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:
order : {'C', 'F', 'A'}
Description of `order`.
| assert_raises(ValueError, est.inverse_transform, X) | ||
| est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], | ||
| encode='onehot').fit(X) | ||
| expected3 = est.transform(X) |
Should you not check that the output is sparse in the onehot case?
Also probably check that the output is not sparse in the onehot-dense case.
|
@lesteve Comments addressed except for the first one. Thanks. |
|
@lesteve Thanks. Comment addressed. |
|
I find the naming of both 'onehot' and 'onehot-dense' a bit fuzzy, since it is not explicit from the names that 'onehot' is sparse by default. Would it make sense to rename 'onehot' -> 'onehot-sparse'? |
|
Oh I see this is the same naming in the CategoricalEncoder PR. So that might be fine. |
| Column indices of ignored features. (Example: Categorical features.) | ||
| If ``None``, all features will be discretized. | ||
|
|
||
| encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal') |
| If ``None``, all features will be discretized. | ||
|
|
||
| encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal') | ||
| method used to encode the transformed result. |
| method used to encode the transformed result. | ||
|
|
||
| onehot: | ||
| encode the transformed result with one-hot encoding |
| encode the transformed result with one-hot encoding | ||
| and return a sparse matrix. | ||
| onehot-dense: | ||
| encode the transformed result with one-hot encoding |
| encode the transformed result with one-hot encoding | ||
| and return a dense array. | ||
| ordinal: | ||
| return the bin identifier encoded as an integer value. |
| """ | ||
| X = check_array(X, dtype='numeric') | ||
|
|
||
| valid_encode = ['onehot', 'onehot-dense', 'ordinal'] |
you might want to use a tuple instead of a list.
| if self.encode not in valid_encode: | ||
| raise ValueError("Valid options for 'encode' are {}. " | ||
| "Got 'encode = {}' instead." | ||
| .format(sorted(valid_encode), self.encode)) |
This seems unnecessary to sort.
| retain_order=True) | ||
|
|
||
| # only one-hot encode discretized features | ||
| mask = np.repeat(True, X.shape[1]) |
It would be faster to use:
mask = np.ones(X.shape[1], dtype=bool)
| assert_array_equal(Xt_expected, Xt) | ||
|
|
||
|
|
||
| def test_encode(): |
I would probably split this test into several tests
|
Maybe this will land up using the categorical encoder... but it doesn't
need to. For example, unary encoding doesn't make sense for categories, in
general, but it does for ordinals / ordered categories, such as those
produced by a discretizer. That is, the discretizer isn't producing
categorical features, it's producing ordinal features.
…On 3 September 2017 at 20:44, Guillaume Lemaitre ***@***.***> wrote:
***@***.**** commented on this pull request.
couple of suggestions. @jnothman <https://github.com/jnothman> this code
will be probably changing using the CategoricalEncoder I supposed?
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
+ encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
comma after }
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
+ encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
+ method used to encode the transformed result.
method -> Method
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
+ encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
+ method used to encode the transformed result.
+
+ onehot:
+ encode the transformed result with one-hot encoding
encode -> Encode
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
+ encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
+ method used to encode the transformed result.
+
+ onehot:
+ encode the transformed result with one-hot encoding
+ and return a sparse matrix.
+ onehot-dense:
+ encode the transformed result with one-hot encoding
encode -> Encode
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
+ encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
+ method used to encode the transformed result.
+
+ onehot:
+ encode the transformed result with one-hot encoding
+ and return a sparse matrix.
+ onehot-dense:
+ encode the transformed result with one-hot encoding
+ and return a dense array.
+ ordinal:
+ return the bin identifier encoded as an integer value.
return -> Return
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')
+ valid_encode = ['onehot', 'onehot-dense', 'ordinal']
you might want to use a tuple instead of a list.
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')
+ valid_encode = ['onehot', 'onehot-dense', 'ordinal']
+ if self.encode not in valid_encode:
+ raise ValueError("Valid options for 'encode' are {}. "
+ "Got 'encode = {}' instead."
+ .format(sorted(valid_encode), self.encode))
This seems unnecessary to sort.
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> Data in the binned space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
X = self._validate_X_post_fit(X)
- return _transform_selected(X, self._transform,
- self.transformed_features_, copy=True,
- retain_order=True)
+ Xt = _transform_selected(X, self._transform,
+ self.transformed_features_, copy=True,
+ retain_order=True)
+
+ # only one-hot encode discretized features
+ mask = np.repeat(True, X.shape[1])
It would be faster to use:
mask = np.ones(X.shape[1], dtype=bool)
------------------------------
In sklearn/preprocessing/tests/test_discretization.py
<#9647 (comment)>
:
> @@ -174,3 +176,40 @@ def test_numeric_stability():
X = X_init / 10**i
Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
assert_array_equal(Xt_expected, Xt)
+
+
+def test_encode():
I would probably split this test in several tests
|
|
@glemaitre Thanks. Comments addressed. |
doc/modules/preprocessing.rst
Outdated
| >>> est.bin_width_ | ||
| array([ 3., 1., 2.]) | ||
|
|
||
| By default the output is one-hot encoded into a sparse matrix (See :class:`OneHotEncoder`) |
I'd rather this referred to a section of the user guide which describes what this means, rather than provide a (in this context irrelevant) tool to do it.
@jnothman The only place I can think of is http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features. Could you please provide a more specific suggestion? Thanks.
That seems the right place to point to.
|
@jnothman I changed the link to Encoding categorical features. Now it looks like this: |
hlin117
left a comment
Overall looks great! Nice work.
| import numpy as np | ||
| from six.moves import range | ||
| import warnings | ||
| import scipy.sparse as sp |
Nit: Move this above six.moves, so it is alphabetical.
In fact you should use sklearn.externals.six to not have dependencies on six
| check_is_fitted(self, ["offset_", "bin_width_"]) | ||
|
|
||
| # currently, preprocessing.OneHotEncoder | ||
| # don't support inverse_transform |
| if self.ignored_features is not None: | ||
| mask[self.ignored_features] = False | ||
|
|
||
| if self.encode == 'onehot': |
Nit: You can make this more concise.
if self.encode == 'ordinal':
return Xt
# Only one-hot encode discretized features
mask = np.ones(X.shape[1], dtype=bool)
if self.ignored_features is not None:
mask[self.ignored_features] = False
encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
categorical_features='all'
if self.ignored_features is None else mask,
sparse=encode_sparse).fit_transform(Xt)
| """ | ||
| check_is_fitted(self, ["offset_", "bin_width_"]) | ||
|
|
||
| # currently, preprocessing.OneHotEncoder |
Nit: The convention of this file would be to capitalize comments. ("currently" -> "Currently")
| # currently, preprocessing.OneHotEncoder | ||
| # don't support inverse_transform | ||
| if self.encode != 'ordinal': | ||
| raise ValueError("inverse_transform only support " |
| est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2], | ||
| encode='onehot-dense').fit(X) | ||
| Xt_1 = est.transform(X) | ||
| Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4], |
Nit: You don't need to have Xt_2 be an array. A nested list should suffice.
I would also rename Xt_2 -> Xt_expected and Xt_1 -> Xt.
| [0, 1, 0, 1, 0, 0, 2.5, -3], | ||
| [0, 0, 1, 0, 1, 0, 3.5, -2], | ||
| [0, 0, 1, 0, 0, 1, 4.5, -1]]) | ||
| assert_array_equal(Xt_1, Xt_2) |
The convention is expected parameter first, and then actual.
|
Thanks for helping review, @hlin117!
…On 5 September 2017 at 11:03, Henry Lin ***@***.***> wrote:
***@***.**** commented on this pull request.
Overall looks great! Nice work.
------------------------------
In sklearn/preprocessing/tests/test_discretization.py
<#9647 (comment)>
:
> @@ -3,8 +3,10 @@
import numpy as np
from six.moves import range
import warnings
+import scipy.sparse as sp
Nit: Move this above six.moves, so it is alphabetical.
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
+
+ # currently, preprocessing.OneHotEncoder
+ # don't support inverse_transform
don't -> doesn't
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> Data in the binned space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
X = self._validate_X_post_fit(X)
- return _transform_selected(X, self._transform,
- self.transformed_features_, copy=True,
- retain_order=True)
+ Xt = _transform_selected(X, self._transform,
+ self.transformed_features_, copy=True,
+ retain_order=True)
+
+ # only one-hot encode discretized features
+ mask = np.ones(X.shape[1], dtype=bool)
+ if self.ignored_features is not None:
+ mask[self.ignored_features] = False
+
+ if self.encode == 'onehot':
Nit: You can make this more concise.
if self.encode == 'ordinal':
return Xt
# Only one-hot encode discretized features
mask = np.ones(X.shape[1], dtype=bool)
if self.ignored_features is not None:
mask[self.ignored_features] = False
encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
categorical_features='all'
if self.ignored_features is None else mask,
sparse=encode_sparse).fit_transform(Xt)
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
+
+ # currently, preprocessing.OneHotEncoder
Nit: The convention of this file would be to capitalize comments.
("currently" -> "Currently")
------------------------------
In sklearn/preprocessing/discretization.py
<#9647 (comment)>
:
> @@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
+
+ # currently, preprocessing.OneHotEncoder
+ # don't support inverse_transform
+ if self.encode != 'ordinal':
+ raise ValueError("inverse_transform only support "
support -> supports
------------------------------
In sklearn/preprocessing/tests/test_discretization.py
<#9647 (comment)>
:
> + assert_raises(ValueError, est.inverse_transform, Xt_2)
+ est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
+ encode='onehot').fit(X)
+ Xt_3 = est.transform(X)
+ assert sp.issparse(Xt_3)
+ assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True)
+ .fit_transform(Xt_1).toarray(),
+ Xt_3.toarray())
+ assert_raises(ValueError, est.inverse_transform, Xt_3)
+
+
+def test_one_hot_encode_with_ignored_features():
+ est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2],
+ encode='onehot-dense').fit(X)
+ Xt_1 = est.transform(X)
+ Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4],
Nit: You don't need to have Xt_2 be an array. A nested list should
suffice.
I would also rename Xt_2 -> Xt_expected and Xt_1 -> Xt.
------------------------------
In sklearn/preprocessing/tests/test_discretization.py
<#9647 (comment)>
:
> + assert sp.issparse(Xt_3)
+ assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True)
+ .fit_transform(Xt_1).toarray(),
+ Xt_3.toarray())
+ assert_raises(ValueError, est.inverse_transform, Xt_3)
+
+
+def test_one_hot_encode_with_ignored_features():
+ est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2],
+ encode='onehot-dense').fit(X)
+ Xt_1 = est.transform(X)
+ Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4],
+ [0, 1, 0, 1, 0, 0, 2.5, -3],
+ [0, 0, 1, 0, 1, 0, 3.5, -2],
+ [0, 0, 1, 0, 0, 1, 4.5, -1]])
+ assert_array_equal(Xt_1, Xt_2)
Conventions are expected parameter first, and then actual.
|
|
Thanks @hlin117. Comments addressed. |
|
@glemaitre Thanks. Comments addressed. |
|
I made a small pass on it. I have to check the artifacts, but I think it's LGTM from my side. |
| if self.ignored_features is not None: | ||
| mask[self.ignored_features] = False | ||
|
|
||
| encode_sparse = (self.encode == 'onehot') |
I would remove the parentheses
| mask[self.ignored_features] = False | ||
|
|
||
| encode_sparse = (self.encode == 'onehot') | ||
| return OneHotEncoder(n_values=np.array(self.n_bins_)[mask], |
self.n_bins_ is already an array, isn't it? I don't think that you need to pass it inside np.array
@glemaitre
Thanks. According to the original author, we allow users to pass a list as n_bins_. If we don't do the conversion, we get an error (list indices must be integers).
| """ | ||
| check_is_fitted(self, ["offset_", "bin_width_"]) | ||
|
|
||
| # Currently, preprocessing.OneHotEncoder |
If you can make your lines go up to 79 characters, that would be great.
|
|
||
| def test_invalid_encode_option(): | ||
| est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode') | ||
| assert_raises(ValueError, est.fit, X) |
Could you match the error message (or part of it) using assert_raises_regex
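For readers unfamiliar with it, assert_raises_regex (in sklearn.utils.testing at the time of this PR) checks both the exception type and its message; a minimal stdlib stand-in, with a hypothetical failing call for illustration:

```python
import re

def assert_raises_regex(exc_type, regex, func, *args, **kwargs):
    # Minimal stand-in for sklearn.utils.testing.assert_raises_regex:
    # the call must raise exc_type, and the message must match regex.
    try:
        func(*args, **kwargs)
    except exc_type as exc:
        if not re.search(regex, str(exc)):
            raise AssertionError("message %r does not match %r"
                                 % (str(exc), regex))
    else:
        raise AssertionError("%s not raised" % exc_type.__name__)

def bad_fit():
    # Hypothetical stand-in for est.fit(X) with an invalid encode value.
    raise ValueError("Valid options for 'encode' are "
                     "('onehot', 'onehot-dense', 'ordinal'). "
                     "Got encode='invalid-encode' instead.")

assert_raises_regex(ValueError, "Valid options for 'encode'", bad_fit)
```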
| assert not sp.issparse(Xt_2) | ||
| assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=False) | ||
| .fit_transform(Xt_1), Xt_2) | ||
| assert_raises(ValueError, est.inverse_transform, Xt_2) |
Could you match the error message (or part of it) using assert_raises_regex
In addition, I would group all the error checks inside a separate test.
|
|
||
|
|
||
| def test_encode_options(): | ||
| # test valid encode options through comparison |
| assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True) | ||
| .fit_transform(Xt_1).toarray(), | ||
| Xt_3.toarray()) | ||
| assert_raises(ValueError, est.inverse_transform, Xt_3) |
Could you match the error message (or part of it) using assert_raises_regex
|
I don't have a computer right now, but I thought I spotted that validate_n_bins was converting self.n_bins (which can be a list, int, or array) to self.n_bins_, which is a numpy array. Can you check that this is the case?
|
|
@glemaitre Thanks a lot. I removed np.array and all the comments are addressed. Sorry @jnothman for not recognizing earlier that your suggestions were right. |
|
Any update on it? Thanks :) |
|
I checked the artifact. Can you add an entry in the API documentation to link to the User Guide entry? For example: |
Codecov Report
@@ Coverage Diff @@
## discrete #9647 +/- ##
============================================
- Coverage 96.18% 96.18% -0.01%
============================================
Files 336 338 +2
Lines 63835 62408 -1427
============================================
- Hits 61402 60027 -1375
+ Misses 2433 2381 -52
Continue to review full report at Codecov.
|
|
@glemaitre Sorry to disturb you. It seems that the link to the user guide doesn't work, and I cannot figure out why. |
|
I think that's sufficient consensus... |
|
Thanks @qinhanmin2014 |
|
@jnothman It seems that the link to the user guide doesn't work and I cannot find the reason. Could you please help me? Thanks a lot :) |
|
I put the fix in #9705. I think the 'discretization' label was duplicated by the next section; it seems that labels are not case sensitive. I built locally and it was fine after renaming it. |
|
@glemaitre Thanks a lot for the fix :) |
Reference Issue
Fixes #9336
What does this implement/fix? Explain your changes.
Add an encoding option to KBinsDiscretizer.
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default value is set to 'ordinal', mainly because of (3).
(2) Only discretized features are one-hot encoded when ignored_features is set.
(According to OneHotEncoder, non-categorical features are always stacked to the right of the matrix.)
(3) It seems hard to support inverse_transform for one-hot output because OneHotEncoder doesn't support inverse_transform.
Any other comments?
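To illustrate what the three encode options mean, here is a hand-rolled single-feature sketch using only numpy (an illustration of the semantics, not the estimator's implementation; the bin edges are made up):

```python
import numpy as np

# One feature, three bins with made-up edges: (-inf, 1), [1, 2), [2, inf).
x = np.array([0.5, 1.5, 2.5, 3.5])
edges = np.array([1.0, 2.0])
n_bins = 3

# encode='ordinal': each value becomes its integer bin identifier.
ordinal = np.digitize(x, edges)      # array([0, 1, 2, 2])

# encode='onehot-dense': one indicator column per bin, dense array.
onehot_dense = np.eye(n_bins)[ordinal]

# encode='onehot' would return the same indicator matrix as a
# scipy.sparse matrix (omitted here to keep the sketch numpy-only).
print(ordinal)
print(onehot_dense)
```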