[MRG] DEP remove legacy mode from OneHotEncoder#13855
[MRG] DEP remove legacy mode from OneHotEncoder#13855jnothman merged 8 commits intoscikit-learn:masterfrom
Conversation
|
@jorisvandenbossche could you have a look when this is MRG. You can already check the change within the OHE source code. I am fixing the tests. |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
You're faster than me :-)
The changes in the OneHotEncoder itself look good, didn't look at the tests yet.
|
I have one failing test in the common tests: |
|
OneHotEncoder is supposed to work with continuous data too (or at least with floats). I think the right fix is to set |
I did that but I got some other failures. |
=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.7.2, pytest-3.8.1, py-1.6.0, pluggy-0.7.1 -- /home/glemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/glemaitre/Documents/packages/scikit-learn, inifile: setup.cfg
plugins: cov-2.6.0
collected 5703 items / 5671 deselected
sklearn/tests/test_common.py::test_parameters_default_constructible[OneHotEncoder-OneHotEncoder] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_dtypes] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_score_takes_y] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_pandas_series] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_list] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self(readonly_memmap=True)] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_complex_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dtype_object] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_empty_data_messages] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_pipeline_consistency] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_nan_inf] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_overwrite_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimator_sparse_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_pickle] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_data_not_an_array] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformers_unfitted] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_n_iter] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_predict1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_methods_subset_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1sample] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1feature] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_get_params_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_set_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dict_unchanged] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dont_overwrite_parameters] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_idempotent] PASSED
sklearn/tests/test_common.py::test_no_attributes_set_in_init[OneHotEncoder-estimator118] PASSED
========================================================================================= FAILURES =========================================================================================
_________________________________________________________________ test_estimators[OneHotEncoder-check_transformer_general] _________________________________________________________________
estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
check = <function check_transformer_general at 0x7f12cc9b7268>
@pytest.mark.parametrize(
"estimator, check",
_generate_checks_per_estimator(_yield_all_checks,
_tested_estimators()),
ids=_rename_partial
)
def test_estimators(estimator, check):
# Common tests for estimator instances
with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
UserWarning, FutureWarning)):
set_checking_parameters(estimator)
name = estimator.__class__.__name__
> check(name, estimator)
check = <function check_transformer_general at 0x7f12cc9b7268>
estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
name = 'OneHotEncoder'
sklearn/tests/test_common.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:355: in wrapper
return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
_check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
X = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'
def _transform(self, X, handle_unknown='error'):
X_list, n_samples, n_features = self._check_X(X)
X_int = np.zeros((n_samples, n_features), dtype=np.int)
X_mask = np.ones((n_samples, n_features), dtype=np.bool)
for i in range(n_features):
Xi = X_list[i]
> diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
return_mask=True)
E IndexError: list index out of range
X = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int = array([[24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
X_list = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask = array([[ True, False, False, True, True, True, True, True, True,
True, True, True, True, True, Tru...ue, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True]])
Xi = array([2.16073928, 2.58788791, 2.34342981])
_ = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff = [0.44898091995707645, 0.4992156950294815]
encoded = array([ 0, 0, 13])
handle_unknown = 'ignore'
i = 3
n_features = 30
n_samples = 3
self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
valid_mask = array([False, False, True])
sklearn/preprocessing/_encoders.py:115: IndexError
______________________________________________________ test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] ______________________________________________________
estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
check = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)
@pytest.mark.parametrize(
"estimator, check",
_generate_checks_per_estimator(_yield_all_checks,
_tested_estimators()),
ids=_rename_partial
)
def test_estimators(estimator, check):
# Common tests for estimator instances
with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
UserWarning, FutureWarning)):
set_checking_parameters(estimator)
name = estimator.__class__.__name__
> check(name, estimator)
check = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)
estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
name = 'OneHotEncoder'
sklearn/tests/test_common.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:355: in wrapper
return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
_check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
X = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'
def _transform(self, X, handle_unknown='error'):
X_list, n_samples, n_features = self._check_X(X)
X_int = np.zeros((n_samples, n_features), dtype=np.int)
X_mask = np.ones((n_samples, n_features), dtype=np.bool)
for i in range(n_features):
Xi = X_list[i]
> diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
return_mask=True)
E IndexError: list index out of range
X = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int = array([[24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
X_list = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask = array([[ True, False, False, True, True, True, True, True, True,
True, True, True, True, True, Tru...ue, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True]])
Xi = array([2.16073928, 2.58788791, 2.34342981])
_ = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff = [0.44898091995707645, 0.4992156950294815]
encoded = array([ 0, 0, 13])
handle_unknown = 'ignore'
i = 3
n_features = 30
n_samples = 3
self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
handle_unknown='ignore', sparse=True)
valid_mask = array([False, False, True])
sklearn/preprocessing/_encoders.py:115: IndexError |
|
You need something like in |
thomasjpfan
left a comment
There was a problem hiding this comment.
Looks good after the following is removed:
scikit-learn/sklearn/preprocessing/_encoders.py
Lines 166 to 167 in 5fff7ab
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Apart from the aforementioned doc fix and a minor nit (that can be ignored), looks good to me as well
|
I think this is good to be merged. I addressed the last issues. |
|
Thank you, @glemaitre |
Remove the legacy code from the OneHotEncoder. Since that there is some old tests, I remove and merge some of them without changing what they are testing.