[MRG] DEP remove legacy mode from OneHotEncoder#13855

Merged
jnothman merged 8 commits into scikit-learn:master from glemaitre:depr/legacy_ohe on May 29, 2019

Conversation

@glemaitre
Member

Remove the legacy code from the OneHotEncoder. Since there are some old tests, I removed and merged some of them without changing what they are testing.
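
For context, a minimal sketch of what "legacy mode" refers to versus the categories-based behaviour that remains; the commented-out parameters are the deprecated ones from 0.20, and this PR removes them:

    from sklearn.preprocessing import OneHotEncoder

    # Legacy mode (removed here): integer inputs assumed to lie in
    # [0, n_values), selected via the deprecated n_values /
    # categorical_features parameters.
    # enc = OneHotEncoder(n_values='auto', categorical_features='all')

    # Remaining behaviour: categories are learned from the data.
    enc = OneHotEncoder(categories='auto')
    enc.fit([['a'], ['b'], ['a']])
    print(enc.categories_)  # [array(['a', 'b'], dtype=object)]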

@glemaitre
Member Author

@jorisvandenbossche could you have a look once this is MRG? You can already review the changes to the OHE source code; I am fixing the tests.

Member

@jorisvandenbossche left a comment

You're faster than me :-)

The changes in the OneHotEncoder itself look good; I didn't look at the tests yet.

@glemaitre
Member Author

I have one failing test in the common tests: check_fit_idempotent. The issue is that it provides continuous randn data as input, which does not suit the OneHotEncoder. Any idea what the best way to skip this test is: (i) add a tag, or (ii) skip within the check? I would be more in favor of (i).
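
For reference, option (i) could look like the following sketch on a toy transformer; MyCategoricalTransformer is hypothetical, and whether check_fit_idempotent actually honours such a tag is an assumption here:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MyCategoricalTransformer(TransformerMixin, BaseEstimator):
        """Toy transformer used only to illustrate option (i)."""

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X

        def _more_tags(self):
            # Declare categorical input so common checks that feed
            # continuous randn data can skip or adapt.
            return {'X_types': ['categorical']}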

@glemaitre changed the title from [WIP] DEP remove legacy mode from OneHotEncoder to [MRG] DEP remove legacy mode from OneHotEncoder on May 10, 2019
@NicolasHug
Member

OneHotEncoder is supposed to work with continuous data too (or at least with floats).

I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().
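
A sketch of that fix; the real set_checking_parameters in sklearn/utils/estimator_checks.py special-cases many estimators, and this shows only the branch that would be added:

    def set_checking_parameters(estimator):
        # ... existing special cases ...
        if estimator.__class__.__name__ == 'OneHotEncoder':
            # Unseen float "categories" at transform time should be
            # ignored rather than raise, so the common checks can run.
            estimator.set_params(handle_unknown='ignore')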

Member

@NicolasHug left a comment

Please also remove this part from the docstring:

The OneHotEncoder previously assumed that the input features take on values in the range [0, max(values)). This behaviour is deprecated

LGTM when tests are green

@glemaitre
Member Author

OneHotEncoder is supposed to work with continuous data too (or at least with floats). I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().

I did that but I got some other failures.

@glemaitre
Member Author

=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.7.2, pytest-3.8.1, py-1.6.0, pluggy-0.7.1 -- /home/glemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/glemaitre/Documents/packages/scikit-learn, inifile: setup.cfg
plugins: cov-2.6.0
collected 5703 items / 5671 deselected                                                                                                                                                     

sklearn/tests/test_common.py::test_parameters_default_constructible[OneHotEncoder-OneHotEncoder] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_dtypes] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_score_takes_y] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_pandas_series] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_list] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self(readonly_memmap=True)] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_complex_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dtype_object] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_empty_data_messages] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_pipeline_consistency] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_nan_inf] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_overwrite_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimator_sparse_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_pickle] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_data_not_an_array] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformers_unfitted] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_n_iter] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_predict1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_methods_subset_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1sample] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1feature] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_get_params_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_set_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dict_unchanged] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dont_overwrite_parameters] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_idempotent] PASSED
sklearn/tests/test_common.py::test_no_attributes_set_in_init[OneHotEncoder-estimator118] PASSED

========================================================================================= FAILURES =========================================================================================
_________________________________________________________________ test_estimators[OneHotEncoder-check_transformer_general] _________________________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = <function check_transformer_general at 0x7f12cc9b7268>

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = <function check_transformer_general at 0x7f12cc9b7268>
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
______________________________________________________ test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] ______________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
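
In short, check_transformer_general fits on X and then calls transformer.transform(X.T), so the number of features changes between fit and transform and self.categories_[i] runs out of entries. A minimal reproduction of the log above (a sketch):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.random.randn(30, 3)  # 30 samples, 3 features
    enc = OneHotEncoder(handle_unknown='ignore').fit(X)
    # X.T has 30 features but categories_ holds only 3 entries, so the
    # loop over features indexes past the end: IndexError without an
    # explicit n_features check.
    enc.transform(X.T)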

@NicolasHug
Member

You need something like

        if n_features != len(self.categories_):
            raise ValueError("OOPS")

in transform() now
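
A fuller sketch of that check; the helper name and error message here are hypothetical, and the actual wording added to scikit-learn may differ:

    def _check_n_features(categories_, n_features):
        # Guard transform() against input whose width differs from fit.
        if n_features != len(categories_):
            raise ValueError(
                "The number of features in X ({}) is different from the "
                "number of features seen during fit ({}).".format(
                    n_features, len(categories_)))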

Member

@thomasjpfan left a comment

Looks good after the following is removed:

The OneHotEncoder previously assumed that the input features take on
values in the range [0, max(values)). This behaviour is deprecated.

Member

@jorisvandenbossche left a comment

Apart from the aforementioned doc fix and a minor nit (that can be ignored), looks good to me as well

@glemaitre
Member Author

I think this is ready to be merged; I addressed the remaining issues.

@jnothman merged commit 9ee164b into scikit-learn:master on May 29, 2019
@jnothman
Member

Thank you, @glemaitre
