[MRG] DEP remove legacy mode from OneHotEncoder#13855

Merged
jnothman merged 8 commits into scikit-learn:master from glemaitre:depr/legacy_ohe on May 29, 2019

Conversation

@glemaitre
Member

Remove the legacy code from the OneHotEncoder. Since there are some old tests, I removed and merged some of them without changing what they are testing.
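
For context, a minimal sketch of what "legacy mode" refers to versus the categories-based behaviour that remains; the commented-out parameters are the deprecated ones from 0.20, and this PR removes them:

    from sklearn.preprocessing import OneHotEncoder

    # Legacy mode (removed here): integer inputs assumed to lie in
    # [0, n_values), selected via the deprecated n_values /
    # categorical_features parameters.
    # enc = OneHotEncoder(n_values='auto', categorical_features='all')

    # Remaining behaviour: categories are learned from the data.
    enc = OneHotEncoder(categories='auto')
    enc.fit([['a'], ['b'], ['a']])
    print(enc.categories_)  # [array(['a', 'b'], dtype=object)]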

@glemaitre
Member Author

@jorisvandenbossche could you have a look once this is MRG? You can already review the changes to the OHE source code; I am fixing the tests.

Member

@jorisvandenbossche left a comment

You're faster than me :-)

The changes in the OneHotEncoder itself look good; I didn't look at the tests yet.

@glemaitre
Member Author

I have one failing test in the common tests: check_fit_idempotent. The issue is that it provides continuous randn data as input, which does not suit the OneHotEncoder. Any idea what the best way to skip this test is: (i) add a tag, or (ii) skip within the check? I would be more in favor of (i).
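
For reference, option (i) could look like the following sketch on a toy transformer; MyCategoricalTransformer is hypothetical, and whether check_fit_idempotent actually honours such a tag is an assumption here:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MyCategoricalTransformer(TransformerMixin, BaseEstimator):
        """Toy transformer used only to illustrate option (i)."""

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X

        def _more_tags(self):
            # Declare categorical input so common checks that feed
            # continuous randn data can skip or adapt.
            return {'X_types': ['categorical']}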

@glemaitre changed the title from [WIP] DEP remove legacy mode from OneHotEncoder to [MRG] DEP remove legacy mode from OneHotEncoder on May 10, 2019
@NicolasHug
Member

OneHotEncoder is supposed to work with continuous data too (or at least with floats).

I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().
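
A sketch of that fix; the real set_checking_parameters in sklearn/utils/estimator_checks.py special-cases many estimators, and this shows only the branch that would be added:

    def set_checking_parameters(estimator):
        # ... existing special cases ...
        if estimator.__class__.__name__ == 'OneHotEncoder':
            # Unseen float "categories" at transform time should be
            # ignored rather than raise, so the common checks can run.
            estimator.set_params(handle_unknown='ignore')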

Member

@NicolasHug left a comment

Please also remove this part from the docstring:

The OneHotEncoder previously assumed that the input features take on values in the range [0, max(values)). This behaviour is deprecated

LGTM when tests are green

@glemaitre
Member Author

OneHotEncoder is supposed to work with continuous data too (or at least with floats). I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().

I did that but I got some other failures.

@glemaitre
Member Author

=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.7.2, pytest-3.8.1, py-1.6.0, pluggy-0.7.1 -- /home/glemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/glemaitre/Documents/packages/scikit-learn, inifile: setup.cfg
plugins: cov-2.6.0
collected 5703 items / 5671 deselected                                                                                                                                                     

sklearn/tests/test_common.py::test_parameters_default_constructible[OneHotEncoder-OneHotEncoder] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_dtypes] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_score_takes_y] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_pandas_series] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_list] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self(readonly_memmap=True)] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_complex_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dtype_object] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_empty_data_messages] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_pipeline_consistency] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_nan_inf] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_overwrite_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimator_sparse_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_pickle] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_data_not_an_array] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformers_unfitted] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_n_iter] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_predict1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_methods_subset_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1sample] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1feature] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_get_params_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_set_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dict_unchanged] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dont_overwrite_parameters] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_idempotent] PASSED
sklearn/tests/test_common.py::test_no_attributes_set_in_init[OneHotEncoder-estimator118] PASSED

========================================================================================= FAILURES =========================================================================================
_________________________________________________________________ test_estimators[OneHotEncoder-check_transformer_general] _________________________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = <function check_transformer_general at 0x7f12cc9b7268>

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = <function check_transformer_general at 0x7f12cc9b7268>
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
______________________________________________________ test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] ______________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
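
In short, check_transformer_general fits on X and then calls transformer.transform(X.T), so the number of features changes between fit and transform and self.categories_[i] runs out of entries. A minimal reproduction of the log above (a sketch):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.random.randn(30, 3)  # 30 samples, 3 features
    enc = OneHotEncoder(handle_unknown='ignore').fit(X)
    # X.T has 30 features but categories_ holds only 3 entries, so the
    # loop over features indexes past the end: IndexError without an
    # explicit n_features check.
    enc.transform(X.T)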

@NicolasHug
Member

You need something like

        if n_features != len(self.categories_):
            raise ValueError("OOPS")

in transform() now
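
A fuller sketch of that check; the helper name and error message here are hypothetical, and the actual wording added to scikit-learn may differ:

    def _check_n_features(categories_, n_features):
        # Guard transform() against input whose width differs from fit.
        if n_features != len(categories_):
            raise ValueError(
                "The number of features in X ({}) is different from the "
                "number of features seen during fit ({}).".format(
                    n_features, len(categories_)))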

Member

@thomasjpfan left a comment

Looks good after the following is removed:

The OneHotEncoder previously assumed that the input features take on
values in the range [0, max(values)). This behaviour is deprecated.

Member

@jorisvandenbossche left a comment

Apart from the aforementioned doc fix and a minor nit (that can be ignored), looks good to me as well

@glemaitre
Member Author

I think this is ready to be merged; I addressed the remaining issues.

@jnothman merged commit 9ee164b into scikit-learn:master on May 29, 2019
@jnothman
Member

Thank you, @glemaitre
