Wrong infrequent categories and error in OrdinalEncoder #27088

@xuefeng-xu

Describe the bug

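When I pass categories to OrdinalEncoder manually as a NumPy object array, the fitted infrequent_categories_ is wrong, whereas OneHotEncoder with the same arguments gives the correct result.
Calling fit_transform on the OrdinalEncoder then raises an IndexError. See the code below.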

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

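# User-supplied categories with the missing value listed first; 'c' never appears in X
categories = [np.array([np.nan, 'b', 'c', 'a'], dtype=object)]
# One feature column: nan twice, 'b' twice, 'a' once
X = np.array([[np.nan] * 2 + ['b'] * 2 + ['a']], dtype=object).T

# Categories seen fewer than 2 times ('c' and 'a') should be grouped as infrequent
ohe = OneHotEncoder(categories=categories, min_frequency=2)
ode = OrdinalEncoder(categories=categories, min_frequency=2)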

ohe.fit(X)
ode.fit(X)

print('onehot', ohe.infrequent_categories_)
print('ordinal', ode.infrequent_categories_)

print(ohe.fit_transform(X))
print(ode.fit_transform(X))

Expected Results

onehot [array(['c', 'a'], dtype=object)]
ordinal [array(['c', 'a'], dtype=object)]
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 2)	1.0
[[nan]
 [nan]
 [0.]
 [0.]
 [1.]]
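
The expected infrequent categories can be checked by hand with plain Python, without either encoder. This is only a sketch of the counting rule, not scikit-learn's implementation. It assumes that the missing value is left out of the frequency check (consistent with nan being passed through as nan in the expected output); the helper is_nan and the variable names are made up for illustration.

```python
import numpy as np

# Same categories and column values as in the reproducer above
categories = np.array([np.nan, 'b', 'c', 'a'], dtype=object)
values = [np.nan, np.nan, 'b', 'b', 'a']
min_frequency = 2


def is_nan(value):
    # Hypothetical helper: True only for a float NaN
    return isinstance(value, float) and value != value


# Count occurrences of every non-missing category, in the order given by the user
counts = {}
for cat in categories:
    if is_nan(cat):
        continue  # assumption: the missing value is not a candidate for "infrequent"
    counts[cat] = sum(1 for v in values if v == cat)

# A category is infrequent when it appears fewer than min_frequency times
infrequent = [cat for cat, n in counts.items() if n < min_frequency]
print(counts)
print(infrequent)
```

Under these assumptions 'b' is the only frequent category, so 'c' and 'a' are the ones that should be grouped together, which matches the OneHotEncoder result.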

Actual Results

onehot [array(['c', 'a'], dtype=object)]
ordinal [array(['b', 'c'], dtype=object)]
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 2)	1.0
Traceback (most recent call last):
  File "tt.py", line 17, in <module>
    print(ode.fit_transform(X))
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/base.py", line 915, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 1573, in transform
    X_int, X_mask = self._transform(
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 236, in _transform
    self._map_infrequent_categories(X_int, X_mask, ignore_category_indices)
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 437, in _map_infrequent_categories
    X_int[rows_to_update, i] = np.take(mapping, X_int[rows_to_update, i])
  File "<__array_function__ internals>", line 200, in take
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 190, in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
  File "/Users/xxf/miniconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
IndexError: index 3 is out of bounds for axis 0 with size 3
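
The last frames of the traceback can be reproduced with NumPy alone. This is a sketch of the failing lookup only, not scikit-learn's code: the contents of mapping and codes below are hypothetical and chosen to match the sizes in the error message (four user-supplied categories, so integer codes run from 0 to 3, looked up in a table with only 3 entries).

```python
import numpy as np

# Four categories [nan, 'b', 'c', 'a'] give integer codes 0..3
n_categories = 4

# Hypothetical lookup table that is one entry short of n_categories
mapping = np.arange(n_categories - 1)

# Hypothetical integer codes for the rows 'b', 'b', 'a'
codes = np.array([1, 1, 3])

try:
    np.take(mapping, codes)  # default mode='raise' rejects out-of-range indices
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

The sketch does not explain why the table ends up with 3 entries inside OrdinalEncoder; it only shows that any code equal to 3 cannot be mapped through a table of that size.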

Versions

System:
    python: 3.8.16 (default, Jan 17 2023, 16:39:35)  [Clang 14.0.6 ]
executable: /Users/xxf/miniconda3/bin/python
   machine: macOS-13.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.24.2
        scipy: 1.10.1
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/xxf/miniconda3/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/xxf/miniconda3/lib/python3.8/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.21
threading_layer: pthreads
   architecture: armv8
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/xxf/miniconda3/lib/python3.8/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: armv8
    num_threads: 8
