Duplicate bins in KBinsDiscretizer #12774

@vivekk0903

Description

KBinsDiscretizer with strategy='quantile' produces duplicate bin edges when fitted on data that is not uniformly distributed.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# randint(1, ...) can only return 0, so X1 is a column of ten zeros
X1 = np.random.randint(1, size=(10, 1))
# X2 holds five random integers drawn from {0, 1, 2, 3}
X2 = np.random.randint(4, size=(5, 1))

X = np.concatenate([X1, X2], axis=0)

# Defaults are n_bins=5 and strategy='quantile'
transformer = KBinsDiscretizer(encode='ordinal')
transformer.fit(X)

transformer.bin_edges_
# Output: array([array([0., 0., 0., 0., 1., 3.])], dtype=object)

transformer.transform(X)
# Output: 
# array([[3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [3.],
#        [4.],
#        [4.],
#        [4.],
#        [4.],
#        [4.]])

Actual Results

The first three bins are duplicates and never appear in the output. Setting n_bins to 3 or 4 does not help: duplicate bins are still generated and still go unused.

Expected Results

I understand that:

  1. The duplicate bins are present because the 'strategy' used is 'quantile' and n_bins is fixed.
  2. The bins are not used in output because the internal code is using numpy.isclose and numpy.digitize.

So would it be possible to remove the duplicate bins after fitting, and emit a warning when that happens?
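To illustrate the two points above and what the removal could look like, here is a minimal NumPy-only sketch. It is not scikit-learn's internal code: it assumes the quantile edges are computed with something like np.percentile, and it uses a fixed array in place of the random data so the result is repeatable. The names X_demo, edges, unique_edges and codes are only for this example.

```python
import warnings

import numpy as np

# Fixed stand-in for the random data in the report:
# ten zeros followed by five integers in [0, 4).
X_demo = np.array([0] * 10 + [1, 1, 1, 2, 3], dtype=float)

n_bins = 5

# Quantile edges at 0%, 20%, ..., 100% (assumed to mirror strategy='quantile').
edges = np.percentile(X_demo, np.linspace(0, 100, n_bins + 1))
print(edges)  # [0. 0. 0. 0. 1. 3.]

# Collapse repeated edges and warn, as proposed in this issue.
unique_edges = np.unique(edges)
if unique_edges.size < edges.size:
    warnings.warn(
        "Removed %d duplicate bin edge(s); effective number of bins is %d."
        % (edges.size - unique_edges.size, unique_edges.size - 1)
    )
print(unique_edges)  # [0. 1. 3.]

# Assign each value to a bin using only the interior edges.
codes = np.digitize(X_demo, unique_edges[1:-1])
print(codes)  # [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
```

With the duplicates dropped, the ordinal codes run from 0 to 1 instead of using only 3 and 4, so every remaining bin actually receives samples.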

Versions

System:
    python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)  [GCC 7.2.0]
executable: ~/anaconda3/.../bin/python
   machine: Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: ~/anaconda3/.../lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.1
setuptools: 40.2.0
   sklearn: 0.20.1
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.29
    pandas: 0.23.4

Metadata

Labels

Bug, Easy (well-defined and straightforward way to resolve), help wanted
