Duplicate bins in KBinsDiscretizer #12774
Closed
Labels
Bug, Easy (well-defined and straightforward way to resolve), help wanted
Description
KBinsDiscretizer with strategy='quantile' produces duplicate bins when fitted on data that is not uniformly distributed.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Ten zeros (randint(1, ...) can only return 0) plus five draws from {0, 1, 2, 3}.
X1 = np.random.randint(1, size=(10, 1))
X2 = np.random.randint(4, size=(5, 1))
X = np.concatenate([X1, X2], axis=0)

# Defaults: n_bins=5, strategy='quantile'.
transformer = KBinsDiscretizer(encode='ordinal')
transformer.fit(X)
transformer.bin_edges_
# Output: array([array([0., 0., 0., 0., 1., 3.])], dtype=object)
transformer.transform(X)
# Output:
# array([[3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [3.],
# [4.],
# [4.],
# [4.],
# [4.],
# [4.]])
Actual Results
The first three bins are duplicates (zero width) and never appear in the output. Changing n_bins to 3 or 4 still produces duplicate bins that go unused.
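The behaviour can be reproduced with NumPy alone. The following is a minimal sketch, not the library's actual code: it assumes the quantile edges are evenly spaced percentiles and that bin codes come from np.digitize followed by clipping, and it ignores any tolerance handling. The fixed array x is a deterministic stand-in for the random data above.

```python
import numpy as np

# Deterministic stand-in for the random data in the report:
# eleven zeros plus the values 1, 1, 2, 3.
x = np.array([0.0] * 11 + [1.0, 1.0, 2.0, 3.0])

n_bins = 5
# Assumption: quantile edges are evenly spaced percentiles of the data.
edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
print(edges)  # [0. 0. 0. 0. 1. 3.]

# Assumption: codes come from digitizing against every edge after the first,
# then clipping so the maximum value lands in the last bin.
codes = np.clip(np.digitize(x, edges[1:]), 0, n_bins - 1)
print(np.unique(codes))  # [3 4]
```

Because more than 60% of the values are 0, the 20th, 40th and 60th percentiles all equal the minimum, so bins 0 to 2 have zero width and np.digitize skips straight past them.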
Expected Results
I understand that:
- The duplicate bins are present because the strategy used is 'quantile' and n_bins is fixed.
- The bins are not used in the output because the internal code is using numpy.isclose and numpy.digitize.
So would it be in scope to remove the duplicate bins after fitting and emit a warning?
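One possible shape for such a step is sketched below. The helper name drop_duplicate_edges, its feature_index parameter, and the warning text are hypothetical and only illustrate the idea; they are not part of scikit-learn's API.

```python
import warnings

import numpy as np


def drop_duplicate_edges(edges, feature_index=0):
    """Hypothetical helper: collapse repeated bin edges and warn if any were dropped."""
    edges = np.asarray(edges, dtype=float)
    unique_edges = np.unique(edges)  # sorted, duplicates removed
    n_removed = len(edges) - len(unique_edges)
    if n_removed:
        warnings.warn(
            f"Feature {feature_index}: removed {n_removed} duplicate bin edge(s); "
            f"{len(unique_edges) - 1} bin(s) remain.",
            UserWarning,
        )
    return unique_edges


edges = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 3.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_duplicate_edges(edges)
print(cleaned)  # [0. 1. 3.]

# Re-encode the same kind of data against the cleaned edges.
x = np.array([0.0] * 11 + [1.0, 1.0, 2.0, 3.0])
codes = np.clip(np.digitize(x, cleaned[1:]), 0, len(cleaned) - 2)
print(np.unique(codes))  # [0 1]
```

With the duplicates removed, the feature ends up with two bins instead of the requested five, and every remaining bin code is actually used. The warning tells the user that the effective number of bins differs from n_bins.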
Versions
System:
python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
executable: ~/anaconda3/.../bin/python
machine: Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: ~/anaconda3/.../lib
cblas_libs: mkl_rt, pthread
Python deps:
pip: 18.1
setuptools: 40.2.0
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.29
pandas: 0.23.4