DecisionTreeClassifier doesn't distinguish between numerical and categorical data #12398
Description
The DecisionTreeClassifier's User Guide says that
scikit-learn uses an optimised version of the CART algorithm
but the original CART algorithm distinguishes between categorical and continuous variables. This is explained in Breiman's original definition of CART, Classification and Regression Trees,
and also on Wikipedia.
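To make the distinction concrete, here is a minimal pure-Python sketch (not scikit-learn's implementation, and not Breiman's exact procedure) that compares the two kinds of split on a made-up toy feature. The helper names `best_threshold_split` and `best_subset_split`, the colour data, and the integer encoding are all assumptions for illustration. A numeric split tests `x <= t` on the encoded values, while a categorical split in the CART sense tests membership in a subset of categories.

```python
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def weighted_gini(left, right):
    """Impurity of a split, weighted by the size of each child."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_threshold_split(values, labels):
    """Numeric rule `x <= t`: try midpoints between sorted distinct values."""
    best_score, best_t = float("inf"), None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if score < best_score:
            best_score, best_t = score, t
    return best_score, best_t


def best_subset_split(values, labels):
    """Categorical rule `x in S`: try every proper non-empty subset S."""
    best_score, best_subset = float("inf"), None
    cats = sorted(set(values))
    for k in range(1, len(cats)):
        for subset in combinations(cats, k):
            left = [y for x, y in zip(values, labels) if x in subset]
            right = [y for x, y in zip(values, labels) if x not in subset]
            score = weighted_gini(left, right)
            if score < best_score:
                best_score, best_subset = score, set(subset)
    return best_score, best_subset


# Hypothetical toy data: the class depends only on whether the colour is green.
colours = ["red", "green", "blue", "red", "green", "blue"]
labels = [1, 0, 1, 1, 0, 1]

# Arbitrary integer encoding, as one would need before passing floats to a tree.
encoding = {"red": 0, "green": 1, "blue": 2}
codes = [encoding[c] for c in colours]

threshold_score, threshold = best_threshold_split(codes, labels)
subset_score, subset = best_subset_split(colours, labels)

print("threshold split:", threshold, "impurity:", threshold_score)
print("subset split:", subset, "impurity:", subset_score)
```

With this encoding, no single threshold isolates the middle category, so the numeric split leaves impure children, whereas the subset split separates the classes in one step. A tree restricted to threshold splits can still reach the same partition, but only with additional splits and in a way that depends on the chosen encoding.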
According to the documentation, the training input samples are converted to np.float32
The training input samples. Internally, it will be converted to
dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
So categorical variables are clearly not supported.
I don't know whether this is a problem with the algorithm's implementation or the intended behaviour, in which case the documentation should be changed.
Steps/Code to Reproduce
Expected Results
Either the documentation should stop saying that
scikit-learn uses an optimised version of the CART algorithm
or the implementation should distinguish between categorical and numerical variables.
Actual Results
The documentation states that CART is used, but that is not true: categorical variables are not handled.
Versions
pip: 9.0.1
setuptools: 40.4.3
sklearn: 0.20.0
numpy: 1.15.2
scipy: 1.1.0
Cython: None
pandas: 0.23.4