DecisionTreeClassifier doesn't distinguish between numerical and categorical data #12398

@ribes96

Description

The User Guide for DecisionTreeClassifier says that

scikit-learn uses an optimised version of the CART algorithm

but the original CART algorithm distinguishes between categorical and continuous variables. This is explained in Breiman's original definition of CART, Classification and Regression Trees, and also on Wikipedia.
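
To make the distinction concrete, here is a minimal pure-Python sketch comparing the two kinds of split on a single feature: a numeric threshold split (x <= t) and a CART-style categorical subset split (x in S). The toy data and the helper names (gini, weighted_gini, best_threshold_split, best_subset_split) are hypothetical and only for illustration. The exhaustive subset search is the naive version, not an optimised implementation.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def weighted_gini(left, right):
    """Size-weighted impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_threshold_split(xs, ys):
    """Best split of the form x <= t (feature treated as numeric)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0  # midpoint between consecutive values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = weighted_gini(left, right)
        if best is None or score < best[0]:
            best = (score, t)
    return best

def best_subset_split(xs, ys):
    """Best split of the form x in S (feature treated as categorical)."""
    cats = sorted(set(xs))
    best = None
    for k in range(1, len(cats)):
        for subset in combinations(cats, k):
            left = [y for x, y in zip(xs, ys) if x in subset]
            right = [y for x, y in zip(xs, ys) if x not in subset]
            score = weighted_gini(left, right)
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best

# Hypothetical data: categories coded 0, 1, 2; only category 1 is class 0.
xs = [0, 0, 1, 1, 2, 2]
ys = [1, 1, 0, 0, 1, 1]

threshold_score, threshold = best_threshold_split(xs, ys)
subset_score, subset = best_subset_split(xs, ys)
print("threshold split:", threshold, "impurity", threshold_score)
print("subset split:   ", subset, "impurity", subset_score)
```

On this data no single threshold isolates the middle category, so the threshold split leaves one impure child, while the subset split {1} versus {0, 2} separates the classes in one step.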

According to the documentation, the training input samples are converted to np.float32:

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

So it clearly does not support categorical variables.

I don't know whether this is a problem with the algorithm's implementation or the intended behaviour, in which case the documentation should be changed.

Steps/Code to Reproduce
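
A minimal sketch of how the behaviour described above could be observed, assuming scikit-learn and NumPy are installed. The toy data and variable names are hypothetical. A depth-1 tree is fitted on an integer-coded categorical feature, and a second fit is attempted on string categories.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature, integer-coded as 0, 1, 2.
# Only the middle category (1) belongs to class 0.
X = np.array([[0], [0], [1], [1], [2], [2]])
y = np.array([1, 1, 0, 0, 1, 1])

# A single split is a numeric threshold on the float-converted codes,
# so it cannot separate category 1 from categories {0, 2}.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
root_threshold = stump.tree_.threshold[0]
stump_accuracy = stump.score(X, y)
print("root threshold:", root_threshold)
print("training accuracy of the stump:", stump_accuracy)

# String categories cannot be converted to np.float32 and are rejected.
X_str = np.array([["red"], ["red"], ["green"], ["green"], ["blue"], ["blue"]])
try:
    DecisionTreeClassifier().fit(X_str, y)
    strings_accepted = True
except (ValueError, TypeError) as exc:
    strings_accepted = False
    print("fit on strings failed:", exc)
```

A deeper tree can still fit this data by stacking two thresholds, but each split remains an ordered comparison on the numeric codes.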

Expected Results

Either the documentation should stop saying that

scikit-learn uses an optimised version of the CART algorithm

or the implementation should distinguish between categorical and numerical variables.

Actual Results

The documentation states that CART is used, but the implementation does not handle categorical variables the way CART does.

Versions

pip: 9.0.1
setuptools: 40.4.3
sklearn: 0.20.0
numpy: 1.15.2
scipy: 1.1.0
Cython: None
pandas: 0.23.4