Feature Request: Hellinger split criterion for classification trees #9947

@Gitman-code

Currently, tree classifiers such as sklearn.tree.DecisionTreeClassifier offer two options for the split criterion, "gini" and "entropy". Both are sensitive to class imbalance in the training data. I would like to request the addition of the Hellinger distance split criterion, which is insensitive to class imbalance. The method and its motivation are described in the following papers.

https://www.researchgate.net/publication/220451886_Hellinger_distance_decision_trees_are_robust_and_skew-insensitive
https://www.researchgate.net/publication/262225473_Hellinger_Distance_Trees_for_Imbalanced_Streams
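
To make the request concrete, here is a minimal pure-Python sketch of the two-class Hellinger split value as I understand it from the first paper. The function name `hellinger_split_value` and its count-based signature are hypothetical, chosen for illustration only; this is not scikit-learn API and not the proposed Cython implementation.

```python
import math


def hellinger_split_value(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the two class-conditional distributions
    over the branches of a binary split (hypothetical helper, two classes only).

    Each argument is the number of samples of that class sent to that branch.
    A larger value means the split separates the classes better.
    """
    n_pos = left_pos + right_pos
    n_neg = left_neg + right_neg
    if n_pos == 0 or n_neg == 0:
        # Assumption: with only one class present there is nothing to separate.
        return 0.0

    total = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        # Compare the fraction of each class that lands in this branch.
        total += (math.sqrt(pos / n_pos) - math.sqrt(neg / n_neg)) ** 2
    return math.sqrt(total)


# Same per-class branch fractions, but the second case has 100x more negatives.
balanced = hellinger_split_value(80, 20, 20, 80)
skewed = hellinger_split_value(80, 2000, 20, 8000)
```

Because the value depends only on the fraction of each class that falls in each branch, and not on the class priors, `balanced` and `skewed` come out equal. A split that separates the classes perfectly gives sqrt(2), and a split that sends the same fraction of both classes to each branch gives 0.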

The papers include example code (not Python). There are also implementations in other languages, for example:
https://www3.nd.edu/~dial//software.html

scikit-learn already has the "class_weight" parameter for handling imbalanced data, but Hellinger trees have been shown to be more useful than reweighting, so it makes sense to add the criterion.

This would require updates to:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pxd
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

as well as adding the option to each of the classifiers. These include:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.ensemble.RandomForestClassifier
