Feature Request: Hellinger split criterion for classification trees #9947
Description
Currently, tree classifiers such as sklearn.tree.DecisionTreeClassifier offer two options for the split criterion: "gini" and "entropy". Both are sensitive to class imbalance. I would like to request the addition of the Hellinger distance split criterion, which the papers below describe as skew-insensitive. The motivation and details are outlined in the following papers:
https://www.researchgate.net/publication/220451886_Hellinger_distance_decision_trees_are_robust_and_skew-insensitive
https://www.researchgate.net/publication/262225473_Hellinger_Distance_Trees_for_Imbalanced_Streams
The papers contain example code (non-Python). There are also some implementations in other languages, for example:
https://www3.nd.edu/~dial//software.html
We already have the "class_weight" parameter, but Hellinger trees have been shown to be more useful than reweighting, so it makes sense to add the criterion.
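To make the request concrete, here is a minimal pure-Python sketch of the two-class Hellinger split value as I understand it from the first paper linked above. The function name and signature are hypothetical, and this is only an illustration of the formula, not the Cython code that would go into scikit-learn. For a binary split it compares the fraction of positives sent to each branch with the fraction of negatives sent to each branch.

```python
import math


def hellinger_split_value(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the per-class branch distributions.

    Hypothetical helper for illustration only. Arguments are the counts
    of positive/negative samples falling into the left/right branch.
    """
    n_pos = left_pos + right_pos
    n_neg = left_neg + right_neg
    if n_pos == 0 or n_neg == 0:
        # Only one class present: the split cannot separate anything.
        return 0.0
    total = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        # Each term uses within-class fractions, never the class priors.
        total += (math.sqrt(pos / n_pos) - math.sqrt(neg / n_neg)) ** 2
    return math.sqrt(total)


balanced = hellinger_split_value(30, 10, 10, 30)
# Same split with 100x more negatives: the within-class fractions are unchanged.
skewed = hellinger_split_value(30, 1000, 10, 3000)
print(balanced, skewed)
```

Because only within-class fractions enter the formula, multiplying the size of one class leaves the value unchanged, which is the skew-insensitivity the papers argue for. A higher value means a better split, so the best candidate split is the one that maximizes it.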
Adding the criterion would require updates to:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pxd
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx
as well as adding the option to each of the classifiers. These include:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.ensemble.RandomForestClassifier