DecisionTreeClassifier should be deterministic for default parameters or documented indicating otherwise #8443
Description
The default parameters for DecisionTreeClassifier do not indicate randomness, yet randomness is present, which can result in unexpected behavior. The random_state can of course be set to obtain reproducible results when max_features < n_features or splitter = 'random'. However, with the default settings (i.e., max_features = None and splitter = 'best'), the algorithm is expected to behave deterministically but does not (see example below).
This appears to be a result of a design choice here where even if max_features = n_features, the algorithm still randomly samples up to max_features.
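To illustrate why the visiting order can matter even when every feature is considered, here is a toy sketch. It is not scikit-learn's code: `best_feature` and the `gains` values are hypothetical, and the sketch assumes that a candidate split only replaces the current best when it is strictly better, so that ties go to whichever feature happens to be visited first.

```python
import random


def best_feature(gains, rng):
    """Return the index of the highest-gain feature, visiting features in random order."""
    order = list(range(len(gains)))
    rng.shuffle(order)  # stand-in for the per-split random feature sampling
    best_idx, best_gain = None, float("-inf")
    for idx in order:
        # Strict comparison: among equally good features, the first one visited wins.
        if gains[idx] > best_gain:
            best_idx, best_gain = idx, gains[idx]
    return best_idx


# Hypothetical split gains in which features 1 and 2 are exactly tied.
gains = [0.10, 0.42, 0.42, 0.05]

# Collect the winning feature for a range of seeds.
winners = {best_feature(gains, random.Random(seed)) for seed in range(50)}
print(winners)
```

Under these assumptions the winner is always one of the tied features, but which one depends on the seed, which is consistent with two classifiers differing only in random_state producing different trees.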
This can result in different classifiers that have been trained on identical data. This behavior is unexpected when max_features = n_features and splitter != 'random'. The easiest fix is probably just to document this behavior, while the more involved fix would be to augment the code to not sample randomly at each split when max_features = n_features.
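Until the behavior is documented or changed, pinning random_state is the workaround. The following is a minimal sketch, assuming a scikit-learn version that supports `load_iris(return_X_y=True)`; it fits two classifiers with the same seed on the same data and compares the fitted trees through the `tree_.feature` and `tree_.threshold` arrays.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)

# Same data and same seed for both classifiers.
a = DecisionTreeClassifier(random_state=0).fit(x, y)
b = DecisionTreeClassifier(random_state=0).fit(x, y)

# Compare the split feature and threshold chosen at every node.
same_structure = (
    np.array_equal(a.tree_.feature, b.tree_.feature)
    and np.array_equal(a.tree_.threshold, b.tree_.threshold)
)
print(same_structure)
```

This only demonstrates reproducibility for a fixed seed; it does not make the default configuration independent of the seed.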
Steps/Code to Reproduce
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
x,y = iris.data, iris.target
dtc1 = DecisionTreeClassifier(random_state=1)
dtc2 = DecisionTreeClassifier(random_state=2)
rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75
dtc1.fit(x[itr],y[itr])
dtc2.fit(x[itr],y[itr])
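# Count the held-out samples on which the two classifiers disagree.
# The outer parentheses make this line behave the same on Python 2 and Python 3.
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())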
Expected Results
Should print 0.
Actual Results
Prints 1.
Versions
Linux-4.4.0-62-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.12.0')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')