DecisionTreeClassifier should be deterministic for default parameters or documented indicating otherwise #8443

@notmatthancock

Description

The default parameters for DecisionTreeClassifier do not indicate randomness, yet randomness is present, which can lead to unexpected behavior. The random_state can of course be set to obtain reproducible results when max_features < n_features or splitter = 'random'. However, with the default settings (i.e., max_features = None and splitter = 'best'), the algorithm is expected to behave deterministically but does not (see example below).

This appears to be the result of a design choice here: even if max_features = n_features, the algorithm still randomly samples up to max_features features at each split.

This can result in different classifiers even though they were trained on identical data. This behavior is unexpected when max_features = n_features and splitter != 'random'. The easiest fix is probably just to document this behavior, while the more involved fix would be to change the code so that it does not sample randomly at each split when max_features = n_features.
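
To illustrate how sampling order alone could change the fitted tree, here is a toy sketch. It is not scikit-learn's actual splitter: `pick_best_feature` and the `scores` values are hypothetical, and the sketch assumes that when two candidate splits are equally good, the one visited first is kept.

```python
import random


def pick_best_feature(scores, order):
    """Return the first feature in `order` that reaches the highest score.

    Toy model only: `scores` stands in for the split quality of each feature.
    """
    best_feature, best_score = None, float("-inf")
    for f in order:
        # Strict comparison: on a tie, the feature visited first is kept.
        if scores[f] > best_score:
            best_feature, best_score = f, scores[f]
    return best_feature


# Hypothetical split-quality values in which features 1 and 2 tie for best.
scores = [0.3, 0.9, 0.9, 0.1]

print(pick_best_feature(scores, [0, 1, 2, 3]))  # visits 1 before 2 -> 1
print(pick_best_feature(scores, [3, 2, 1, 0]))  # visits 2 before 1 -> 2

# With a seeded shuffle, the winner depends only on the seed.
for seed in (1, 2):
    order = list(range(len(scores)))
    random.Random(seed).shuffle(order)
    print(seed, pick_best_feature(scores, order))
```

Under this assumption, every feature is still considered at each split, yet the selected feature can differ between random states whenever there is a tie.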

Steps/Code to Reproduce

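import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

# Two classifiers with default parameters that differ only in random_state.
dtc1 = DecisionTreeClassifier(random_state=1)
dtc2 = DecisionTreeClassifier(random_state=2)

# Boolean mask selecting roughly 75% of the samples for training.
rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75

# Fit both classifiers on the same training data.
dtc1.fit(x[itr], y[itr])
dtc2.fit(x[itr], y[itr])

# Count the held-out samples on which the two classifiers disagree.
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())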

Expected Results

Should print 0.

Actual Results

Prints 1.
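
A possible way to check whether two fitted trees actually differ in structure, rather than only in their predictions, is to compare the arrays stored on the `tree_` attribute. This is a minimal sketch: the helper `same_structure` is hypothetical, and whether the trees fitted with different seeds differ will depend on the data and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def same_structure(a, b):
    """Return True if two fitted trees use identical split features and thresholds."""
    ta, tb = a.tree_, b.tree_
    return (
        ta.node_count == tb.node_count
        and np.array_equal(ta.feature, tb.feature)
        and np.array_equal(ta.threshold, tb.threshold)
    )


X, y = load_iris(return_X_y=True)

# Same seed twice: the fitted trees are expected to be identical.
fixed_a = DecisionTreeClassifier(random_state=0).fit(X, y)
fixed_b = DecisionTreeClassifier(random_state=0).fit(X, y)
print(same_structure(fixed_a, fixed_b))

# Different seeds: the trees may or may not differ.
other = DecisionTreeClassifier(random_state=1).fit(X, y)
print(same_structure(fixed_a, other))
```

Fixing random_state, as in the first comparison, is the workaround for obtaining reproducible trees with the default parameters.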

Versions

Linux-4.4.0-62-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.12.0')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')
