[MRG+1] feature: add beta-threshold early stopping for decision tree growth #6954
glouppe merged 16 commits into scikit-learn:master
Conversation
@lesteve can you help me clear the travis cache?
# verify leaf nodes without beta have impurity 0
est = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                    random_state=0)
est.fit(X, y)
Maybe test for the expected value of beta (=0 right?)
yup, i'll do that.
You don't seem to validate whether the `beta` value is within a valid range.

And you should add the `beta` parameter to the forest-based ensemble estimators as well.

@raghavrv by "validate whether the `beta` value is valid", do you mean checking that it lies in a specific interval?

Yes, `beta` has a range of valid values.

And thanks for the PR!
Can't beta be greater than 1, since the possible impurity values can be greater than 1 (in the case of regression)?

Yes. Sorry for not being clear. You'll have to consider classification and regression separately and validate them separately, I think...
Argh, sorry, entropy is never greater than one. I was thinking that gini impurity can be greater than one, but since we use the gini coefficient it will also be within the [0, 1] range.
Also sorry for focusing on triviality. BTW, don't we need a warm start with this early stopping method? One where you can reset the beta and continue splitting the nodes that were not split fully?
Done, sorry I didn't understand what you meant the first time around. Also added some tests to verify they properly throw errors. |
Warm start would be a good addition in combination with this, but I think that should be a separate PR.

I changed my mind -- I think this parameter should actually be named `min_impurity_split`.
sklearn/ensemble/forest.py
Outdated
beta : float, optional (default=0.)
    Threshold for early stopping in tree growth. If the impurity
    of a node is below the threshold, the node is a leaf.
Might want to be more explicit here, saying that a node will split if its impurity is above the min_impurity_split threshold, otherwise it is a leaf.

+1 This would align well with our existing stopping criteria params...
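For illustration, here's a minimal pure-Python sketch of the split-or-leaf rule being discussed (the helper names here are hypothetical, not scikit-learn internals; the 1e-7 default is the one this PR eventually settles on):

```python
def gini(counts):
    """Gini impurity of a node, given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def is_leaf(counts, min_impurity_split=1e-7):
    """A node splits only if its impurity is above the threshold."""
    return gini(counts) <= min_impurity_split

print(is_leaf([10, 0]))      # pure node: True
print(is_leaf([5, 5]))       # impurity 0.5: False, keep splitting
print(is_leaf([5, 5], 0.6))  # user raised the threshold above 0.5: True
```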
sklearn/tree/tree.py
Outdated
raise ValueError("beta must be a float")
if is_classification:
    if not 0. <= beta <= 1.:
        raise ValueError("beta must be in range [0., 1.] "
It is true that classification shouldn't be above 1.0, but entropy has a stricter bound depending on the number of classes. I can't remember if they are reweighted to scale to 1.0 though? It might be better to just take in a positive number and let users figure it out.
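To make the bound concrete: a uniform distribution over k classes maximizes entropy at log2(k) bits, so the entropy criterion exceeds 1.0 as soon as there are more than two classes (a quick check, not part of the PR):

```python
import math

def max_entropy(n_classes):
    """Entropy is maximized by the uniform distribution: H = log2(k)."""
    p = 1.0 / n_classes
    return -sum(p * math.log2(p) for _ in range(n_classes))

print(max_entropy(2))  # 1.0: binary entropy tops out at one bit
print(max_entropy(4))  # 2.0: already above the [0, 1] classification bound
```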
Also please add a whatsnew entry under new features.
@raghavrv done! appveyor seems to be failing in an odd way so i repushed to trigger another build...
hmm, seems like appveyor failures are related to #4016
hmm... i just noticed that the appveyor tests on github redirect to https://ci.appveyor.com/project/agramfort/scikit-learn/build/1.0.276, which is on @agramfort's account. Is there any reason why we aren't using the sklearn-ci account (it passes tests there)? https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.6961
I may have clicked something on appveyor...
ping @ogrisel ?
I feel this is good to go. Thanks. @glouppe a second review and merge?
sklearn/tree/_tree.pyx
Outdated
is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT)
is_leaf = (is_leaf or
           (impurity <= MIN_IMPURITY_SPLIT) or
           (impurity < min_impurity_split))
This is clearly confusing.
i agree! I renamed it to LEAF_MIN_IMPURITY but i think that's also a little bit confusing. do you have any suggestions for suitable names?
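Per the later commits in this PR ("remove constant and set min_impurity_split to 1e-7 by default"), the naming confusion was eventually resolved by dropping the hard-coded constant and folding its value into the parameter's default. A rough pure-Python sketch of that resolution (names simplified from the actual Cython code):

```python
# Before: two near-identical names, a module constant plus a user parameter:
#   is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT) \
#                     or (impurity < min_impurity_split)

# After: a single user-facing parameter whose default absorbs the constant.
def leaf_check(impurity, is_leaf, min_impurity_split=1e-7):
    return is_leaf or impurity <= min_impurity_split

print(leaf_check(0.0, False))       # numerically pure node: True
print(leaf_check(0.3, False))       # impure node under the default: False
print(leaf_check(0.3, False, 0.5))  # user-raised threshold: True
```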
@glouppe is there anything else that needs to be done on this PR?
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_depth=3, min_impurity_split=0.,init=None, random_state=None,
also, missing space after the comma
LGTM once the defaults are properly changed to `1e-7`.
Thanks Nelson! I'll wait for our friend Travis to arrive, and I'll merge.
Sorry for that oversight! Not quite sure what I was thinking, changing the docstrings and not the actual code 😝 thanks again for taking a look @glouppe
|
👍 |
|
Bim! Happy to see GSoC efforts materialize :)
min_impurity_split : float, optional (default=1e-7)
    Threshold for early stopping in tree growth. A node will split
    if its impurity is above the threshold, otherwise it is a leaf.
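A toy illustration of this documented behavior: a minimal recursive grower over 1-D data (all names here are hypothetical; this is a sketch of the idea, not scikit-learn's actual splitter):

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def n_leaves(X, y, min_impurity_split=1e-7):
    """Count leaves; a node splits only while impurity exceeds the threshold."""
    if gini(y) <= min_impurity_split or len(set(X)) == 1:
        return 1  # node becomes a leaf
    thr = sorted(X)[len(X) // 2]  # crude median split, enough for the sketch
    left = [(x, t) for x, t in zip(X, y) if x < thr]
    right = [(x, t) for x, t in zip(X, y) if x >= thr]
    if not left or not right:
        return 1
    return (n_leaves(*zip(*left), min_impurity_split)
            + n_leaves(*zip(*right), min_impurity_split))

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
print(n_leaves(X, y))       # 6: grows until every leaf is pure
print(n_leaves(X, y, 0.5))  # 1: root impurity (0.5) is not above the threshold
```

Raising the threshold makes the tree stop earlier, which is exactly the pre-pruning effect discussed below.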
This is great, thanks! It would be awesome if there was some example, though, and please add the "versionadded" tag to all the docstrings.
oops, didn't realize the need for the "versionadded" tags, thanks. What sort of example were you thinking? An inline one in the docs, or a full-fledged example? I'll go ahead and add these in a new PR.
Maybe an example that discusses the many pre-pruning options and how they change the tree? I think a full-fledged example on pruning would be good, in particular if we get post-pruning at some point.
@amueller what pre-pruning methods in particular were you thinking about? The ones I'm thinking of are `max_depth`, `max_leaf_nodes`, `min_samples_split`, `min_samples_leaf`, and the new `min_impurity_split`.

Yeah. Maybe also `min_weight_fraction_leaf`.
@amueller I wrote a preliminary version of what could become an example as a GSoC blog post; could you take a quick look and let me know what you think / what extra content should be added for an example? Link: http://blog.nelsonliu.me/2016/08/06/gsoc-week-10-pr-6954-prepruning-decision-trees/
…growth (scikit-learn#6954)

* feature: add beta-threshold early stopping for decision tree growth
* check if value of beta is greater than or equal to 0
* test if default value of beta is 0 and edit input validation error message
* feature: separately validate beta for reg. and clf., and add tests for it
* feature: add beta to forest-based ensemble methods
* feature: add separate condition to determine that beta is float
* feature: add beta to gradient boosting estimators
* rename parameter to min_impurity_split, edit input validation and associated tests
* chore: fix spacing in forest and force recompilation of grad boosting extension
* remove trivial comment in grad boost and add whats new
* edit wording in test comment / rebuild
* rename constant with the same name as our parameter
* edit line length for what's new
* remove constant and set min_impurity_split to 1e-7 by default
* fix docstrings for new default
* fix defaults in gradientboosting and forest classes
Great, thanks @nelson-liu |
Reference Issue
Proposed here: #6557 (comment)
What does this implement/fix? Explain your changes.
Implements a stopping criterion for decision tree growth by checking if the impurity of a node is less than a user-defined threshold beta. If it is, that node is set as a leaf and no further splits are made on it. Also adds a test.
Any other comments?
I'm not sure if my test is proper. Right now, I create a tree with `min_samples_split = 2` and `min_samples_leaf = 1` and `beta` undefined (so 0 by default) and fit it on data. I then assert whether all the leaves have an impurity of 0, as they should due to the values of `min_samples_split` and `min_samples_leaf`. To test beta, I do the same thing (including using the above values of `min_samples_split` and `min_samples_leaf`), but add a value to the `beta` parameter at tree construction and instead check whether the impurity of all the leaves lies within [0, beta).
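A sketch of the first half of that test against the present-day scikit-learn API (where the parameter added in this PR has since been removed, but leaf impurities are still exposed via `tree_.impurity`; synthetic, cleanly separable data is used so fully grown leaves are guaranteed pure):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two well-separated clusters: the fully grown tree must reach pure leaves.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

est = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1,
                             random_state=0).fit(X, y)

tree = est.tree_
leaf_mask = tree.children_left == -1  # -1 marks a leaf node
leaf_impurity = tree.impurity[leaf_mask]
print(np.all(leaf_impurity < 1e-7))  # True: every leaf is (numerically) pure
```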