[MRG+1] feature: add beta-threshold early stopping for decision tree growth #6954
glouppe merged 16 commits into scikit-learn:master
Conversation
@lesteve can you help me clear the travis cache?
# verify leaf nodes without beta have impurity 0
est = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                    random_state=0)
est.fit(X, y)
Maybe test for the expected value of beta (=0 right?)
yup, i'll do that.
You don't seem to validate whether the `beta` value is within a valid range.

And you should add the `beta` parameter to the forest-based ensemble estimators as well.

@raghavrv by "validate whether the `beta` value is valid", do you mean checking that it lies in a specific interval?

Yes, `beta` has a range of valid values.

And thanks for the PR!
Can't beta be greater than 1, since the possible impurity values can be greater than 1 (in the case of regression)?

Yes. Sorry for not being clear. You'll have to consider classification and regression separately and validate them separately, I think...
Argh, sorry, entropy is never greater than one. I was thinking that gini impurity can be greater than one, but since we use the gini coefficient it will also be within the [0, 1] range.
Also sorry for focusing on triviality. BTW, don't we need a warm start with this early stopping method? One where you can reset the beta and continue splitting the nodes that were not split fully?
Done, sorry I didn't understand what you meant the first time around. Also added some tests to verify they properly throw errors. |
Warm start would be a good addition in combination with this, but I think that should be a separate PR.

I changed my mind -- I think this parameter should actually be named `min_impurity_split`.
sklearn/ensemble/forest.py
Outdated
beta : float, optional (default=0.)
    Threshold for early stopping in tree growth. If the impurity
    of a node is below the threshold, the node is a leaf.
Might want to be more explicit here, saying that a node will split if its impurity is above the min_impurity_split threshold, otherwise it is a leaf.

+1 This would align well with our existing stopping criteria params...
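For illustration, here's a minimal pure-Python sketch of the split-or-leaf rule being discussed (the helper names here are hypothetical, not scikit-learn internals; the 1e-7 default is the one this PR eventually settles on):

```python
def gini(counts):
    """Gini impurity of a node, given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def is_leaf(counts, min_impurity_split=1e-7):
    """A node splits only if its impurity is above the threshold."""
    return gini(counts) <= min_impurity_split

print(is_leaf([10, 0]))      # pure node: True
print(is_leaf([5, 5]))       # impurity 0.5: False, keep splitting
print(is_leaf([5, 5], 0.6))  # user raised the threshold above 0.5: True
```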
sklearn/tree/tree.py
Outdated
raise ValueError("beta must be a float")
if is_classification:
    if not 0. <= beta <= 1.:
        raise ValueError("beta must be in range [0., 1.] "
It is true that classification shouldn't be above 1.0, but entropy has a stricter bound depending on the number of classes. I can't remember if they are reweighted to scale to 1.0 though? It might be better to just take in a positive number and let users figure it out.
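To make the bound concrete: a uniform distribution over k classes maximizes entropy at log2(k) bits, so the entropy criterion exceeds 1.0 as soon as there are more than two classes (a quick check, not part of the PR):

```python
import math

def max_entropy(n_classes):
    """Entropy is maximized by the uniform distribution: H = log2(k)."""
    p = 1.0 / n_classes
    return -sum(p * math.log2(p) for _ in range(n_classes))

print(max_entropy(2))  # 1.0: binary entropy tops out at one bit
print(max_entropy(4))  # 2.0: already above the [0, 1] classification bound
```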
Also please add a whatsnew entry under new features.
@raghavrv done! appveyor seems to be failing in an odd way so i repushed to trigger another build...
hmm, seems like appveyor failures are related to #4016
hmm... i just noticed that the appveyor tests on github redirect to https://ci.appveyor.com/project/agramfort/scikit-learn/build/1.0.276, which is on @agramfort's account. Is there any reason why we aren't using the sklearn-ci account (it passes tests there)? https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.6961
I may have clicked something on appveyor...
ping @ogrisel ?
I feel this is good to go. Thanks. @glouppe a second review and merge?
sklearn/tree/_tree.pyx
Outdated
is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT)
is_leaf = (is_leaf or
           (impurity <= MIN_IMPURITY_SPLIT) or
           (impurity < min_impurity_split))
This is clearly confusing.
i agree! I renamed it to LEAF_MIN_IMPURITY but i think that's also a little bit confusing. do you have any suggestions for suitable names?
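Per the later commits in this PR ("remove constant and set min_impurity_split to 1e-7 by default"), the naming confusion was eventually resolved by dropping the hard-coded constant and folding its value into the parameter's default. A rough pure-Python sketch of that resolution (names simplified from the actual Cython code):

```python
# Before: two near-identical names, a module constant plus a user parameter:
#   is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT) \
#                     or (impurity < min_impurity_split)

# After: a single user-facing parameter whose default absorbs the constant.
def leaf_check(impurity, is_leaf, min_impurity_split=1e-7):
    return is_leaf or impurity <= min_impurity_split

print(leaf_check(0.0, False))       # numerically pure node: True
print(leaf_check(0.3, False))       # impure node under the default: False
print(leaf_check(0.3, False, 0.5))  # user-raised threshold: True
```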
@glouppe is there anything else that needs to be done on this PR?
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_depth=3, min_impurity_split=0.,init=None, random_state=None,
also, missing space after the comma
LGTM once the defaults are properly changed to `1e-7`.
Thanks Nelson! I'll wait for our friend Travis to arrive, and I'll merge.
Sorry for that oversight! Not quite sure what I was thinking, changing the docstrings and not the actual code 😝 thanks again for taking a look @glouppe
|
👍 |
|
Bim! Happy to see GSoC efforts materialize :)
min_impurity_split : float, optional (default=1e-7)
    Threshold for early stopping in tree growth. A node will split
    if its impurity is above the threshold, otherwise it is a leaf.
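A toy illustration of this documented behavior: a minimal recursive grower over 1-D data (all names here are hypothetical; this is a sketch of the idea, not scikit-learn's actual splitter):

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def n_leaves(X, y, min_impurity_split=1e-7):
    """Count leaves; a node splits only while impurity exceeds the threshold."""
    if gini(y) <= min_impurity_split or len(set(X)) == 1:
        return 1  # node becomes a leaf
    thr = sorted(X)[len(X) // 2]  # crude median split, enough for the sketch
    left = [(x, t) for x, t in zip(X, y) if x < thr]
    right = [(x, t) for x, t in zip(X, y) if x >= thr]
    if not left or not right:
        return 1
    return (n_leaves(*zip(*left), min_impurity_split)
            + n_leaves(*zip(*right), min_impurity_split))

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
print(n_leaves(X, y))       # 6: grows until every leaf is pure
print(n_leaves(X, y, 0.5))  # 1: root impurity (0.5) is not above the threshold
```

Raising the threshold makes the tree stop earlier, which is exactly the pre-pruning effect discussed below.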
This is great, thanks! It would be awesome if there was some example, though, and please add the "versionadded" tag to all the docstrings.
oops, didn't realize the need for the "versionadded" tags, thanks. What sort of example were you thinking? An inline one in the docs, or a full-fledged example? I'll go ahead and add these in a new PR.
Maybe an example that discusses the many pre-pruning options and how they change the tree? I think a full-fledged example on pruning would be good, in particular if we get post-pruning at some point.
@amueller what pre-pruning methods in particular were you thinking about? The ones I'm thinking of are `max_depth`, `max_leaf_nodes`, `min_samples_split`, `min_samples_leaf`, and the new `min_impurity_split`.

Yeah. Maybe also `min_weight_fraction_leaf`.
@amueller I wrote a preliminary version of what could become an example as a GSoC blog post; could you take a quick look and let me know what you think / what extra content should be added for an example? Link: http://blog.nelsonliu.me/2016/08/06/gsoc-week-10-pr-6954-prepruning-decision-trees/
…growth (scikit-learn#6954)

* feature: add beta-threshold early stopping for decision tree growth
* check if value of beta is greater than or equal to 0
* test if default value of beta is 0 and edit input validation error message
* feature: separately validate beta for reg. and clf., and add tests for it
* feature: add beta to forest-based ensemble methods
* feature: add separate condition to determine that beta is float
* feature: add beta to gradient boosting estimators
* rename parameter to min_impurity_split, edit input validation and associated tests
* chore: fix spacing in forest and force recompilation of grad boosting extension
* remove trivial comment in grad boost and add whats new
* edit wording in test comment / rebuild
* rename constant with the same name as our parameter
* edit line length for what's new
* remove constant and set min_impurity_split to 1e-7 by default
* fix docstrings for new default
* fix defaults in gradientboosting and forest classes
Great, thanks @nelson-liu |
Reference Issue
Proposed here: #6557 (comment)
What does this implement/fix? Explain your changes.
Implements a stopping criterion for decision tree growth by checking if the impurity of a node is less than a user-defined threshold beta. If it is, that node is set as a leaf and no further splits are made on it. Also adds a test.
Any other comments?
I'm not sure if my test is proper. Right now, I create a tree with `min_samples_split = 2` and `min_samples_leaf = 1` and `beta` undefined (so 0 by default) and fit it on data. I then assert whether all the leaves have an impurity of 0, as they should due to the values of `min_samples_split` and `min_samples_leaf`. To test beta, I do the same thing (including using the above values of `min_samples_split` and `min_samples_leaf`), but add a value to the `beta` parameter at tree construction and instead check whether the impurity of all the leaves lies within [0, beta).
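A sketch of the first half of that test against the present-day scikit-learn API (where the parameter added in this PR has since been removed, but leaf impurities are still exposed via `tree_.impurity`; synthetic, cleanly separable data is used so fully grown leaves are guaranteed pure):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two well-separated clusters: the fully grown tree must reach pure leaves.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

est = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1,
                             random_state=0).fit(X, y)

tree = est.tree_
leaf_mask = tree.children_left == -1  # -1 marks a leaf node
leaf_impurity = tree.impurity[leaf_mask]
print(np.all(leaf_impurity < 1e-7))  # True: every leaf is (numerically) pure
```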