Pprett/gradient boosting #6

Merged
pprett merged 7 commits into pprett:gradient_boosting from glouppe:pprett/gradient_boosting
Mar 20, 2012

Conversation

glouppe commented Mar 19, 2012

This is my first bunch of commits regarding your PR.

I really like how you managed to remove the "terminal" mechanisms from the Tree code :)

My changes are the following:

  • Moved _compute_feature_importances into Tree
  • Moved _build_tree into Tree
  • Use DTYPE instead of float64
  • Cosmetic fixes and PEP8

Most of those do not actually concern the boosting module. I still have to review the gradient_boosting.py file in more depth (later today or tomorrow).

pprett (Owner) commented Mar 19, 2012

@glouppe some of the tests fail due to numerical issues (an aftermath of the dtype change). I fixed those, but I notice a performance regression in the following benchmark:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import gradient_boosting

X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)

gbrt = gradient_boosting.GradientBoostingClassifier(n_estimators=250,
                                                    min_samples_split=5,
                                                    max_depth=1,
                                                    learn_rate=1.0,
                                                    random_state=0)
%timeit gbrt.fit(X, y)

It goes from:

1 loops, best of 3: 1.32 s per loop

to:

1 loops, best of 3: 1.97 s per loop

pprett (Owner) commented Mar 19, 2012

Hmm... I think I hunted it down:

379       250       768998   3076.0     42.2              residual = loss.negative_gradient(y, y_pred, k=k)

This is four times the usual timing, due to y and y_pred having different dtypes.
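The mixed-dtype cost described above can be seen directly. A minimal sketch (array size and names are illustrative, not from the PR): when y is float32 and y_pred is float64, NumPy upcasts on every element-wise operation, which adds a conversion pass; with matching dtypes no conversion is needed.

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(100000).astype(np.float32)    # target in float32, as in the tree code
y_pred_64 = rng.rand(100000)               # predictions in float64
y_pred_32 = y_pred_64.astype(np.float32)   # same values, matching dtype

# Mixed dtypes: the result is silently upcast to float64.
assert (y - y_pred_64).dtype == np.float64
# Matching dtypes: stays float32, no per-element conversion.
assert (y - y_pred_32).dtype == np.float32
```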

pprett (Owner) commented on the diff:

Why should init_error or best_error have type DTYPE, which is the dtype of the data array? Either use np.float32 or np.float64. I tend to use np.float64 whenever possible (i.e., when memory consumption is not an issue).

pprett (Owner) commented Mar 19, 2012

Wow... it seems that 32-bit floating-point arithmetic in numpy is substantially slower than 64-bit arithmetic:

%timeit bd.negative_gradient(y, y_pred)
1000 loops, best of 3: 546 us per loop

vs. 32-bit:

%timeit bd.negative_gradient(y_float32, y_pred_float32)
100 loops, best of 3: 3.01 ms per loop

It seems that np.exp is to blame.
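This comparison can be reproduced outside IPython with the stdlib timeit module. A hedged sketch using plain np.exp (the loss object bd is not reconstructed here; the array size is illustrative, and the float32/float64 ratio depends heavily on the NumPy build and CPU):

```python
import timeit
import numpy as np

x64 = np.linspace(-5.0, 5.0, 100000)   # float64 input
x32 = x64.astype(np.float32)           # same values in float32

# Time np.exp on each dtype; number=100 keeps the run short.
t64 = timeit.timeit(lambda: np.exp(x64), number=100)
t32 = timeit.timeit(lambda: np.exp(x32), number=100)
print("float64: %.4fs  float32: %.4fs" % (t64, t32))
```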

glouppe (Author) commented Mar 19, 2012

Wow, that's huge. I was not aware of this. Actually, my machine is 32-bit, which is why I like having the option of not using float64. I will take a deeper look at it tomorrow, and I'll revert my changes if I come to no good solution.

pprett (Owner) commented Mar 19, 2012

It might be slower on 64-bit machines, but a 6-fold increase is too large. numpy has an npy_expf function that operates on float32, but I don't know whether it is exposed in the NumPy API... I'll keep you posted.


pprett (Owner) commented Mar 19, 2012

Gilles, I just checked the other (regression) models in sklearn: it seems that only tree and ensemble use 32-bit floating point for the target values. SVM and Lasso/ElasticNet/SGDRegressor explicitly convert to 64-bit. I'd rather use 64-bit for tree and ensemble too. This has the advantage that results are more stable (I remember we use np.mean(y) somewhere in our code, which might pose an underflow problem). AFAIK we chose 32-bit because of memory consumption, which is only an issue for X, not for y.
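The explicit conversion described above can be sketched as follows. Note that check_target is a hypothetical helper for illustration, not actual sklearn API; it mirrors the idea of coercing y to float64 regardless of input dtype, since y is small compared to X:

```python
import numpy as np

def check_target(y):
    # Hypothetical helper mirroring what SVM/Lasso/SGDRegressor do:
    # always hand the estimator a contiguous float64 target array,
    # which avoids precision issues such as underflow in np.mean(y).
    return np.ascontiguousarray(y, dtype=np.float64)

y32 = np.array([0.0, 1.0, 1.0], dtype=np.float32)
y = check_target(y32)
assert y.dtype == np.float64
```

The memory argument holds because y has shape (n_samples,) while X has shape (n_samples, n_features), so doubling the width of y costs far less than doubling X.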

glouppe (Author) commented Mar 19, 2012

Okay, I agree. I'll revert my changes tomorrow.


Commit note: This reverts commit 3509e16.

Conflicts:

	sklearn/ensemble/gradient_boosting.py
	sklearn/tree/tree.py
glouppe (Author) commented Mar 20, 2012

I just pushed a revert commit.

pprett merged commit cc2bab9 into pprett:gradient_boosting Mar 20, 2012
pprett (Owner) commented Mar 20, 2012

@glouppe thanks - I updated whats_new.rst and merged.

pprett pushed a commit that referenced this pull request Jul 25, 2013
nitpick fixes, pep8 and fix math equations
pprett pushed a commit that referenced this pull request Mar 18, 2014
Revised text classification chapter