
[MRG+1] fix #6101 GradientBoosting decision_function for sparse inputs#6116

Merged
jnothman merged 12 commits into scikit-learn:master from olologin:GradientBoostingFix
Oct 15, 2016

Conversation

@olologin
Contributor

@olologin olologin commented Jan 5, 2016

Fix for issue #6101
Please make suggestions.

@olologin olologin changed the title GradientBoosting decision_function GradientBoosting decision_function for sparse inputs Jan 5, 2016
@olologin olologin changed the title GradientBoosting decision_function for sparse inputs [MRG] fix #6101 GradientBoosting decision_function for sparse inputs Jan 8, 2016
@aflaxman
Contributor

This makes my example #6101 work. Thanks!

@jmschrei
Member

This looks mostly good to me. You should squash the commits as well. @glouppe can you take a look?

@amueller
Member

Have you checked the prediction speed for single samples? There is a benchmark in the benchmarks folder, I think.

@olologin
Contributor Author

@amueller Hmm, it shouldn't slow anything down, because this PR only adds prediction functionality for sparse matrices. Also, can you point me at that benchmark? I can't find anything related to GradientBoosting in the benchmarks folder.
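For context, here is a minimal sketch of the behaviour this PR enables (toy data and parameter values are my own, not from the PR): prediction on a sparse matrix, with results matching the dense path.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 20)
X[X < 0.8] = 0.0          # make ~80% of the entries zero
y = rng.randint(0, 2, 100)

clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# With this fix, predict (and decision_function) accept sparse input
# directly instead of raising on it.
dense_pred = clf.predict(X)
sparse_pred = clf.predict(csr_matrix(X))
```

Since the sparse matrix holds the same values, both calls should traverse the same tree paths and return identical predictions.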

@olologin
Contributor Author

olologin commented Apr 24, 2016

Sorry for the late response, could someone review it? @amueller, @glouppe

On the same dataset, dense prediction takes ~958 ms and sparse ~1.2 s:

20 newsgroups
=============
X_train.shape = (11314, 130107)
X_train density = 0.001214353154362896
y_train (11314,)
X_test (3500, 130107)
X_test.format = csr
X_test.dtype = float32
y_test (3500,)

Classifier Training
===================
Training GradientBoostingClassifier_100_trees ...

1 loop, best of 3: 958 ms per loop
1 loop, best of 3: 1.2 s per loop

I made this benchmark based on bench_20_newsgroups.py:

from __future__ import print_function, division
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.utils.validation import check_array

from sklearn.ensemble import GradientBoostingClassifier

data_train = fetch_20newsgroups_vectorized(subset="train")
data_test = fetch_20newsgroups_vectorized(subset="test")
X_train_sp = check_array(data_train.data, dtype=np.float32,
                      accept_sparse="csc")
checked_test = check_array(data_test.data, dtype=np.float32, accept_sparse="csr")
X_test_sp = checked_test[:3500, :]
y_train_sp = data_train.target
y_test_sp = data_test.target[:3500]

X_test_dense = X_test_sp.todense()

print("20 newsgroups")
print("=============")
print("X_train.shape = {0}".format(X_train_sp.shape))
print("X_train density = {0}"
      "".format(X_train_sp.nnz / np.product(X_train_sp.shape)))
print("y_train {0}".format(y_train_sp.shape))
print("X_test {0}".format(X_test_sp.shape))
print("X_test.format = {0}".format(X_test_sp.format))
print("X_test.dtype = {0}".format(X_test_sp.dtype))
print("y_test {0}".format(y_test_sp.shape))
print()

print("Classifier Training")
print("===================")
accuracy, test_time = {}, {}

name = "GradientBoostingClassifier_100_trees"
clf = GradientBoostingClassifier(n_estimators=100)
try:
    clf.set_params(random_state=0)
except (TypeError, ValueError):
    pass

print("Training %s ... " % name, end="")
clf.fit(X_train_sp, y_train_sp)

%timeit clf.predict(X_test_dense)
%timeit clf.predict(X_test_sp)
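The `%timeit` magics above only work inside IPython. Outside a notebook, the same comparison can be made with the standard `timeit` module; this is a self-contained sketch on small synthetic data of my own (not the 20 newsgroups setup), so the absolute numbers are not comparable to those above.

```python
import timeit
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 50)
X[X < 0.9] = 0.0          # ~90% zeros
y = rng.randint(0, 2, 200)

clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
X_sp = csr_matrix(X)

# best-of-3, 10 predict calls per timing run
t_dense = min(timeit.repeat(lambda: clf.predict(X), number=10, repeat=3))
t_sparse = min(timeit.repeat(lambda: clf.predict(X_sp), number=10, repeat=3))
print("dense:  %.4f s" % t_dense)
print("sparse: %.4f s" % t_sparse)
```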

@l3link

l3link commented Jun 20, 2016

This fix is incredibly useful for very sparse data sets (>95% zero values). Converting a medium-sized data set (a 60k x 3k matrix) from dense to sparse reduces training time from hours to minutes (on a single c3.8xlarge AWS machine). Any chance we can get this merged into the next release, or at least into develop? I've been building this from source for the last 2 months.
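For readers wanting to reproduce this kind of speedup, the conversion is a one-liner; this sketch uses synthetic data of my own at roughly the sparsity mentioned above (sparse fit support itself predates this PR; this PR adds the prediction side).

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X_dense = rng.rand(1000, 300)
X_dense[X_dense < 0.95] = 0.0   # >95% zero values
y = rng.randint(0, 2, 1000)

# CSC format suits the column-wise feature scans done during tree fitting
X_sparse = csc_matrix(X_dense)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X_sparse, y)
```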

@glouppe
Contributor

glouppe commented Jun 21, 2016

From a quick read, this looks good, besides some minor cosmit issues.

Py_ssize_t K,
Py_ssize_t n_samples,
Py_ssize_t n_features,
float64 *out):
Contributor


Should be indented with DTYPE_t, not with the opening parenthesis.

@jaquesgrobler
Member

I had a quick read-through. Apart from @glouppe's comments above, this looks great.
A very useful fix. I'm +1 for merging once the last points are addressed 👍

@jmschrei
Member

This seems extremely useful. Should be merged as soon as comments are addressed.

@jnothman
Member

@olologin or whoever does the merge should remember to add a what's new entry.

@olologin olologin force-pushed the GradientBoostingFix branch from 7c10b48 to 4ce106e Compare June 26, 2016 09:15
@olologin
Contributor Author

olologin commented Jun 26, 2016

@glouppe , @jaquesgrobler , @jnothman .

Fixed. The AppVeyor build fails, but it seems unrelated to my changes.

And sorry for the delay, I had to finish paperwork at my university.

@olologin
Contributor Author

ping @jnothman

@amueller
Member

please rebase.

By `Sebastian Säger`_ and `YenChen Lin`_.

- :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor`
now support sparse input for ``predict`` method.
Member


*the predict method

@amueller amueller added this to the 0.18 milestone Jul 28, 2016
@amueller
Member

@glouppe @jnothman does this have your +1 and can be merged, or should I review?

@olologin olologin force-pushed the GradientBoostingFix branch from 4ce106e to 0816606 Compare July 28, 2016 18:29
@olologin olologin changed the title [MRG] fix #6101 GradientBoosting decision_function for sparse inputs [MRG+1] fix #6101 GradientBoosting decision_function for sparse inputs Jul 30, 2016
@jnothman
Member

I've not looked at this yet.

@jnothman
Member

jnothman commented Aug 2, 2016

Btw, at a skim this looks good, but I'd like to look through it more closely.

By `Sebastian Säger`_ and `YenChen Lin`_.

- :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor`
now support sparse input for the ``predict`` method.
Member


might as well say for "prediction" to include other prediction methods.

@jmschrei
Member

Now that I think about it, maybe it would be worth adding a test ensuring that the dense and sparse versions run within a factor of 2 of each other? @amueller what is your position on timings in unit tests? I don't want this to degrade in the future.

@jnothman
Member

Timings in unit tests are problematic. Relative timings are going to be dependent on nonzero density in X, apart from architecture issues. I'm -1 for such tests, though it is worth benchmarking (on one architecture) at PR time to see that nothing crazy is happening.
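In place of a timing test, a deterministic equivalence check is robust across architectures; this is a sketch of the idea (my own data and parameter choices, not the exact test added in the PR): the decision function on several sparse formats must match the dense result.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(42)
X = rng.rand(60, 10)
X[X < 0.7] = 0.0
y = rng.randint(0, 3, 60)   # three classes

clf = GradientBoostingClassifier(n_estimators=10, random_state=42).fit(X, y)
dense_df = clf.decision_function(X)

# The same input in any sparse format must give the same decision values
for sparse_fmt in (csr_matrix, csc_matrix, coo_matrix):
    sparse_df = clf.decision_function(sparse_fmt(X))
    np.testing.assert_array_almost_equal(dense_df, sparse_df)
```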

@olologin olologin force-pushed the GradientBoostingFix branch from 056f780 to 8d21c0c Compare October 14, 2016 17:56
@jmschrei
Member

LGTM. This has my +1.

<https://github.com/scikit-learn/scikit-learn/pull/6178>`_) by `Bertrand
Thirion`_

- :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor`
Member


I think this belongs under enhancements

@jnothman
Member

Move the what's new and we'll merge. Thanks!

@olologin
Contributor Author

Thanks for review 👍

@jnothman jnothman merged commit 78dbcb2 into scikit-learn:master Oct 15, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
@Sandy4321

As olologin wrote on Aug 27, 2016:

> I fixed the performance issue; now it works almost as fast as the dense version in the test provided by @ogrisel above: 2.773 s for dense and 3.104 s for sparse. Also, I found and fixed a mistake in the safe_realloc usage from tree.pyx and in the function for sparse prediction which I added here: it allocated more memory than needed.

Could somebody share test case code?

@rth
Member

rth commented May 18, 2018

@Sandy4321 see #6101 (comment)

@Sandy4321

I see, so your code looks like this
'''
from future import print_function, division
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.utils.validation import check_array

from sklearn.ensemble import GradientBoostingClassifier

data_train = fetch_20newsgroups_vectorized(subset="train")
data_test = fetch_20newsgroups_vectorized(subset="test")
X_train_sp = check_array(data_train.data, dtype=np.float32,
accept_sparse="csc")
checked_test = check_array(data_test.data, dtype=np.float32, accept_sparse="csr")
X_test_sp = checked_test[:3500, :]
y_train_sp = data_train.target
y_test_sp = data_test.target[:3500]

X_test_dense = X_test_sp.todense()

print("20 newsgroups")
print("=============")
print("X_train.shape = {0}".format(X_train_sp.shape))
print("X_train density = {0}"
"".format(X_train_sp.nnz / np.product(X_train_sp.shape)))
print("y_train {0}".format(y_train_sp.shape))
print("X_test {0}".format(X_test_sp.shape))
print("X_test.format = {0}".format(X_test_sp.format))
print("X_test.dtype = {0}".format(X_test_sp.dtype))
print("y_test {0}".format(y_test_sp.shape))
print()

print("Classifier Training")
print("===================")
accuracy, test_time = {}, {}

name = "GradientBoostingClassifier_100_trees"
clf = GradientBoostingClassifier(n_estimators=100)
try:
clf.set_params(random_state=0)
except (TypeError, ValueError):
pass

print("Training %s ... " % name, end="")
clf.fit(X_train_sp, y_train_sp)

%timeit clf.predict(X_test_dense)
%timeit clf.predict(X_test_sp)
'''

@Sandy4321

???
I put code between ''' and '''

@rth
Member

rth commented May 23, 2018

Code needs to be between ``` not ''' :)

@Sandy4321


my code

like this?

@Sandy4321

Great, it works!
Thanks
