[MRG+1] fix #6101 GradientBoosting decision_function for sparse inputs#6116
[MRG+1] fix #6101 GradientBoosting decision_function for sparse inputs#6116jnothman merged 12 commits intoscikit-learn:masterfrom
Conversation
|
This makes my example #6101 work. Thanks! |
|
This looks mostly good to me. You should squash the commits as well. @glouppe can you take a look? |
|
Have you check the prediction speed for single samples? There is a benchmark in the benchmarks folder, I think. |
|
@amueller Hmm, it shouldn't slow down anything, because this PR only adds prediction functionality for sparse matricies. Also, can you point me at that benchmark? I can't find anything related to GradientBoosting in benchmark folder. |
|
Sorry for late response, could someone review it? @amueller, @glouppe On same dataset dense prediction takes ~958ms, sparse ~1.2s I've made some benchmark based on bench_20_newsgroups.py: |
|
This fix is incredibly useful for very sparse data sets ( >95% 0 values). Converting a medium sized data set with a 60k x 3k matrix from dense to sparse reduces training time from hours to minutes (on a c3.8xlarge AWS single machine). Any chance we can get this merged into the next release? Or at least into develop? I'm getting annoyed building this from source for the last 2 months. |
|
From a quick read, this looks good, besides some minor cosmit issues. |
| Py_ssize_t K, | ||
| Py_ssize_t n_samples, | ||
| Py_ssize_t n_features, | ||
| float64 *out): |
There was a problem hiding this comment.
Should be indented with DTYPE_t, not with the opening parenthesis.
|
I had a quick read-through.. Apart from @glouppe 's comments above, this looks great. |
|
This seems extremely useful. Should be merged as soon as comments are addressed. |
|
@olologin or whoever does the merge should remember to add a what's new entry. |
7c10b48 to
4ce106e
Compare
|
@glouppe , @jaquesgrobler , @jnothman . Fixed, AppVeyour build fails, but seems it's not my fault. And sorry for delay, I had to finish paperwork with my university. |
|
ping @jnothman |
|
please rebase. |
doc/whats_new.rst
Outdated
| By `Sebastian Säger`_ and `YenChen Lin`_. | ||
|
|
||
| - :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` | ||
| now support sparse input for ``predict`` method. |
4ce106e to
0816606
Compare
|
I've not looked at this yet. |
|
Btw, at a skim this looks good, but I'd like to look through it more closely. |
doc/whats_new.rst
Outdated
| By `Sebastian Säger`_ and `YenChen Lin`_. | ||
|
|
||
| - :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` | ||
| now support sparse input for the ``predict`` method. |
There was a problem hiding this comment.
might as well say for "prediction" to include other prediction methods.
|
Now that I think about it actually, maybe it would be worth adding a test ensuring that both the dense and sparse versions run within a factor of 2 of each other? @amueller what is your position on timings in unit tests? I don't want this to deprecate in the future. |
|
Timings in unit tests are problematic. Relative timings are going to be dependent on nonzero density in X, apart from architecture issues. I'm -1 for such tests, though it is worth benchmarking (on one architecture) at PR time to see that nothing crazy is happening. |
…ree_inplace_fast_dense fixed
056f780 to
8d21c0c
Compare
|
LGTM. This has my +1. |
doc/whats_new.rst
Outdated
| <https://github.com/scikit-learn/scikit-learn/pull/6178>`_) by `Bertrand | ||
| Thirion`_ | ||
|
|
||
| - :class:`ensemble.GradientBoostingClassifier` and :class:`ensemble.GradientBoostingRegressor` |
There was a problem hiding this comment.
I think this belongs under enhancements
|
Move the what's new and we'll merge. Thanks! |
|
Thanks for review 👍 |
|
as was written |
|
I see so your code looks like this from sklearn.datasets import fetch_20newsgroups_vectorized from sklearn.ensemble import GradientBoostingClassifier data_train = fetch_20newsgroups_vectorized(subset="train") X_test_dense = X_test_sp.todense() print("20 newsgroups") print("Classifier Training") name = "GradientBoostingClassifier_100_trees" print("Training %s ... " % name, end="") %timeit clf.predict(X_test_dense) |
|
??? |
|
Code needs to be between |
like this? |
|
great it works!!!!! |
Fix for issue #6101
Please make suggestions.