[MRG+2] discrete branch: add an example for KBinsDiscretizer #10192
qinhanmin2014 merged 10 commits into scikit-learn:discrete from qinhanmin2014:discrete-example
Conversation
ping @jnothman Could you help me diagnose the Circle failure? Do we need to merge master into discrete again? Thanks a lot.
TomDLT left a comment
> Do we need to merge master into discrete again?
Done, Circle is not failing anymore.
```python
         linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
bins = enc.offset_[0] + enc.bin_width_[0] * np.arange(1, enc.n_bins_[0])
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
```
To have automatic ymin, ymax, you can use

```python
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
```

```python
# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y_no_noise = (np.sin(4 * X) + X)
```
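The automatic-ylim suggestion above can be sketched end to end. The data below mirrors the example's synthetic sine curve, but the bin edges are computed with `np.linspace` as a stand-in for the estimator's learned edges:

```python
# Sketch of the reviewer's vlines suggestion: take ymin/ymax from the axes
# instead of hard-coding -3 and 3.  Data and bin edges here are stand-ins.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(4 * X) + X + rnd.normal(size=100) / 3

plt.plot(X, y, 'o', c='k')
bins = np.linspace(-3, 3, 11)[1:-1]  # 9 interior edges of 10 uniform bins
# *plt.gca().get_ylim() unpacks the current y-limits as (ymin, ymax)
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
```

This way the vertical guide lines always span the scatter's actual range, with no magic numbers to update if the noise level changes.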
```python
plt.title("Result before discretization")
```

```python
# predict with transformed dataset
plt.subplot(122)
```
The comparison is clearer if the subplots have the same ylim (`sharey=True`). You can do:

```python
fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
...
plt.sca(axes[1])
...
```
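Fleshed out, the suggestion runs as below; the data, titles, and the use of `ncols=2` (matching the final example's side-by-side layout) are assumptions:

```python
# Runnable sketch of the sharey suggestion: both panels share y-limits,
# so the before/after comparison is fair.  Data and titles are placeholders.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(4 * X) + X

fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])              # route subsequent pyplot calls to the left axes
plt.plot(X, y, 'o', c='k')
plt.title("Result before discretization")
plt.sca(axes[1])              # ...and now to the right axes
plt.plot(X, y, 'o', c='k')
plt.title("Result after discretization")
```

`plt.sca` keeps the example in pyplot style while still benefiting from the shared axis that `plt.subplots(..., sharey=True)` sets up.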
@TomDLT Thanks a lot for your great help :)
Hmmm, that is a bit weird. I would expect that checkout_merge_commit.sh would take care of that...
ping @TomDLT Thanks for your instructions. Comments addressed :) Result from Circle page
ping @jnothman I think it's ready for review. Thanks :)
ping @amueller The example is based on your book "Introduction to machine learning with python" (Chapter 4, Section 2, Binning & Discretization), using scikit-learn's latest API KBinsDiscretizer. Would be grateful if you could take some time to have a look (I've put your name at the beginning). Thanks :)
jnothman left a comment
I've not yet looked at the code...
```
The example compares prediction result of linear regression (linear model)
and decision tree (tree based model) before and after discretization.
```
Before and after -> with and without
```
before discretization, linear model become much more flexible while decision
tree gets much less flexible. Note that binning features generally has no
beneficial effect for tree-based models, as these models can learn to split
up the data anywhere.
```
Note that the linear model is fast to build and relatively straightforward to interpret.
And yes, I find this a much more compelling argument than what we had before with iris.
jnothman left a comment
Please mention one-hot encoding.
Also note that if the bins are not reasonably wide, there would appear to be a substantially increased risk of overfitting, so the discretiser parameters need tuning under cv.
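For reference, "discretize and one-hot encode" can be done by KBinsDiscretizer itself via its `encode` option (API as released in scikit-learn 0.20; the toy values below are made up):

```python
# One-hot encoding of bin ids with KBinsDiscretizer -- a sketch, not the
# PR's code.  encode='onehot-dense' returns a dense indicator matrix.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.5], [-0.5], [0.5], [2.5]])  # made-up 1d feature
enc = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')
Xt = enc.fit_transform(X)
# each row of Xt is an indicator vector for the bin its sample fell into
```

With `strategy='uniform'` and four bins over [-2.5, 2.5], each of the four samples lands in a different bin, so `Xt` is a 4x4 indicator matrix.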
@jnothman Comments addressed. Thanks a lot for the instant review :)
```
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters need to be tuned
```
Need -> should usually
seeing as you don't do that tuning here
```python
# predict with original dataset
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
```
Do we use sca much in other examples? It seems a bit unconventional. Then again, I can see how using axes methods together with pyplot.subplots might be seen as inconsistent
@jnothman Thanks :) Comments addressed.
In fact, the value of n_bins here (10) is among the best choices from cv (if we grid-search on a pipeline of KBinsDiscretizer + LinearRegression). So I think putting the value in directly is consistent with this statement and might make the example easier to go through, especially considering that we now have another example which shows the detailed tuning process.
I don't find plt.sca under the examples folder. I have followed the matplotlib examples and used a more common way.
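The grid search described here can be sketched as follows. The data generation mirrors the example, but the grid values and `cv=5` are assumptions, not the PR's actual settings:

```python
# Sketch of tuning n_bins by cross-validation on a
# KBinsDiscretizer + LinearRegression pipeline (grid values are assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y = np.sin(4 * X).ravel() + X.ravel() + rnd.normal(size=100) / 3

pipe = make_pipeline(
    KBinsDiscretizer(encode='onehot'),  # bin, then one-hot encode the bin ids
    LinearRegression(),
)
search = GridSearchCV(
    pipe,
    param_grid={'kbinsdiscretizer__n_bins': [2, 5, 10, 20]},
    cv=5,
)
search.fit(X, y)
best_bins = search.best_params_['kbinsdiscretizer__n_bins']
```

Putting the discretizer inside the pipeline is the important part: the bin edges are re-learned on each training fold, so the cross-validation scores are not leaked.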
```
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
be tuned under cv.
```
Merging, given the approvals from jnothman and TomDLT.


Reference Issues/PRs
Fixes #9339
What does this implement/fix? Explain your changes.
Add an example for KBinsDiscretizer
reference: "Introduction to machine learning with python" (Chapter 4 section 2)
Any other comments?
local result
