[MRG+2] discrete branch: add an example for KBinsDiscretizer #10192
qinhanmin2014 merged 10 commits into scikit-learn:discrete from qinhanmin2014:discrete-example
Conversation
ping @jnothman Could you help me diagnose the Circle failure? Do we need to merge master into discrete again? Thanks a lot.
TomDLT left a comment
> Do we need to merge master into discrete again?
Done, Circle is not failing anymore.
```python
         linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
bins = enc.offset_[0] + enc.bin_width_[0] * np.arange(1, enc.n_bins_[0])
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
```
To have automatic ymin, ymax, you can use

```python
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
```

```python
# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y_no_noise = (np.sin(4 * X) + X)
```
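The automatic-ylim suggestion above can be sketched end to end. The data below mirrors the example's synthetic sine curve, but the bin edges are computed with `np.linspace` as a stand-in for the estimator's learned edges:

```python
# Sketch of the reviewer's vlines suggestion: take ymin/ymax from the axes
# instead of hard-coding -3 and 3.  Data and bin edges here are stand-ins.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(4 * X) + X + rnd.normal(size=100) / 3

plt.plot(X, y, 'o', c='k')
bins = np.linspace(-3, 3, 11)[1:-1]  # 9 interior edges of 10 uniform bins
# *plt.gca().get_ylim() unpacks the current y-limits as (ymin, ymax)
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
```

This way the vertical guide lines always span the scatter's actual range, with no magic numbers to update if the noise level changes.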
```python
plt.title("Result before discretization")
```

```python
# predict with transformed dataset
plt.subplot(122)
```
The comparison is clearer if the subplots have the same ylim (`sharey=True`). You can do:

```python
fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
...
plt.sca(axes[1])
...
```
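Fleshed out, the suggestion runs as below; the data, titles, and the use of `ncols=2` (matching the final example's side-by-side layout) are assumptions:

```python
# Runnable sketch of the sharey suggestion: both panels share y-limits,
# so the before/after comparison is fair.  Data and titles are placeholders.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(4 * X) + X

fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])              # route subsequent pyplot calls to the left axes
plt.plot(X, y, 'o', c='k')
plt.title("Result before discretization")
plt.sca(axes[1])              # ...and now to the right axes
plt.plot(X, y, 'o', c='k')
plt.title("Result after discretization")
```

`plt.sca` keeps the example in pyplot style while still benefiting from the shared axis that `plt.subplots(..., sharey=True)` sets up.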
@TomDLT Thanks a lot for your great help :)
Hmmm, that is a bit weird. I would expect that checkout_merge_commit.sh would take care of that...
ping @TomDLT Thanks for your instructions. Comments addressed :) Result from Circle page
ping @jnothman I think it's ready for review. Thanks :)
ping @amueller The example is based on your book "Introduction to machine learning with python" (Chapter 4, Section 2, Binning & Discretization), using scikit-learn's latest API KBinsDiscretizer. Would be grateful if you could take some time to have a look (I've put your name at the beginning). Thanks :)
jnothman left a comment
I've not yet looked at the code...
```
The example compares prediction result of linear regression (linear model)
and decision tree (tree based model) before and after discretization.
```
Before and after -> with and without
```
before discretization, linear model become much more flexible while decision
tree gets much less flexible. Note that binning features generally has no
beneficial effect for tree-based models, as these models can learn to split
up the data anywhere.
```
Note that the linear model is fast to build and relatively straightforward to interpret.
And yes, I find this a much more compelling argument than what we had before with iris.
jnothman left a comment
Please mention one-hot encoding.
Also note that if the bins are not reasonably wide, there would appear to be a substantially increased risk of overfitting, so the discretiser parameters need tuning under cv.
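For reference, "discretize and one-hot encode" can be done by KBinsDiscretizer itself via its `encode` option (API as released in scikit-learn 0.20; the toy values below are made up):

```python
# One-hot encoding of bin ids with KBinsDiscretizer -- a sketch, not the
# PR's code.  encode='onehot-dense' returns a dense indicator matrix.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.5], [-0.5], [0.5], [2.5]])  # made-up 1d feature
enc = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')
Xt = enc.fit_transform(X)
# each row of Xt is an indicator vector for the bin its sample fell into
```

With `strategy='uniform'` and four bins over [-2.5, 2.5], each of the four samples lands in a different bin, so `Xt` is a 4x4 indicator matrix.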
@jnothman Comments addressed. Thanks a lot for the instant review :)
```
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters need to be tuned
```
Need -> should usually
seeing as you don't do that tuning here
```python
# predict with original dataset
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
```
Do we use sca much in other examples? It seems a bit unconventional. Then again, I can see how using axes methods together with pyplot.subplots might be seen as inconsistent
@jnothman Thanks :) Comments addressed.
In fact, the value of n_bins here (10) is among the best choices from cv (if we grid-search on a pipeline of KBinsDiscretizer + LinearRegression). So I think putting the value in directly is consistent with this statement and might make the example easier to go through, especially considering that we now have another example which shows the detailed tuning process.
I don't find plt.sca under the examples folder. I have followed the matplotlib examples and used a more common way.
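The grid search described here can be sketched as follows. The data generation mirrors the example, but the grid values and `cv=5` are assumptions, not the PR's actual settings:

```python
# Sketch of tuning n_bins by cross-validation on a
# KBinsDiscretizer + LinearRegression pipeline (grid values are assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y = np.sin(4 * X).ravel() + X.ravel() + rnd.normal(size=100) / 3

pipe = make_pipeline(
    KBinsDiscretizer(encode='onehot'),  # bin, then one-hot encode the bin ids
    LinearRegression(),
)
search = GridSearchCV(
    pipe,
    param_grid={'kbinsdiscretizer__n_bins': [2, 5, 10, 20]},
    cv=5,
)
search.fit(X, y)
best_bins = search.best_params_['kbinsdiscretizer__n_bins']
```

Putting the discretizer inside the pipeline is the important part: the bin edges are re-learned on each training fold, so the cross-validation scores are not leaked.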
```
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
be tuned under cv.
```
Merging, given the approvals from jnothman and TomDLT.


Reference Issues/PRs
Fixes #9339
What does this implement/fix? Explain your changes.
Add an example for KBinsDiscretizer
reference: "Introduction to machine learning with python" (Chapter 4 section 2)
Any other comments?
local result
