[MRG+1] discrete branch: add a second example for KBinsDiscretizer #10195
jnothman merged 3 commits into scikit-learn:discrete
Conversation
force-pushed 767d777 to ffd97a3
Looks cool. I'll have a look at it.
```diff
 orig_bins = self.n_bins
 if isinstance(orig_bins, numbers.Number):
-    if not isinstance(orig_bins, np.int):
+    if not isinstance(orig_bins, (np.int, np.integer)):
```
At a glance, this seems inconsistent with #10017. It seems we are going to use (numbers.Integral, np.integer)?
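For illustration (my own sketch, not from the PR): `np.int` is just an alias for the builtin `int`, so on Python 3 it does not match numpy integer scalars such as the values produced by `np.arange`, while the broader tuple does:

```python
import numbers
import numpy as np

x = np.int64(3)  # a numpy integer scalar, e.g. np.arange(2, 10)[1]

print(isinstance(x, int))                             # False on Python 3
print(isinstance(x, (numbers.Integral, np.integer)))  # True
```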
qinhanmin2014 left a comment
learned a lot from this great example :)
```python
name = estimator.__class__.__name__
if name == 'Pipeline':
    name = [get_name(est[1]) for est in estimator.steps]
    name = '\n'.join(name)
```
This is used both in the plot and in the output message, but the '\n' will make the output message look strange. See the rendered page.
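One possible fix (a sketch of mine, not what the PR does): parameterize the separator so the plot keeps its multi-line label while printed messages get a flat one:

```python
def get_name(estimator, sep='\n'):
    # join pipeline step names with a configurable separator
    name = estimator.__class__.__name__
    if name == 'Pipeline':
        name = sep.join(get_name(step, sep) for _, step in estimator.steps)
    return name

# get_name(clf) for the plot label; get_name(clf, ' + ') for messages
```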
```python
    KBinsDiscretizer(encode='onehot'), LinearSVC(random_state=0)), {
        'kbinsdiscretizer__n_bins': np.arange(2, 10),
        'linearsvc__C': np.logspace(-2, 7, 10),
    }),
```
Why aren't there KBinsDiscretizer + GradientBoostingClassifier and KBinsDiscretizer + SVC? Though it seems that KBinsDiscretizer + GradientBoostingClassifier somehow performs better than GradientBoostingClassifier on the first dataset.
Because they are non-linear classifiers, and the idea is to show that a linear classifier with this transformer can perform as well as a non-linear classifier, isn't it?
Thanks :) Then I suppose it might be better to state that explicitly in the example; it seems a bit confusing without your explanation, at least from my side.
```python
clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(ds_cnt, name, score)
```
The message seems too simple from my side; maybe it's better to add something? (e.g., add 'dataset' before ds_cnt, so it reads 'dataset 1', and 'score:' before score, so it reads 'score: 0.88')
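A minimal sketch of the suggested message (the exact wording is my assumption):

```python
# e.g. "dataset 1: KBinsDiscretizer + LinearSVC, score: 0.88"
print('dataset %d: %s, score: %.2f'
      % (ds_cnt, name.replace('\n', ' + '), score))
```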
glemaitre left a comment
Actually I like it as it is. My only small remark would be that I would find it easier if the scoring metric were specified in the title or in the legend.
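For instance (hypothetical, the merged example may do this differently), the metric could be named where the score is drawn:

```python
# label the score with its metric instead of drawing a bare number
ax.text(0.95, 0.06, 'acc: %.2f' % score,
        transform=ax.transAxes, ha='right', size=15)
```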
```python
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
```
Can you avoid using \ and instead break the line inside the parentheses?
Even if I see that it was the same in the original example ;)
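Something like this, presumably (the split arguments are illustrative):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)
```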
```
A demonstration of feature discretization on synthetic classification datasets.
Feature discretization decomposes each feature into a set of bins, here
equally distributed in width. The discrete values are then one-hot encoded,
and given to a linear classifier. On the two non-linearly separable datasets,
```
Hold the reader's hand a bit more: the first two rows represent linearly non-separable datasets (moons and concentric circles) while the third is approximately linearly separable.
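To make the mechanism concrete, a toy illustration using the released KBinsDiscretizer API (parameter names may differ slightly on this branch):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [0.4], [1.1], [2.2], [3.0]])

# three equal-width bins over [0, 3], one-hot encoded as a dense array
enc = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
print(enc.fit_transform(X))
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]]
```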
```
linearly.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
```
You either need to describe here or title the plot to show the groupings: linear classifiers, non-linear classifiers, linear classifiers with discretized input.
Perhaps the discretized classifiers should come before the nonlinear ones to emphasise the difference between the nonlinear group and the linear group.
force-pushed e6e5972 to 57e4e94
```diff
 orig_bins = self.n_bins
 if isinstance(orig_bins, numbers.Number):
-    if not isinstance(orig_bins, np.int):
+    if not isinstance(orig_bins, (numbers.Integral, np.integer)):
```
Just note that it will eventually be changed to something like SCALAR_INTEGER_TYPES in #10017.
```python
    assert_array_equal(expected, est.transform(X))


def test_valid_n_bins():
```
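The diff only shows the function header; presumably the test checks that numpy integer scalars are accepted for `n_bins`. A sketch of what it plausibly verifies (data and assertions are my guess):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_valid_n_bins():
    X = [[0.0], [0.5], [1.0], [1.5]]
    # n_bins as a builtin int and as a numpy integer scalar should both work
    KBinsDiscretizer(n_bins=2).fit(X)
    KBinsDiscretizer(n_bins=np.array([2])[0]).fit(X)
```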
I might prefer to remove the test:
(1) KBinsDiscretizer is now a PR (not in master), so this can be considered an improvement inside the PR.
(2) We do not add many tests in #10017. I don't think it deserves a test.
(3) It does not seem to be a regression test?

```python
isinstance(2, np.int)                 # True
isinstance(np.array([2])[0], np.int)  # True
```
(1) This is a PR to the discrete branch, which will end up in the PR from discrete to master.
(2) I would be in favor of this test, even though we may not need something systematic in #10017.
(3) For me:

```python
np.__version__ == '1.13.1'
isinstance(np.array([2])[0], np.int)  # False
```
My previous script ran under Python 2.7.12 and numpy 1.13.3, and I can reproduce yours under Python 3.5.4 and numpy 1.13.1. So I think the reason for our difference is the numpy version.
Let's keep the test if you think it is appropriate :) LGTM for the great example.
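For what it's worth, the Python version likely matters here as much as numpy: `np.int` is an alias for the builtin `int`, and numpy integer scalars subclass `int` on Python 2 but not on Python 3 (my reading; not confirmed in the thread):

```python
import numpy as np

x = np.array([2])[0]  # np.int64 on most 64-bit platforms
# numpy scalars subclass the builtin int only on Python 2, hence the
# opposite isinstance(x, np.int) results reported above
print(isinstance(x, int))  # True on Python 2, False on Python 3
```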
```
features, which easily lead to overfitting when the number of samples is small.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
```
I think precede "test set" with "held out"
Okay. This is clear from context.
|
Merging. Thanks for the nice plot!
Reference Issues/PRs
Complementary to #10192, which also tackles issue #9339. It is not intended to replace it, but rather to supplement it.
What does this implement/fix? Explain your changes.
Add another example for KBinsDiscretizer.
Any other comments?
Local result:
![local result plot](https://user-images.githubusercontent.com/38961811-837f4bc6-4322-11e8-9ce1-d29ec7198148.png)