[MRG+1] discrete branch: add a second example for KBinsDiscretizer #10195
jnothman merged 3 commits into scikit-learn:discrete
Conversation
force-pushed 767d777 to ffd97a3
Looks cool. I'll have a look at it.
```diff
 orig_bins = self.n_bins
 if isinstance(orig_bins, numbers.Number):
-    if not isinstance(orig_bins, np.int):
+    if not isinstance(orig_bins, (np.int, np.integer)):
```
At a glance, this seems inconsistent with #10017. It seems we are going to use (numbers.Integral, np.integer)?
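For illustration (my own sketch, not from the PR): `np.int` is just an alias for the builtin `int`, so on Python 3 it does not match numpy integer scalars such as the values produced by `np.arange`, while the broader tuple does:

```python
import numbers
import numpy as np

x = np.int64(3)  # a numpy integer scalar, e.g. np.arange(2, 10)[1]

print(isinstance(x, int))                             # False on Python 3
print(isinstance(x, (numbers.Integral, np.integer)))  # True
```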
qinhanmin2014 left a comment
learned a lot from this great example :)
```python
name = estimator.__class__.__name__
if name == 'Pipeline':
    name = [get_name(est[1]) for est in estimator.steps]
    name = '\n'.join(name)
```
This is used both in the plot and in the output message, but the '\n' will make the output message look strange. See the rendered page.
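One possible fix (a sketch of mine, not what the PR does): parameterize the separator so the plot keeps its multi-line label while printed messages get a flat one:

```python
def get_name(estimator, sep='\n'):
    # join pipeline step names with a configurable separator
    name = estimator.__class__.__name__
    if name == 'Pipeline':
        name = sep.join(get_name(step, sep) for _, step in estimator.steps)
    return name

# get_name(clf) for the plot label; get_name(clf, ' + ') for messages
```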
```python
    KBinsDiscretizer(encode='onehot'), LinearSVC(random_state=0)), {
        'kbinsdiscretizer__n_bins': np.arange(2, 10),
        'linearsvc__C': np.logspace(-2, 7, 10),
    }),
```
Why aren't there KBinsDiscretizer + GradientBoostingClassifier and KBinsDiscretizer + SVC? Though it seems that KBinsDiscretizer + GradientBoostingClassifier somehow performs better than GradientBoostingClassifier on the first dataset.
Because they are non-linear classifiers, and the idea is to show that a linear classifier with this transformer can perform as well as a non-linear classifier, isn't it?
Thanks :) Then I suppose it might be better to state that explicitly in the example; it seems a bit confusing without your explanation, at least from my side.
```python
clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(ds_cnt, name, score)
```
The message seems too simple from my side; maybe it's better to add something? (e.g., add 'dataset' before ds_cnt, so it reads 'dataset 1', and 'score:' before score, so it reads 'score: 0.88')
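A minimal sketch of the suggested message (the exact wording is my assumption):

```python
# e.g. "dataset 1: KBinsDiscretizer + LinearSVC, score: 0.88"
print('dataset %d: %s, score: %.2f'
      % (ds_cnt, name.replace('\n', ' + '), score))
```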
glemaitre left a comment
Actually I like it as it is. My only small remark would be that I would find it easier if the scoring metric were specified in the title or in the legend.
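For instance (hypothetical, the merged example may do this differently), the metric could be named where the score is drawn:

```python
# label the score with its metric instead of drawing a bare number
ax.text(0.95, 0.06, 'acc: %.2f' % score,
        transform=ax.transAxes, ha='right', size=15)
```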
```python
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
```
Can you avoid using \ and instead break the line inside the parentheses?
Even if I see that it was the same in the original example ;)
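Something like this, presumably (the split arguments are illustrative):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)
```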
```
A demonstration of feature discretization on synthetic classification datasets.
Feature discretization decomposes each feature into a set of bins, here
equally distributed in width. The discrete values are then one-hot encoded,
and given to a linear classifier. On the two non-linearly separable datasets,
```
Hold the reader's hand a bit more: the first two rows represent linearly non-separable datasets (moons and concentric circles) while the third is approximately linearly separable.
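To make the mechanism concrete, a toy illustration using the released KBinsDiscretizer API (parameter names may differ slightly on this branch):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [0.4], [1.1], [2.2], [3.0]])

# three equal-width bins over [0, 3], one-hot encoded as a dense array
enc = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
print(enc.fit_transform(X))
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]]
```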
```
linearly.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
```
You either need to describe here or title the plot to show the groupings: linear classifiers, non-linear classifiers, linear classifiers with discretized input.
Perhaps the discretized classifiers should come before the nonlinear ones to emphasise the difference between the nonlinear group and the linear group.
force-pushed e6e5972 to 57e4e94
```diff
 orig_bins = self.n_bins
 if isinstance(orig_bins, numbers.Number):
-    if not isinstance(orig_bins, np.int):
+    if not isinstance(orig_bins, (numbers.Integral, np.integer)):
```
Just note that it will eventually be changed to something like SCALAR_INTEGER_TYPES in #10017.
```python
    assert_array_equal(expected, est.transform(X))


def test_valid_n_bins():
```
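The diff only shows the function header; presumably the test checks that numpy integer scalars are accepted for `n_bins`. A sketch of what it plausibly verifies (data and assertions are my guess):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_valid_n_bins():
    X = [[0.0], [0.5], [1.0], [1.5]]
    # n_bins as a builtin int and as a numpy integer scalar should both work
    KBinsDiscretizer(n_bins=2).fit(X)
    KBinsDiscretizer(n_bins=np.array([2])[0]).fit(X)
```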
I might prefer to remove the test:
(1) KBinsDiscretizer is now a PR (not in master), so this can be considered an improvement inside the PR.
(2) We do not add many tests in #10017. I don't think it deserves a test.
(3) It does not seem to be a regression test?

```python
isinstance(2, np.int)                 # True
isinstance(np.array([2])[0], np.int)  # True
```
(1) This is a PR to the discrete branch, which will end up in the PR from discrete to master.
(2) I would be in favor of this test, even though we may not need something systematic in #10017.
(3) For me:

```python
np.__version__ == '1.13.1'
isinstance(np.array([2])[0], np.int)  # False
```
My previous script ran under Python 2.7.12 and numpy 1.13.3, and I can reproduce yours under Python 3.5.4 and numpy 1.13.1. So I think the reason for our difference is the numpy version.
Let's keep the test if you think it is appropriate :) LGTM for the great example.
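For what it's worth, the Python version likely matters here as much as numpy: `np.int` is an alias for the builtin `int`, and numpy integer scalars subclass `int` on Python 2 but not on Python 3 (my reading; not confirmed in the thread):

```python
import numpy as np

x = np.array([2])[0]  # np.int64 on most 64-bit platforms
# numpy scalars subclass the builtin int only on Python 2, hence the
# opposite isinstance(x, np.int) results reported above
print(isinstance(x, int))  # True on Python 2, False on Python 3
```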
```
features, which easily lead to overfitting when the number of samples is small.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
```
I think precede "test set" with "held out"
Okay. This is clear from context.
|
Merging. Thanks for the nice plot!
Reference Issues/PRs
Complementary to #10192, which also tackles issue #9339. It is not intended to replace it, but rather to supplement it.
What does this implement/fix? Explain your changes.
Add another example for KBinsDiscretizer.
Any other comments?
Local result:
![local result plot](https://user-images.githubusercontent.com/38961811-837f4bc6-4322-11e8-9ce1-d29ec7198148.png)