[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (discrete branch) #11272
jnothman merged 13 commits into scikit-learn:discrete
Conversation
Nice!
jnothman left a comment:

I think we should treat this as if it'll be released with KBinsDiscretizer. We should set the default strategy to 'quantile'. We should consider API changes to make this more consistent. We should update the what's new message rather than adding one.

I was a bit surprised to see you've only added, and not modified, tests. But the only thing I'm missing there is checking the consistency of 1-column and multi-column transformations.
doc/modules/preprocessing.rst (outdated)

    However, histograms focus on counting features which fall into particular
    bins, whereas discretization focuses on assigning feature values to these bins.

    :class:`KBinsDiscretizer` implement different binning strategies, which can be

    bin_width_ : float array, shape (n_features,)
    bin_width_ : array, shape (n_features,)
        Contain floats with 'uniform' strategy, and arrays of varying shapes

Do we now want to store widths, or edges? Do we lose a lot by storing edges in all cases?
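The widths-versus-edges question has a cheap answer in code: edges subsume widths. A minimal sketch with hypothetical edge values (not taken from the PR):

```python
import numpy as np

# Widths are one np.diff away from edges, while edges cannot be recovered
# from widths alone (the left boundary is lost). Hypothetical values for
# one feature with n_bins = 3, i.e. n_bins + 1 edges.
edges = np.array([0.0, 1.5, 4.0, 10.0])
widths = np.diff(edges)
```

So storing `bin_edges_` in all cases loses nothing, at the cost of one extra value per feature.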
    kmeans
        Widths are defined by a k-means on each features.

    random_state : int, RandomState instance or None (default)

Seeing as we're doing it in 1d, can't we just initialise kmeans at the midpoints of a uniform split, and make it deterministic?
    n_clusters = self.n_bins_[jj]
    km = KMeans(n_clusters=n_clusters,
                random_state=self.random_state).fit(col)
    centers = np.sort(km.cluster_centers_[:, 0],

I don't think we'd need this sort if we initialised with linspace.
    if self.strategy == 'quantile':
        n_quantiles = self.n_bins_[jj] + 1
        qt = QuantileTransformer(n_quantiles=n_quantiles).fit(col)

This seems overkill. Just use boundaries = np.percentile(col, n_quantiles + 1).
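One caveat on the suggestion: `np.percentile` takes percentile values in [0, 100], not a count, so the drop-in replacement looks more like this sketch (toy data, not from the PR):

```python
import numpy as np

col = np.array([0., 1., 2., 3., 9., 10.])
n_bins = 3

# n_bins + 1 evenly spaced percentile values from 0 to 100; the resulting
# edges put roughly the same number of samples in each bin.
quantiles = np.linspace(0, 100, n_bins + 1)
edges = np.percentile(col, quantiles)
```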
    valid_strategy = ('uniform', 'quantile', 'kmeans')
    if self.strategy not in valid_strategy:
        raise ValueError("Valid options for 'strategy' are {}. "
                         "Got 'strategy = {}' instead."
    def test_nonuniform_strategies():
        X = np.array([0, 1, 2, 3, 9, 10]).reshape(-1, 1)

        # with 2 bins

Please also test the size of bin_widths_.
Thanks for the review.
And I suppose we're not worried about a slow-down in the uniform case?

Good point, if you think it's important, I can add a branch in
jnothman left a comment:

Great work. All nitpicks. LGTM!
    X /= bin_width
    bin_edges = self.bin_edges_[trans]
    for jj in range(X.shape[1]):
        # Values which are a multiple of the bin width are susceptible to

"a multiple of the bin width" -> "close to the bin edge"
    for jj in range(X.shape[1]):
        # Values which are a multiple of the bin width are susceptible to
        # numeric instability. Add eps to X so these values are binned
        # correctly. See documentation for numpy.isclose for an explanation

Maybe "correctly" -> "correctly with respect to their decimal truncation"?
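The instability in question is ordinary floating-point division; a small standalone illustration (values chosen to trigger it, not taken from the PR):

```python
import numpy as np

# 0.3 sits exactly on a bin edge for width 0.1, but 0.3 / 0.1 evaluates
# to 2.9999999999999996 in float64, so plain truncation lands one bin low.
bin_width = 0.1
x = 0.3
naive = np.floor(x / bin_width)            # bin 2 instead of bin 3

# Adding a tiny eps before truncation (in the spirit of np.isclose's
# tolerance) nudges edge values onto the correct side.
eps = 1e-8
corrected = np.floor(x / bin_width + eps)  # bin 3, as intended
```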
    edges = np.r_[col.min(), edges, col.max()]

    if edges[0] == edges[-1] and self.n_bins_[jj] > 2:
        warnings.warn("Features %d is constant and will be "
        Widths are defined by a quantile transform, to have a uniform
        number of samples in each bin.
    kmeans
        Widths are defined by a k-means on each features.

features -> feature
Perhaps: "Values in each bin have the same nearest center of a k-means cluster."
    uniform
        All bins in each feature have identical widths.
    quantile
        Widths are defined by a quantile transform, to have a uniform

"All bins have the same number of points"

I don't yet have reason to believe it's important... But yes, we could consider that now or in the future.
I think once this is merged, we should move the discretizer to

Would @qinhanmin2014 like to give this a second look?

I promise to give a review next weekend, and a second review from someone else before next weekend is welcomed. Sorry for the delay but I'm surrounded by school work at the end of the semester.

That's fine! Just thought you might be a good person to give it a go if available.
qinhanmin2014 left a comment:

The code LGTM. Will look at the examples and tests in a couple of hours.
    edges = np.asarray(np.percentile(col, quantiles))

    elif self.strategy == 'kmeans':
        from ..cluster import KMeans

Any specific reason to import here?

Import loop problem, but it doesn't seem to happen anymore, so I moved it to the top.
    >>> est.transform(X)  # doctest: +SKIP
    array([[ 0., 2., 1.],
    array([[ 0., 1., 1.],

If I understand correctly, the result changes because we're now using strategy='quantile', right?
If so, I think we need to modify the intervals above, otherwise they're not consistent:

- feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
- feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 13), [13, \infty)}`
    from ..utils.fixes import np_version

    class KBinsDiscretizer(BaseEstimator, TransformerMixin):

Need to update the doc below: "Bins continuous data into k equal width intervals." is no longer the case, I think.
Without looking at the code, 'auto' in so if we want a better default, it's not hard to implement...
qinhanmin2014 left a comment:

LGTM. I'm much more confident to merge the discrete branch after this PR.
    assert_array_equal(expected_3bins, Xt.ravel())

    def test_kmeans_strategy():

Could you explain the purpose of the test (maybe a comment)? It seems a bit strange (though I understand how it passes) and honestly seems redundant from my side.

It is indeed redundant, so I removed it.
    ==========================================================

    This example presents the different strategies implemented in KBinsDiscretizer:
    - 'uniform': The discretization is uniform in each feature, which means that

Formatting issue according to Circle. I guess a blank line will solve the problem.
Further regarding default

Thanks for the review! I changed the default from
    bin_edges[jj] = np.asarray(np.percentile(column, quantiles))

    elif self.strategy == 'kmeans':
        from ..cluster import KMeans  # fixes import loops

So what's the final decision here? You said in #11272 (comment) that you'd move the import to the top, but you actually left it here.

Please do leave it here. I don't think importing from sklearn.preprocessing should automatically result in importing DBSCAN, etc. Better off keeping the loading as light as possible.
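Putting this thread together, a standalone sketch of the kmeans edge computation with the local import the reviewers settled on (public `sklearn` paths instead of the relative imports in the PR; helper name is mine):

```python
import numpy as np

def kmeans_bin_edges(col, n_bins):
    # Local import, per the review: importing sklearn.preprocessing
    # should not eagerly load the whole clustering subpackage.
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=n_bins, n_init=10).fit(col.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_[:, 0])

    # Bin edges sit midway between adjacent centers, padded by the
    # data range, mirroring the lines quoted below.
    inner = (centers[1:] + centers[:-1]) * 0.5
    return np.r_[col.min(), inner, col.max()]

edges = kmeans_bin_edges(np.array([0., 1., 2., 3., 9., 10.]), 3)
```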
    bin_edges[jj] = (centers[1:] + centers[:-1]) * 0.5
    bin_edges[jj] = np.r_[col_min, bin_edges[jj], col_max]

    if bin_edges[jj][0] == bin_edges[jj][-1] and n_bins[jj] > 2:

I'm fine with handling edge cases, but:
(1) Why don't you handle situations when n_bins[jj] == 2?
(2) It seems strange that we reject n_bins = 1 and manually set n_bins_ = 1 in edge cases. Will it be reasonable to accept n_bins = 1?

It's not useful/meaningful to accept n_bins = 1...

Fine, then changing n_bins[jj] > 2 to n_bins[jj] >= 2 seems enough from my side, or sorry if I'm wrong.

Also, maybe we need a test for the edge case.
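The edge-case test asked for here could look like the following sketch, written against the public KBinsDiscretizer API (the exact warning text is deliberately not asserted):

```python
import warnings
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_constant_feature_collapses_to_one_bin():
    # A constant column: every candidate edge coincides, so the feature
    # should fall back to a single bin regardless of strategy.
    X = np.full((5, 1), 7.0)
    for strategy in ('uniform', 'quantile', 'kmeans'):
        est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                               strategy=strategy)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # constant-feature warning
            Xt = est.fit_transform(X)
        assert np.all(Xt == 0)

test_constant_feature_collapses_to_one_bin()
```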
doc/modules/preprocessing.rst (outdated)

    >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2], encode='ordinal').fit(X)
    >>> est.bin_width_
    array([3., 1., 2.])
    >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 4, 2], encode='ordinal').fit(X)

Is it good to have n_bins > number of samples in the example? I might still prefer n_bins=[3, 3, 2].
    def test_same_min_max():
    @pytest.mark.parametrize('strategy', ['uniform', 'kmeans', 'quantile'])
    @pytest.mark.parametrize('n_bins', [2, 3])
    def test_same_min_max(strategy, n_bins):

Do we test n_bins_[0] = 1 somewhere? Or do you think we don't need such a test?
Also, maybe we don't need to test multiple n_bins here?

n_bins = 1 is tested in arrays (test_invalid_n_bins_array) and as an integer (test_invalid_n_bins).

Oh, I misread your comment; I'll add a line to test n_bins_.
qinhanmin2014 left a comment:

LGTM if CIs are green. ping @jnothman for a final check.
Fixes #9338.

This PR implements two non-uniform strategies for KBinsDiscretizer (in discrete branch #9342), adding a parameter strategy:

- KBinsDiscretizer(strategy='uniform'): previous behavior, constant-width bins.
- KBinsDiscretizer(strategy='quantile'): new behavior (default), based on np.percentile.
- KBinsDiscretizer(strategy='kmeans'): new behavior, based on KMeans.

It adds the following example:
