[MRG] ENH simplify transform for uniform output in QuantileTransformer #12827
glemaitre merged 8 commits into scikit-learn:master from
Conversation
| # for inverse transform, match a uniform distribution | ||
| with np.errstate(invalid='ignore'): # hide NaN comparison warnings | ||
| X_col = output_distribution.cdf(X_col) | ||
| if output_distribution == 'normal': |
how about we use != 'uniform' to make this a bit more future-ready.
and still use getattr(stats, output_distribution) as well?
Yeah, okay... I'm not too fussed.
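The suggestion above can be sketched as follows (a hypothetical helper, not the actual data.py code; the `'normal'` -> `'norm'` alias table is my assumption, since scipy.stats exposes the normal distribution as `stats.norm`):

```python
import numpy as np
from scipy import stats

# Hypothetical alias table: sklearn's 'normal' maps to scipy's 'norm'.
_SCIPY_ALIASES = {'normal': 'norm'}

def match_output_distribution(X_col, output_distribution):
    """Map uniformly distributed values onto the requested distribution.

    Branching on != 'uniform' keeps the code future-ready: any new
    distribution only needs a matching scipy.stats name.
    """
    if output_distribution != 'uniform':
        dist = getattr(stats, _SCIPY_ALIASES.get(output_distribution,
                                                 output_distribution))
        with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
            X_col = dist.ppf(X_col)
    return X_col
```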
|
Ok I just fixed a typo in the doc. Thanks for the reviews @jnothman and @peterkinalex. |
| # find the value to clip the data to avoid mapping to | ||
| # infinity. Clip such that the inverse transform will be | ||
| # consistent | ||
| clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1)) |
Maybe we should clip the data when output_distribution == 'uniform' to avoid backward incompatibility and to keep inverse_transform consistent with transform? though the difference seems small (1e-7)
(Actually I agree that we don't need to clip when output_distribution == 'uniform', but if we decide to do so, we'll need to update inverse_transform accordingly)
In fact I think that for the uniform distribution the clipping is done at L2246:
X_col[upper_bounds_idx] = upper_bound_y
X_col[lower_bounds_idx] = lower_bound_y
which is done for both transform and inverse_transform.
But we're talking about different things right? @albertcthomas
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = [[1], [2], [3], [4]]
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)
Before your PR:
array([[9.99999998e-08],
[3.33333333e-01],
[6.66666667e-01],
[9.99999900e-01]])
After your PR:
array([[0. ],
[0.33333333],
[0.66666667],
[1. ]])
Although the new version might be more reasonable.
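For scale (my own check, not part of the thread): the old smallest value appears to be exactly the uniform clip bound `BOUNDS_THRESHOLD - np.spacing(1)`, so the behavioural change is below 1e-7:

```python
import numpy as np

BOUNDS_THRESHOLD = 1e-7  # module-level constant in sklearn/preprocessing/data.py

# The pre-PR output's smallest value matches the clip bound exactly.
old_min = BOUNDS_THRESHOLD - np.spacing(1)
print('%.8e' % old_min)      # 9.99999998e-08, as in the "before" output
print(old_min - 0.0 < 1e-7)  # True: the change is below 1e-7
```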
I'm even wondering whether we need clip_min and clip_max when output_distribution == 'normal'.
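For context on why the clip exists at all for 'normal' (my own illustration, assuming BOUNDS_THRESHOLD = 1e-7 as in data.py): the normal ppf maps the exact 0 and 1 quantiles to infinities, and clipping just inside [0, 1] keeps the output finite:

```python
import numpy as np
from scipy import stats

BOUNDS_THRESHOLD = 1e-7  # constant from sklearn/preprocessing/data.py

# Without clipping, the extreme quantiles map to infinity.
print(stats.norm.ppf(0.0))  # -inf
print(stats.norm.ppf(1.0))  # inf

# Clipping just inside [0, 1] keeps the transformed values finite.
clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1))
clip_max = stats.norm.ppf(1 - (BOUNDS_THRESHOLD - np.spacing(1)))
print(clip_min, clip_max)  # roughly -5.2 and 5.2
```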
the easiest solution IMO will be to avoid backward incompatibility (i.e., clip when output_distribution == 'uniform') and open an issue to discuss whether we need clip_min and clip_max.
Ah yes thanks for the clarification.
For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.
> For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.
Happy to regard it as a bug fix.
If so, we need to update inverse_transform accordingly (around L2222).
|
I should have some time to take your review into account in the next few weeks @qinhanmin2014. Thanks for the review and sorry for the delay. |
|
I think we should be good now. I also added a test that fails if we take BOUNDS_THRESHOLD into account for the uniform distribution. |
qinhanmin2014 left a comment
LGTM, I don't think we need a what's new entry since the difference is small, but feel free to add one if you want.
|
We need someone to double check this PR, ping @jnothman @agramfort |
|
Thanks @qinhanmin2014. It could be useful to have the opinion of @glemaitre or @ogrisel as they worked a lot on the original implementation. |
| upper_bounds_idx = (X_col + BOUNDS_THRESHOLD > | ||
| upper_bound_x) | ||
| if output_distribution == 'uniform': | ||
| lower_bounds_idx = (X_col == lower_bound_x) |
This is still a float comparison, isn't it?
| lower_bounds_idx = (X_col == lower_bound_x) | |
| lower_bounds_idx = np.isclose(X_col, lower_bound_x) |
The test that I added is failing with np.isclose. For transform, lower_bound_x is a float that is itself one element of X_col (the min). For inverse_transform, lower_bound_x is an int.
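A small illustration of the point (my own example, not the sklearn test): since the bound is taken from the array itself, exact equality matches it bit-for-bit, while np.isclose also flags neighbouring values within its tolerance:

```python
import numpy as np

# Hypothetical column where a second value sits within np.isclose's
# default tolerance (atol=1e-8) of the minimum.
X_col = np.array([1e-7, 1.05e-7, 0.5, 1.0])
lower_bound_x = X_col.min()  # the bound is itself an element of X_col

exact = (X_col == lower_bound_x)          # flags only the true minimum
close = np.isclose(X_col, lower_bound_x)  # also flags 1.05e-7

print(exact)  # [ True False False False]
print(close)  # [ True  True False False]
```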
Co-Authored-By: albertcthomas <albertthomas88@gmail.com>
|
Thanks @glemaitre! |
|
Thanks albert
|
…rmer (scikit-learn#12827)" This reverts commit ad918d4.


Reference Issues/PRs
Fixes #12775
What does this implement/fix? Explain your changes.
Any other comments?
Check whether we can use the samples instead of n_quantiles. I will try this on my side and open a separate PR if successful.