[MRG] ENH simplify transform for uniform output in QuantileTransformer #12827
glemaitre merged 8 commits into scikit-learn:master from
Conversation
| # for inverse transform, match a uniform distribution | ||
| with np.errstate(invalid='ignore'): # hide NaN comparison warnings | ||
| X_col = output_distribution.cdf(X_col) | ||
| if output_distribution == 'normal': |
how about we use != 'uniform' to make this a bit more future-ready.
and still use getattr(stats, output_distribution) as well?
Yeah, okay... I'm not too fussed.
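The suggestion above can be sketched as follows (a hypothetical helper, not the actual data.py code; the `'normal'` -> `'norm'` alias table is my assumption, since scipy.stats exposes the normal distribution as `stats.norm`):

```python
import numpy as np
from scipy import stats

# Hypothetical alias table: sklearn's 'normal' maps to scipy's 'norm'.
_SCIPY_ALIASES = {'normal': 'norm'}

def match_output_distribution(X_col, output_distribution):
    """Map uniformly distributed values onto the requested distribution.

    Branching on != 'uniform' keeps the code future-ready: any new
    distribution only needs a matching scipy.stats name.
    """
    if output_distribution != 'uniform':
        dist = getattr(stats, _SCIPY_ALIASES.get(output_distribution,
                                                 output_distribution))
        with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
            X_col = dist.ppf(X_col)
    return X_col
```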
|
Ok I just fixed a typo in the doc. Thanks for the reviews @jnothman and @peterkinalex. |
| # find the value to clip the data to avoid mapping to | ||
| # infinity. Clip such that the inverse transform will be | ||
| # consistent | ||
| clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1)) |
Maybe we should clip the data when output_distribution == 'uniform' to avoid backward incompatibility and to keep inverse_transform consistent with transform? though the difference seems small (1e-7)
(Actually I agree that we don't need to clip when output_distribution == 'uniform', but if we decide to do so, we'll need to update inverse_transform accordingly)
In fact I think that for the uniform distribution the clipping is done at L2246:
X_col[upper_bounds_idx] = upper_bound_y
X_col[lower_bounds_idx] = lower_bound_y
which is done for both transform and inverse_transform.
But we're talking about different things right? @albertcthomas
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = [[1], [2], [3], [4]]
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)
Before your PR:
array([[9.99999998e-08],
[3.33333333e-01],
[6.66666667e-01],
[9.99999900e-01]])
After your PR:
array([[0. ],
[0.33333333],
[0.66666667],
[1. ]])
Although the new version might be more reasonable.
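For scale (my own check, not part of the thread): the old smallest value appears to be exactly the uniform clip bound `BOUNDS_THRESHOLD - np.spacing(1)`, so the behavioural change is below 1e-7:

```python
import numpy as np

BOUNDS_THRESHOLD = 1e-7  # module-level constant in sklearn/preprocessing/data.py

# The pre-PR output's smallest value matches the clip bound exactly.
old_min = BOUNDS_THRESHOLD - np.spacing(1)
print('%.8e' % old_min)      # 9.99999998e-08, as in the "before" output
print(old_min - 0.0 < 1e-7)  # True: the change is below 1e-7
```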
I'm even wondering whether we need clip_min and clip_max when output_distribution == 'normal'.
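For context on why the clip exists at all for 'normal' (my own illustration, assuming BOUNDS_THRESHOLD = 1e-7 as in data.py): the normal ppf maps the exact 0 and 1 quantiles to infinities, and clipping just inside [0, 1] keeps the output finite:

```python
import numpy as np
from scipy import stats

BOUNDS_THRESHOLD = 1e-7  # constant from sklearn/preprocessing/data.py

# Without clipping, the extreme quantiles map to infinity.
print(stats.norm.ppf(0.0))  # -inf
print(stats.norm.ppf(1.0))  # inf

# Clipping just inside [0, 1] keeps the transformed values finite.
clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1))
clip_max = stats.norm.ppf(1 - (BOUNDS_THRESHOLD - np.spacing(1)))
print(clip_min, clip_max)  # roughly -5.2 and 5.2
```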
the easiest solution IMO will be to avoid backward incompatibility (i.e., clip when output_distribution == 'uniform') and open an issue to discuss whether we need clip_min and clip_max.
Ah yes thanks for the clarification.
For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.
> For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.
Happy to regard it as a bug fix.
If so, we need to update inverse_transform accordingly (around L2222).
|
I should have some time to take your review into account in the next few weeks @qinhanmin2014. Thanks for the review and sorry for the delay. |
|
I think we should be good now. I also added a test that fails if we take BOUNDS_THRESHOLD into account for the uniform distribution. |
qinhanmin2014 left a comment
LGTM, I don't think we need a what's new entry since the difference is small, but feel free to add one if you want.
|
We need someone to double check this PR, ping @jnothman @agramfort |
|
Thanks @qinhanmin2014. It could be useful to have the opinion of @glemaitre or @ogrisel as they worked a lot on the original implementation. |
| upper_bounds_idx = (X_col + BOUNDS_THRESHOLD > | ||
| upper_bound_x) | ||
| if output_distribution == 'uniform': | ||
| lower_bounds_idx = (X_col == lower_bound_x) |
This is still a float comparison, isn't it?
| lower_bounds_idx = (X_col == lower_bound_x) | |
| lower_bounds_idx = np.isclose(X_col, lower_bound_x) |
The test that I added is failing with np.isclose. For transform, lower_bound_x is a float that is itself one element of X_col (the min). For inverse_transform, lower_bound_x is an int.
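A small illustration of the point (my own example, not the sklearn test): since the bound is taken from the array itself, exact equality matches it bit-for-bit, while np.isclose also flags neighbouring values within its tolerance:

```python
import numpy as np

# Hypothetical column where a second value sits within np.isclose's
# default tolerance (atol=1e-8) of the minimum.
X_col = np.array([1e-7, 1.05e-7, 0.5, 1.0])
lower_bound_x = X_col.min()  # the bound is itself an element of X_col

exact = (X_col == lower_bound_x)          # flags only the true minimum
close = np.isclose(X_col, lower_bound_x)  # also flags 1.05e-7

print(exact)  # [ True False False False]
print(close)  # [ True  True False False]
```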
Co-Authored-By: albertcthomas <albertthomas88@gmail.com>
|
Thanks @glemaitre! |
|
Thanks albert
|
…rmer (scikit-learn#12827)" This reverts commit ad918d4.


Reference Issues/PRs
Fixes #12775
What does this implement/fix? Explain your changes.
Any other comments?
Check whether we can use the samples instead of n_quantiles. I will try this on my side and open a separate PR if successful.