[MRG] Add Yeo-Johnson transform to PowerTransformer by NicolasHug · Pull Request #11520 · scikit-learn/scikit-learn

NicolasHug · 2018-07-14T21:20:53Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.

PowerTransformer currently only supports Box-Cox, which only works for positive values; Yeo-Johnson works for the whole real line.

Original paper : link.

TODO:

Any other comments?

The lambda parameter estimation is a bit tricky ~~and currently does not work.~~ (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines; I tried to implement likelihood maximization with the Brent optimizer (just like for Box-Cox) but ran into overflow issues.
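The likelihood-based estimation described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code of this PR: a coarse grid search stands in for the Brent optimizer, the data are assumed non-constant, and the names `yeo_johnson`, `neg_log_likelihood`, and `estimate_lambda` as well as the search range are hypothetical.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single float (illustrative sketch)."""
    if x >= 0:
        if abs(lmbda) < 1e-12:  # lambda == 0
            return math.log1p(x)
        return (math.pow(x + 1, lmbda) - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:  # lambda == 2
        return -math.log1p(-x)
    return -(math.pow(-x + 1, 2 - lmbda) - 1) / (2 - lmbda)


def neg_log_likelihood(data, lmbda):
    """Negative log-likelihood of lambda, up to an additive constant."""
    n = len(data)
    psi = [yeo_johnson(v, lmbda) for v in data]
    # Estimated mean and variance of the normal distribution
    mu = math.fsum(psi) / n
    var = math.fsum((p - mu) ** 2 for p in psi) / n  # assumes var > 0
    # Jacobian term: (lambda - 1) * sum(sign(x) * log(|x| + 1))
    jac = (lmbda - 1) * math.fsum(
        math.copysign(1.0, v) * math.log1p(abs(v)) for v in data
    )
    return n / 2 * math.log(var) - jac


def estimate_lambda(data, lo=-2.0, hi=4.0, steps=600):
    """Grid search used here as a stand-in for a 1-D optimizer."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda lm: neg_log_likelihood(data, lm))


# Right-skewed toy data: the estimate should do at least as well as lambda = 1.
data = [math.exp(-2 + 0.1 * k) for k in range(41)]
best = estimate_lambda(data)
print(best, neg_log_likelihood(data, best) <= neg_log_likelihood(data, 1.0))
```

A real implementation would hand `neg_log_likelihood` to a bounded or bracketed scalar optimizer rather than scan a grid; the grid is used here only to keep the sketch dependency-free.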

The transform code seems to work though:

[Figure: Box-Cox and Yeo-Johnson transforms plotted for several values of lambda]

which is a reproduction of a figure from "Quantile regression via vector generalized additive models" by Thomas W. Yee.

Code for figure (hacky):


import numpy as np
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

yj = PowerTransformer(method='yeo-johnson', standardize=False)
bc = PowerTransformer(method='box-cox', standardize=False)

X = np.arange(-4, 4, .1).reshape(-1, 1)
fig, axes = plt.subplots(ncols=2)

for lmbda in (0, .5, 1, 1.5, 2):
    X_pos = X[X > 0].reshape(-1, 1)
    bc.fit(X_pos)
    bc.lambdas_ = [lmbda]
    X_trans = bc.transform(X_pos)
    axes[0].plot(X_pos, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[0].set_title('Box-Cox')

    yj.fit(X)
    yj.lambdas_ = [lmbda]
    X_trans = yj.transform(X)
    axes[1].plot(X, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[1].set_title('Yeo-Johnson')

for ax in axes:
    ax.set(xlim=[-4, 4], ylim=[-5, 5], aspect='equal')
    ax.legend()
    ax.grid()

plt.show()

The issue was from an error in the log likelihood function

NicolasHug · 2018-07-14T23:11:06Z

Lambda param estimation should be fixed now, thanks @amueller.

Replication of this example with Yeo-Johnson instead of Box-Cox:

Need to write inverse_transform to continue

amueller · 2018-07-15T14:56:11Z

sklearn/preprocessing/data.py

+                # get rid of them to compute them.
+                _, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
+                col_trans = boxcox(col, lmbda)
+            else:  # neo-johnson


Follow the white rabbit.

this took me a while.

amueller

We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?

amueller · 2018-07-15T14:56:42Z

sklearn/preprocessing/data.py

+        # when x >= 0
+        if lmbda < 1e-19:
+            out[pos] = np.log(x[pos] + 1)
+        else:  #lmbda != 0


space after #

amueller · 2018-07-15T14:57:36Z

sklearn/preprocessing/data.py

+            n = x.shape[0]
+
+            # Estimated mean and variance of the normal distribution
+            mu = psi.sum() / n


do we need from __future__ import division?

it's here already

thanks, was hard to see from the diff and I was lazy ;)

amueller · 2018-07-15T14:59:55Z

sklearn/preprocessing/tests/test_data.py


    assert_raise_message(ValueError, not_positive_message,
-                         power_transform, X_with_negatives)
+                         power_transform, X_with_negatives, 'box-cox')


why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?

I find it clearer and explicit?

I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK.

good point. We should discuss before the release. I think yeo-johnson would make more sense.

The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.

I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.

shall I change the default then?
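On the "generalization" remark: for non-negative inputs, Yeo-Johnson with parameter lambda coincides with Box-Cox applied to x + 1, and it extends the definition to negative inputs with a mirrored branch that uses 2 - lambda. A minimal sketch for single values, with illustrative function names that are not the helpers from this PR:

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single float; only defined for x > 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lmbda == 0:
        return math.log(x)
    return (math.pow(x, lmbda) - 1) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single float; defined for any real x."""
    if x >= 0:
        # Same as Box-Cox applied to x + 1.
        return box_cox(x + 1, lmbda)
    # Mirrored branch for negative inputs, with parameter 2 - lambda.
    return -box_cox(-x + 1, 2 - lmbda)


print(yeo_johnson(3.0, 0.5) == box_cox(4.0, 0.5))  # same value for x >= 0
print(yeo_johnson(-3.0, 0.5))  # defined, whereas box_cox(-3.0, 0.5) raises
```

So the two transforms do not agree on the same input (Yeo-Johnson on x matches Box-Cox on x + 1), which is one way to read "not strictly the case".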

Also added related test

ogrisel · 2018-07-15T22:30:23Z

sklearn/preprocessing/tests/test_data.py

+    pt = PowerTransformer(method=method, standardize=False)
+    pt.lambdas_ = [lmbda]
+    X_inv = pt.inverse_transform(X)
+    pt.lambdas_ = [9999]  # just to make sure


Why not:

del pt.lambdas_

Alternatively, create a new pt object from scratch to make the motivation of the test easier to read:

ground_truth_transform = PowerTransformer(method=method, standardize=False)
ground_truth_transform.lambdas_ = [lmbda]
X_inv = ground_truth_transform.inverse_transform(X)

estimated_transform = PowerTransformer(method=method, standardize=False)
X_inv_trans = estimated_transform.fit_transform(X_inv)

ogrisel · 2018-07-15T22:34:36Z

sklearn/preprocessing/tests/test_data.py

+    X_inv_trans = pt.fit_transform(X_inv)
+
+    assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples,
+                        decimal=2)


Please also add an assertion that checks that X_inv_trans.mean(axis=0) is close to [0.] and X_inv_trans.std(axis=0) is close to [1.].

ogrisel · 2018-07-15T22:35:10Z

sklearn/preprocessing/tests/test_data.py

+
+    rng = np.random.RandomState(0)
+    n_samples = 1000
+    X = rng.normal(size=(n_samples, 1))


to make the test more explicit you can write: X = rng.normal(loc=0., scale=1., size=(n_samples, 1))

amueller · 2018-07-15T22:37:28Z

If we want this to be the default then this is a blocker, right?

ogrisel · 2018-07-15T22:37:48Z

sklearn/preprocessing/tests/test_data.py

+    lmbda_no_nans = pt.lambdas_[0]
+
+    # concat nans at the end and check lambda stays the same
+    X = np.concatenate([X, np.full_like(X, np.nan)])


To make sure that the location of the NaNs does not impact the estimation:

from sklearn.utils import shuffle
...
X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)

ogrisel · 2018-07-15T22:39:07Z

sklearn/preprocessing/tests/test_data.py

+
+
+@pytest.mark.parametrize("method, lmbda", [('box-cox', .5),
+                                           ('yeo-johnson', .1)])


Could you add more values for lmbda for each method? E.g.:

[
    ('box-cox', .1),
    ('box-cox', .5),
    ('yeo-johnson', .1),
    ('yeo-johnson', .5),
    ('yeo-johnson', 1.),
]

ogrisel · 2018-07-15T22:49:07Z

examples/preprocessing/plot_power_transformer.py

-applied to six different probability distributions: Lognormal, Chi-squared,
-Weibull, Gaussian, Uniform, and Bimodal.
+The power transform is useful as a transformation in modeling problems where
+homoscedasticity and normality are desired. Below are examples of Box-Cox and


I don't understand what "modeling problems where homoscedasticity is desired" means in this context: to me, heteroscedasticity is a property of the noise of the output variable, which is not the same for different regions of the input space of a conditional model.

It is not obvious how a power transform can improve homoscedasticity.

Actually, this statement seems to be correct:

http://article.sapub.org/10.5923.j.ajms.20180801.02.html

It might be interesting to try to come up with a good example to show this corrective effect in a (maybe synthetic) linear regression problem. However, this is probably outside of the scope of the current PR.
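A small synthetic illustration of this corrective effect, kept deliberately simple: the target has multiplicative noise, so its spread grows with the input, and a log transform (Box-Cox with lambda = 0) makes the spread roughly constant. The data-generating process, sample sizes, and names below are assumptions made for this sketch; it is not taken from the linked article or from this PR.

```python
import math
import random
import statistics

rng = random.Random(0)


def sample_targets(x, n=500, noise=0.2):
    """Draw y = exp(0.5 * x + e), e ~ N(0, noise**2): multiplicative noise."""
    return [math.exp(0.5 * x + rng.gauss(0.0, noise)) for _ in range(n)]


y_low, y_high = sample_targets(0.0), sample_targets(4.0)

# Raw scale: the spread at x = 4 is several times the spread at x = 0.
raw_ratio = statistics.pstdev(y_high) / statistics.pstdev(y_low)

# Log scale: the spread is about the same at both inputs.
log_ratio = (statistics.pstdev([math.log(v) for v in y_high])
             / statistics.pstdev([math.log(v) for v in y_low]))

print(round(raw_ratio, 2), round(log_ratio, 2))
```

A ratio far from 1 on the raw scale and close to 1 after the transform is what "improving homoscedasticity" means here.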

amueller · 2018-07-15T22:52:11Z

tagged for 0.20 and added blocker label. I don't like that we keep adding stuff but if we want to make it default we should do it now.

glemaitre

A couple of open comments.
If I am not wrong, we should have something in the common estimator_checks which forces the input to be positive to work with Box-Cox. We probably want to change this behavior if we change the default.

glemaitre · 2018-07-16T04:39:20Z

sklearn/preprocessing/data.py

-        The power transform method. Currently, 'box-cox' (Box-Cox transform)
-        is the only option available.
+    method : str, (default='yeo-johnson')
+        The power transform method. Available methods are 'box-cox' and


We can maybe have a bullet point list for each method referring to the reference section.

glemaitre · 2018-07-16T04:41:05Z

sklearn/preprocessing/data.py

        self.lambdas_ = []
        transformed = []

+        opt_fun = {'box-cox': self._box_cox_optimize,


I would have expected func instead of fun :)

optim_function

glemaitre · 2018-07-16T04:42:02Z

sklearn/preprocessing/data.py

+        opt_fun = {'box-cox': self._box_cox_optimize,
+                   'yeo-johnson': self._yeo_johnson_optimize
+                   }[self.method]
+        trans_fun = {'box-cox': boxcox,


transform_function is probably not too long a name to use

glemaitre · 2018-07-16T04:45:23Z

sklearn/preprocessing/data.py

+
+        return x_inv
+
+    def _yeo_johnson_transform(self, x, lmbda):


Can't we just define the forward transform and take 1 / _yeo_johnson_transform for the inverse?

The inverse here means f^{-1}, not 1 / f
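To make the distinction concrete, here is a minimal sketch of the functional inverse for a single value, checked with a round trip. The function names are illustrative and this is not the implementation from this PR.

```python
import math


def yeo_johnson(x, lmbda):
    """Forward Yeo-Johnson transform of a single float."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return (math.pow(x + 1, lmbda) - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -(math.pow(-x + 1, 2 - lmbda) - 1) / (2 - lmbda)


def yeo_johnson_inverse(y, lmbda):
    """Functional inverse f^{-1}: maps a transformed value back to x."""
    if y >= 0:  # the transform preserves sign, so y >= 0 means x >= 0
        if lmbda == 0:
            return math.expm1(y)
        return math.pow(lmbda * y + 1, 1 / lmbda) - 1
    if lmbda == 2:
        return -math.expm1(-y)
    return 1 - math.pow(-(2 - lmbda) * y + 1, 1 / (2 - lmbda))


# Round trip: f^{-1}(f(x)) recovers x, while 1 / f(.) does not.
for lmbda in (0, 0.5, 1, 1.5, 2):
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        y = yeo_johnson(x, lmbda)
        assert abs(yeo_johnson_inverse(y, lmbda) - x) < 1e-9

y = yeo_johnson(3.0, 0.5)
print(yeo_johnson_inverse(y, 0.5), 1 / yeo_johnson(y, 0.5))
```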

glemaitre · 2018-07-16T04:46:18Z

sklearn/preprocessing/data.py

+            """Return the negative log likelihood of the observed data x as a
+            function of lambda."""
+            psi = self._yeo_johnson_transform(x, lmbda)
+            n = x.shape[0]


n_samples instead

glemaitre · 2018-07-16T04:46:45Z

sklearn/preprocessing/data.py

+            """Return the negative log likelihood of the observed data x as a
+            function of lambda."""
+            psi = self._yeo_johnson_transform(x, lmbda)
+            n = x.shape[0]


Uhm missing x most probably

Oh I see, can we pass x as an argument as well as in the optimize function?

Yes, this is a nested function, so x is implicitly passed anyway.
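For readers less familiar with the pattern: a nested function can read `x` from the enclosing scope, which is why the objective only needs `lmbda` as an argument. A toy sketch, where the objective is a placeholder and not the likelihood from this PR:

```python
def optimize(x, candidates):
    def objective(lmbda):
        # x is not a parameter here: it is captured from the enclosing call.
        return sum((v - lmbda) ** 2 for v in x)

    # The optimizer only ever calls objective(lmbda).
    return min(candidates, key=objective)


print(optimize([1.0, 2.0, 3.0], [0, 1, 2, 3]))
```

Passing `x` explicitly would also work, but single-argument objectives are what scalar optimizers typically expect.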

glemaitre · 2018-07-16T04:52:27Z

sklearn/preprocessing/data.py

+
+            # Estimated mean and variance of the normal distribution
+            mu = psi.sum() / n
+            sig_sq = np.power(psi - mu, 2).sum() / n


Stupid question: is sig_sq the variance? If this is the case, you might want to call it var

I was following the paper's notation. Should I use mean (or mean_) also then?

NicolasHug · 2018-07-16T14:55:17Z

I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset.

As Box-Cox only accepts positive data, one solution is to shift the data by a fixed offset value (typically min(data) + eps):

One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one and, most importantly, the values are much higher.

Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.
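The "offset then Box-Cox" workaround can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the plot: the shift is chosen as `eps - min(data)` so that the smallest shifted value equals `eps`, lambda is fixed instead of estimated, and the helper names are hypothetical.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single strictly positive float."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    return math.log(x) if lmbda == 0 else (math.pow(x, lmbda) - 1) / lmbda


def shift_to_positive(data, eps=1e-3):
    """Shift data so that its minimum becomes eps (assumed convention)."""
    offset = eps - min(data)
    return [v + offset for v in data], offset


data = [-2.0, -0.5, 0.0, 1.5, 4.0]
shifted, offset = shift_to_positive(data)
transformed = [box_cox(v, 0.5) for v in shifted]
print(offset, transformed)
```

Note that the offset depends on the training data, so new data below the training minimum would still be rejected by Box-Cox; Yeo-Johnson needs no such shift.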

GaelVaroquaux · 2018-07-16T14:55:59Z

+1 on comments by @glemaitre . Also, the tests are failing.

TomDLT

Thanks for the added tests.
We might want to add them as common tests at some point, but it might be for another pull-request.

TomDLT · 2018-07-17T20:20:34Z

sklearn/preprocessing/data.py

-            self._scaler = StandardScaler()
-            if force_compute_transform:
-                transformed = self._scaler.fit_transform(transformed)
+            self._scaler = StandardScaler(copy=self.copy)


actually you should be able to use copy=False here, since a copy has already been done just before.


glemaitre

Couple of changes

glemaitre · 2018-07-17T21:55:37Z

sklearn/preprocessing/data.py

+        """Return inverse-transformed input x following Yeo-Johnson inverse
+        transform with parameter lambda.
+
+        Note


glemaitre · 2018-07-17T21:55:49Z

sklearn/preprocessing/data.py

+        """Return transformed input x following Yeo-Johnson transform with
+        parameter lambda.
+
+        Note


glemaitre · 2018-07-17T21:56:26Z

sklearn/preprocessing/data.py

        check_positive : bool
-            If True, check that all data is positive and non-zero.
+            If True, check that all data is positive and non-zero (only if
+            self.method is box-cox).


only if self.method=='box-cox'

glemaitre · 2018-07-17T22:07:21Z

I am waiting to check the example in the documentation


NicolasHug · 2018-07-18T18:31:44Z

@glemaitre @ogrisel, I think the plot looks pretty OK now.

jnothman · 2018-07-19T20:41:31Z

I'm finding the plots in plot_map_data_to_normal relatively hard to navigate intuitively. It's not a blocker, but I think it needs to look more tabular: at the moment it takes some effort to see that each row is a different transformation; a label on the left of the row would be more helpful.

Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

jnothman

Can I clarify why plot_all_scaling still only shows box-cox?

NicolasHug · 2018-07-19T20:56:41Z

> Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

Personally I find it easier to compare the transformations when they're stacked on each other, especially since the axes limits are uniform across the plots.

I don't have anything against having the transformation names on the left. It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider which can be annoying on mobile.

Thanks for mentioning plot_all_scaling, I missed that one.


NicolasHug · 2018-07-19T22:03:23Z

Looks like 14e7c32 broke plot_all_scaling on master:

Traceback (most recent call last):
  File "examples/preprocessing/plot_all_scaling.py", line 71, in <module>
    dataset = fetch_california_housing()
  File "/home/nico/dev/sklearn/sklearn/datasets/california_housing.py", line 128, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib64/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib64/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib.numpy_pickle'

Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, still same). Doesn't the CI check that all the examples are passing?

amueller · 2018-07-20T15:00:03Z

@NicolasHug it's now "fixed" but you need to remove your scikit_learn_data folder in your home folder.


NicolasHug · 2018-07-20T17:27:19Z

Thanks, just updated plot_all_scaling

ogrisel · 2018-07-20T20:32:04Z

The matplotlib rendering of the 2 examples is good enough for now:

https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing

Merging. Thanks @NicolasHug for this nice contribution!

amueller · 2018-07-20T20:33:14Z

yay!

ogrisel · 2018-07-20T20:34:58Z

I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further but I don't want to delay the release for this.

GaelVaroquaux · 2018-07-20T20:58:32Z

Yey!!

NicolasHug added 2 commits July 14, 2018 16:56

WIP - First draft on Yeo-Johnson transform

06891eb

Fixed lambda param optimization

a88d168

The issue was from an error in the log likelihood function

NicolasHug added 2 commits July 15, 2018 10:38

Some first tests

ee09d7f

Need to write inverse_transform to continue

Put helper method for yeo-johnson at the end

aea0842

amueller reviewed Jul 15, 2018


NicolasHug added 7 commits July 15, 2018 11:50

Added inverse transform + some tests

fba12eb

Added test for the optimization procedures

ed5a411

Created _box_cox_optimize method for better code symmetry

8bab32e

Opt for yeo-johnson not influenced by Nan

0525bab

Also added related test

Added doc

8e187c4

Better test for nan in transform()

4173df3

Updated more docs and example

61e2183

ogrisel reviewed Jul 15, 2018


updated test

b1ac8d4

ogrisel reviewed Jul 15, 2018


NicolasHug added 2 commits July 15, 2018 18:43

Modified tests according to reviews

489bc70

Changed default method from cox-box to yeo-johnson

6783e3a

ogrisel reviewed Jul 15, 2018


amueller added the Blocker label Jul 15, 2018

amueller added this to the 0.20 milestone Jul 15, 2018

glemaitre approved these changes Jul 16, 2018


NicolasHug added 2 commits July 17, 2018 15:53

Added tests for the copy parameter

1287f94

Fixed flake8 issues in example plot

0ce4b36

TomDLT approved these changes Jul 17, 2018


NicolasHug added 2 commits July 17, 2018 16:32

set copy to False for the scaler

597a85d

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

593c818

…into yeojohnson

glemaitre requested changes Jul 17, 2018


Addressed comments from glemaitre

8022cc3

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

0c120fb

…into yeojohnson

jnothman reviewed Jul 19, 2018


Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

8234a3e

…into yeojohnson

NicolasHug added 2 commits July 20, 2018 13:10

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

7d529df

…into yeojohnson

Updated plot_all_scaling.py example

c0a01df

ogrisel merged commit 2d232ac into scikit-learn:master Jul 20, 2018

NicolasHug deleted the yeojohnson branch July 20, 2018 20:35

rth mentioned this pull request Aug 7, 2018

DOC Formatting in what's new #11766

Merged

This was referenced Sep 21, 2018

Request: transformation functions - Yeo-Johnson scipy/scipy#6141

Closed

ENH: Added Yeo-Johnson power transformation scipy/scipy#9305

Merged

chang mentioned this pull request Oct 7, 2018

[MRG] FIX Update power_transform docstring and add FutureWarning #12317

Merged

NicolasHug mentioned this pull request Nov 5, 2018

[MRG] Fixes to YeoJohnson transform #12522

Merged


