[MRG] Fixes to YeoJohnson transform #12522
Conversation
```python
    assert_almost_equal(1, X_inv_trans.std(), decimal=1)

def test_yeo_johnson_darwin_example():
```
Yes, we should try to insist on such tests where possible, but we often forget.
doc/whats_new/v0.20.rst

```rst
    in version 0.23. A FutureWarning is raised when the default value is used.
    :issue:`12317` by :user:`Eric Chang <chang>`.

    - |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
```
```diff
     # when x >= 0
-    if lmbda < 1e-19:
+    if abs(lmbda) < np.spacing(1.):
```

ok so the abs here is the bugfix, right, and the rest is minor stability changes?
Thanks @NicolasHug
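To see why the `abs` matters: with the old check, `lmbda < 1e-19` is true for *every* negative lambda, so e.g. lambda = -1 was wrongly routed through the log branch. A minimal sketch of the x >= 0 branch (hypothetical helper name, not the actual sklearn source):

```python
import numpy as np

def yeo_johnson_pos(x, lmbda):
    """Forward Yeo-Johnson for x >= 0 (illustrative sketch only)."""
    # Buggy version: `if lmbda < 1e-19:` sends ALL negative lambdas here.
    # Fixed version: treat lambda as zero only when it is numerically zero.
    if abs(lmbda) < np.spacing(1.):
        return np.log1p(x)                      # the lambda -> 0 limit
    return ((x + 1) ** lmbda - 1) / lmbda       # general power branch

# lambda = -1 must use the power branch: ((1 + 1)**-1 - 1) / -1 = 0.5
print(yeo_johnson_pos(1.0, -1.0))  # 0.5
print(yeo_johnson_pos(1.0, 0.0))   # log(2) ~= 0.693
```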
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
So, the yeo-johnson transformation parameter MUST be between 0 and 2 to ensure that the domain of the inverse transform is on the real line. What was the reasoning for removing the constraints during the optimization for lambda?
What do you mean "on the real line"? How could the transformation give complex output if lambda is not in [0, 2]?
Also note that there never was any constraint on lambda to be in [0, 2]. This PR just fixes the transformations, not the optimization procedure for lambda. The scipy brent optimizer accepts a bracket (here (-2, 2)), but as you can read in the docs, that doesn't mean the returned value will be in that interval.
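The bracket point is easy to demonstrate: `scipy.optimize.brent` uses the bracket only as a starting interval for its downhill search, so the minimizer it returns can lie well outside it. A quick illustration with a toy objective (not the actual Yeo-Johnson likelihood):

```python
from scipy.optimize import brent

# The minimum is at x = 10, far outside the starting bracket (-2, 2),
# yet brent happily walks out of the bracket and finds it.
xmin = brent(lambda x: (x - 10) ** 2, brack=(-2, 2))
print(xmin)  # ~10.0
```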
Ok got it. If x_trans = 2 and lambda = -0.9900990099009901, then performing the inverse transform you would be raising a negative number to a negative non-integer power, which would be complex.
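Concretely, for x_trans >= 0 and lambda != 0 the inverse Yeo-Johnson is x = (lambda * x_trans + 1)**(1/lambda) - 1, and the base goes negative as soon as x_trans > -1/lambda. In real (float) arithmetic that shows up as nan rather than a complex number:

```python
import numpy as np

lmbda = -0.9900990099009901
x_trans = 2.0

# Inverse Yeo-Johnson for x_trans >= 0, lambda != 0:
#   x = (lambda * x_trans + 1) ** (1 / lambda) - 1
base = lmbda * x_trans + 1            # negative once x_trans > -1 / lambda
x = np.power(base, 1.0 / lmbda) - 1   # negative base, non-integer power -> nan
print(base, x)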
Also, you could use fminbound to perform the optimization and respect the constraints of [0, 2].
Would you have a reproducing example that produces such an x_trans and lambda?
Logit-transforming this data, then fitting the yeo-johnson transform, produces a lambda value of -1.42272153. Trying to inverse-transform anything greater than -1 / -1.42272153 = 0.7028782364740063 will produce a nan, since the inverse operation will attempt to take the root of a negative value.
Do you have an actual code example? I cannot reproduce your results.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PowerTransformer

df = pd.read_csv('success_rate.txt')
X = np.array(df.success_rate).reshape(-1, 1)
t = PowerTransformer(method='yeo-johnson')
t.fit(X)
print(t.lambdas_)
X_trans = t.transform(X)
print((X_trans > (-1 / t.lambdas_)).sum())
print(np.iscomplex(t.inverse_transform(X_trans)).sum())
```

Also, logit-transforming does not give you the expected lambda.
The reason I'm asking for actual code is that you might just be trying to inverse-transform values that aren't in the range of the transformation to begin with. In which case it's just normal that you end up with complex / bad values.
I can send code, but you are right that with the fitted lambda I am taking the inverse of values outside the range. However, with the above constraints on lambda the range would be the real line and I wouldn't run into that issue. Wouldn't it make sense to include an option to restrict lambda to be between 0 and 2?
If you only get complex values when you try to inverse-transform values that shouldn't be inverse-transformed (that is, breaking the math, basically), then I don't think we should be handling this use-case. However, if it's possible to get complex values by inverse-transforming values that are in the range of possible values of the transform, that would be worth fixing.

That being said, a simple workaround on your side would be e.g. `t.lambdas_ = np.clip(t.lambdas_, 0, 2)`.
I don't think that workaround would work so simply with scaling, since the scaler would have to be re-fit IIRC. However, what I'm getting at is that if I want the range of my transformation to be the real line, then I would need 0 <= lambda <= 2, so that I could take the inverse transform of any real number. If this is something that you don't want supported, then I guess that's fine; I was just wondering about the rationale.
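For what it's worth, the scaler concern can be sidestepped by fitting with standardize=False. A sketch of the clipping workaround on synthetic data (the data and names here are just for illustration): after clipping the lambdas into [0, 2], transform and inverse_transform are still exact mutual inverses (though no longer inverses of the originally fitted transform), and the inverse is defined on the whole real line.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 1))  # skewed toy data

# standardize=False avoids the stale StandardScaler problem entirely
t = PowerTransformer(method='yeo-johnson', standardize=False).fit(X)
t.lambdas_ = np.clip(t.lambdas_, 0, 2)

# transform / inverse_transform remain mutual inverses after clipping
X_rt = t.inverse_transform(t.transform(X))
print(np.allclose(X_rt, X))  # True
```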
This reverts commit 2585410.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
While implementing the Yeo-Johnson transformation in scipy scipy/scipy#9305, I noticed a few mistakes I made in my first implementation here (#11520).
This PR:

- fixes the comparisons between lambda and 0/2 (I was mistaken in thinking that lambda was necessarily in [0, 2], which isn't the case)
- uses `log1p` and `spacing(1.)`, as advised during the scipy PR reviews

Any other comments?
There's a numerical instability issue in scipy's box-cox (scipy/scipy#6873) that is reproducible for yeo-johnson as well. I'm keeping an eye on this and will port the fix once it's merged on scipy.
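As an aside, the switch to `log1p` is one of those stability changes: for tiny x, computing `1 + x` rounds the information away before `log` ever sees it, whereas `log1p` keeps full precision.

```python
import numpy as np

x = 1e-18  # smaller than half the machine epsilon (~2.2e-16)

print(np.log(1 + x))  # 0.0 -- the sum 1 + x already rounded to exactly 1.0
print(np.log1p(x))    # 1e-18 -- full precision retained
```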