[MRG] Fixes to YeoJohnson transform #12522
Conversation
```python
    assert_almost_equal(1, X_inv_trans.std(), decimal=1)

def test_yeo_johnson_darwin_example():
```
Yes, we should try to insist on such tests where possible, but we often forget.
doc/whats_new/v0.20.rst

```rst
    in version 0.23. A FutureWarning is raised when the default value is used.
    :issue:`12317` by :user:`Eric Chang <chang>`.

    - |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
```
```diff
     # when x >= 0
-    if lmbda < 1e-19:
+    if abs(lmbda) < np.spacing(1.):
```

ok so the abs here is the bugfix, right, and the rest is minor stability changes?
Thanks @NicolasHug
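To see why the `abs` matters: with the old check, `lmbda < 1e-19` is true for *every* negative lambda, so e.g. lambda = -1 was wrongly routed through the log branch. A minimal sketch of the x >= 0 branch (hypothetical helper name, not the actual sklearn source):

```python
import numpy as np

def yeo_johnson_pos(x, lmbda):
    """Forward Yeo-Johnson for x >= 0 (illustrative sketch only)."""
    # Buggy version: `if lmbda < 1e-19:` sends ALL negative lambdas here.
    # Fixed version: treat lambda as zero only when it is numerically zero.
    if abs(lmbda) < np.spacing(1.):
        return np.log1p(x)                      # the lambda -> 0 limit
    return ((x + 1) ** lmbda - 1) / lmbda       # general power branch

# lambda = -1 must use the power branch: ((1 + 1)**-1 - 1) / -1 = 0.5
print(yeo_johnson_pos(1.0, -1.0))  # 0.5
print(yeo_johnson_pos(1.0, 0.0))   # log(2) ~= 0.693
```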
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
So, the yeo-johnson transformation parameter MUST be between 0 and 2 to ensure that the domain of the inverse transform is on the real line. What was the reasoning for removing the constraints during the optimization for lambda?
What do you mean "on the real line"? How could the transformation give complex output if lambda is not in [0, 2]?
Also note that there never was any constraint on lambda to be in [0, 2]. This PR just fixes the transformations, not the optimization procedure for lambda. The scipy brent optimizer accepts a bracket (here (-2, 2)), but as you can read in the docs, that doesn't mean the returned value will be in that interval.
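The bracket point is easy to demonstrate: `scipy.optimize.brent` uses the bracket only as a starting interval for its downhill search, so the minimizer it returns can lie well outside it. A quick illustration with a toy objective (not the actual Yeo-Johnson likelihood):

```python
from scipy.optimize import brent

# The minimum is at x = 10, far outside the starting bracket (-2, 2),
# yet brent happily walks out of the bracket and finds it.
xmin = brent(lambda x: (x - 10) ** 2, brack=(-2, 2))
print(xmin)  # ~10.0
```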
Ok got it. If x_trans = 2 and lambda = -0.9900990099009901, then performing the inverse transform you would be raising a negative number to a negative non-integer power, which would be complex.
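Concretely, for x_trans >= 0 and lambda != 0 the inverse Yeo-Johnson is x = (lambda * x_trans + 1)**(1/lambda) - 1, and the base goes negative as soon as x_trans > -1/lambda. In real (float) arithmetic that shows up as nan rather than a complex number:

```python
import numpy as np

lmbda = -0.9900990099009901
x_trans = 2.0

# Inverse Yeo-Johnson for x_trans >= 0, lambda != 0:
#   x = (lambda * x_trans + 1) ** (1 / lambda) - 1
base = lmbda * x_trans + 1            # negative once x_trans > -1 / lambda
x = np.power(base, 1.0 / lmbda) - 1   # negative base, non-integer power -> nan
print(base, x)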
Also, you could use fminbound to perform the optimization and respect the constraints of [0, 2].
Would you have a reproducing example that produces such an x_trans and lambda?
Logit-transforming this data, then fitting the yeo-johnson transform, produces a lambda value of -1.42272153. Trying to inverse-transform anything greater than -1 / -1.42272153 = 0.7028782364740063 will produce a nan, since the inverse operation will attempt to take the root of a negative value.
Do you have an actual code example? I cannot reproduce your results.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PowerTransformer

df = pd.read_csv('success_rate.txt')
X = np.array(df.success_rate).reshape(-1, 1)
t = PowerTransformer(method='yeo-johnson')
t.fit(X)
print(t.lambdas_)
X_trans = t.transform(X)
print((X_trans > (-1 / t.lambdas_)).sum())
print(np.iscomplex(t.inverse_transform(X_trans)).sum())
```

Also, logit-transforming does not give you the expected lambda.
The reason I'm asking for actual code is that you might just be trying to inverse-transform values that aren't in the range of the transformation to begin with. In which case it's just normal that you end up with complex / bad values.
I can send code, but you are right that with the fitted lambda I am taking the inverse of values outside the range. However, with the above constraints on lambda the range would be the real line and I wouldn't run into that issue. Wouldn't it make sense to include an option to restrict lambda to be between 0 and 2?
If you only get complex values when you try to inverse-transform values that shouldn't be inverse-transformed (that is, breaking the math, basically), then I don't think we should be handling this use-case. However, if it's possible to get complex values by inverse-transforming values that are in the range of possible values of the transform, that would be worth fixing.

That being said, a simple workaround on your side would be e.g. `t.lambdas_ = np.clip(t.lambdas_, 0, 2)`.
I don't think that workaround would work so simply with scaling, since the scaler would have to be re-fit IIRC. However, what I'm getting at is that if I want the range of my transformation to be the real line, then I would need 0 <= lambda <= 2, so that I could take the inverse transform of any real number. If this is something that you don't want supported, then I guess that's fine; I was just wondering about the rationale.
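For what it's worth, the scaler concern can be sidestepped by fitting with standardize=False. A sketch of the clipping workaround on synthetic data (the data and names here are just for illustration): after clipping the lambdas into [0, 2], transform and inverse_transform are still exact mutual inverses (though no longer inverses of the originally fitted transform), and the inverse is defined on the whole real line.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 1))  # skewed toy data

# standardize=False avoids the stale StandardScaler problem entirely
t = PowerTransformer(method='yeo-johnson', standardize=False).fit(X)
t.lambdas_ = np.clip(t.lambdas_, 0, 2)

# transform / inverse_transform remain mutual inverses after clipping
X_rt = t.inverse_transform(t.transform(X))
print(np.allclose(X_rt, X))  # True
```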
This reverts commit 2585410.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
While implementing the Yeo-Johnson transformation in scipy scipy/scipy#9305, I noticed a few mistakes I made in my first implementation here (#11520).
This PR:

- fixes the comparisons between lambda and 0/2 (I was mistaken in thinking that lambda was necessarily in [0, 2], which isn't the case)
- uses `log1p` and `spacing(1.)`, as advised during the scipy PR reviews

Any other comments?
There's a numerical instability issue in scipy's box-cox (scipy/scipy#6873) that is reproducible for yeo-johnson as well. I'm keeping an eye on this and will port the fix once it's merged on scipy.
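As an aside, the switch to `log1p` is one of those stability changes: for tiny x, computing `1 + x` rounds the information away before `log` ever sees it, whereas `log1p` keeps full precision.

```python
import numpy as np

x = 1e-18  # smaller than half the machine epsilon (~2.2e-16)

print(np.log(1 + x))  # 0.0 -- the sum 1 + x already rounded to exactly 1.0
print(np.log1p(x))    # 1e-18 -- full precision retained
```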