[MRG] ignore NaNs in PowerTransformer#11306
Conversation
|
The diff will be better once #11206 will be merged. |
|
My concern with this PR is this line: It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan) |
Why is that concerning? It's got roughly the same complexity as |
|
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014 |
ogrisel
left a comment
There was a problem hiding this comment.
Besides the following small comment, the PowerTransformer specific part of this PR LGTM.
sklearn/preprocessing/data.py
Outdated
| # get rid of them to compute them. | ||
| _, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None) | ||
| # FIXME: stats.boxcox should be changed by special.boxcox which | ||
| # handles NaN and does not raise warnings. Check SciPy 0.14.X |
There was a problem hiding this comment.
Why not use special.boxcox if the installed scipy is recent enough? This will make it easier to drop the backward compat code once we do not support scipy < 0.14.
Ideally, the boxcox backward compat function that ignores the nan warnings should be extracted in sklearn.utils.fixes to make it easy to clean up the backward compat code in the future.
There was a problem hiding this comment.
It may not be so easy to backport special.boxcox... It's something like:
from libc.math cimport log, log1p, expm1, exp, fabs
cimport numpy as np
cdef inline double _func_boxcox(double x, double lmbda) nogil:
# if lmbda << 1 and log(x) < 1.0, the lmbda*log(x) product can lose
# precision, furthermore, expm1(x) == x for x < eps.
# For doubles, the range of log is -744.44 to +709.78, with eps being
# the smallest value produced. This range means that we will have
# abs(lmbda)*log(x) < eps whenever abs(lmbda) <= eps/-log(min double)
# which is ~2.98e-19.
if fabs(lmbda) < 1e-19:
return log(x)
else:
return expm1(lmbda * log(x)) / lmbda
cdef void loop_d_dd__As_dd_d(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
cdef np.npy_intp i, n = dims[0]
cdef void *func = (<void**>data)[0]
cdef char *func_name = <char*>(<void**>data)[1]
cdef char *ip0 = args[0]
cdef char *ip1 = args[1]
cdef char *op0 = args[2]
cdef double ov0
for i in range(n):
ov0 = (<double(*)(double, double) nogil>func)(<double>(<double*>ip0)[0], <double>(<double*>ip1)[0])
(<double *>op0)[0] = <double>ov0
ip0 += steps[0]
ip1 += steps[1]
op0 += steps[2]
sf_error.check_fpe(func_name)
cdef void loop_d_dd__As_ff_f(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
cdef np.npy_intp i, n = dims[0]
cdef void *func = (<void**>data)[0]
cdef char *func_name = <char*>(<void**>data)[1]
cdef char *ip0 = args[0]
cdef char *ip1 = args[1]
cdef char *op0 = args[2]
cdef double ov0
for i in range(n):
ov0 = (<double(*)(double, double) nogil>func)(<double>(<float*>ip0)[0], <double>(<float*>ip1)[0])
(<float *>op0)[0] = <float>ov0
ip0 += steps[0]
ip1 += steps[1]
op0 += steps[2]
sf_error.check_fpe(func_name)
cdef np.PyUFuncGenericFunction ufunc_boxcox_loops[2]
cdef void *ufunc_boxcox_ptr[4]
cdef void *ufunc_boxcox_data[2]
cdef char ufunc_boxcox_types[6]
cdef char *ufunc_boxcox_doc = (
"boxcox(x, lmbda)\n"
"\n"
"Compute the Box-Cox transformation.\n"
"\n"
"The Box-Cox transformation is::\n"
"\n"
" y = (x**lmbda - 1) / lmbda if lmbda != 0\n"
" log(x) if lmbda == 0\n"
"\n"
"Returns `nan` if ``x < 0``.\n"
"Returns `-inf` if ``x == 0`` and ``lmbda < 0``.\n"
"\n"
"Parameters\n"
"----------\n"
"x : array_like\n"
" Data to be transformed.\n"
"lmbda : array_like\n"
" Power parameter of the Box-Cox transform.\n"
"\n"
"Returns\n"
"-------\n"
"y : array\n"
" Transformed data.\n"
"\n"
"Notes\n"
"-----\n"
"\n"
".. versionadded:: 0.14.0\n"
"\n"
"Examples\n"
"--------\n"
">>> from scipy.special import boxcox\n"
">>> boxcox([1, 4, 10], 2.5)\n"
"array([ 0. , 12.4 , 126.09110641])\n"
">>> boxcox(2, [0, 1, 2])\n"
"array([ 0.69314718, 1. , 1.5 ])")
ufunc_boxcox_loops[0] = <np.PyUFuncGenericFunction>loop_d_dd__As_ff_f
ufunc_boxcox_loops[1] = <np.PyUFuncGenericFunction>loop_d_dd__As_dd_d
ufunc_boxcox_types[0] = <char>NPY_FLOAT
ufunc_boxcox_types[1] = <char>NPY_FLOAT
ufunc_boxcox_types[2] = <char>NPY_FLOAT
ufunc_boxcox_types[3] = <char>NPY_DOUBLE
ufunc_boxcox_types[4] = <char>NPY_DOUBLE
ufunc_boxcox_types[5] = <char>NPY_DOUBLE
ufunc_boxcox_ptr[2*0] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*0+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_ptr[2*1] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*1+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_data[0] = &ufunc_boxcox_ptr[2*0]
ufunc_boxcox_data[1] = &ufunc_boxcox_ptr[2*1]
boxcox = np.PyUFunc_FromFuncAndData(ufunc_boxcox_loops, ufunc_boxcox_data, ufunc_boxcox_types, 2, 2, 1, 0, "boxcox", ufunc_boxcox_doc, 0)There was a problem hiding this comment.
But I've now realised you probably aren't suggesting to backport special.boxcox, merely to move these few lines to fixes. Yes.
|
This pull request introduces 2 alerts when merging 08960ec into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
|
This pull request introduces 2 alerts when merging e626a39 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
Reference Issues/PRs
Towards #10404.
Should be merged after:
What does this implement/fix? Explain your changes.
Any other comments?