[MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale #11308
ogrisel merged 16 commits into scikit-learn:master
Conversation
Force-push: 37f7cb5 to b1c5a21
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
jnothman left a comment:
Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.
Oh, so I should materialize the zeros.
I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%, so for many sparse matrices quantiles are not relevant.
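To illustrate jnothman's point with a small sketch (not code from the PR): for a 20%-dense column, the median of the full column is zero, while the median of its stored nonzeros alone is not.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
# 100-row column at 20% density: the implicit zeros are the majority
X = sparse.rand(100, 1, density=0.2, random_state=rng).tocsc()

# Percentile over the stored nonzeros only is biased: it ignores the
# 80 implicit zeros (sparse.rand stores values in (0, 1))
median_nnz = np.percentile(X.data, 50)

# Materializing the zeros gives the percentile of the actual column
col = np.zeros(X.shape[0])
col[:X.data.size] = X.data  # positions are irrelevant for percentiles
median_full = np.percentile(col, 50)

print(median_nnz > 0, median_full == 0.0)
```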
In this case, it should be mentioned that
Actually I have a second thought. More specifically, we should deprecate
def test_robust_scaler_equivalence_dense_sparse():
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=0.5).tocsc()
To avoid regressions, please make sure this works with all-positive, all-negative, all-zero, mixed-sign (positive, negative, and zero), and all-nonzero columns. Check with a couple of densities and also with integer dtypes.
NaNs with integer dtype should not work.
jorisvandenbossche left a comment:
Reminder: update transform docstring
This pull request introduces 1 alert when merging de147a4 into e555893 - view on LGTM.com. (Comment posted by LGTM.com)
@jnothman I added a test for the different cases of density and sign of the matrix.
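A sketch of what such a check amounts to (the densities, sign cases, and shapes here are assumptions for illustration, not the PR's actual test): compare `scale_` fitted on the sparse matrix against the same matrix densified.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import RobustScaler

results = []
for density in [0.0, 0.05, 0.5, 1.0]:
    for sign in ("positive", "negative", "mixed"):
        X_sp = sparse.rand(100, 5, density=density, random_state=42).tocsc()
        if sign == "positive":
            X_sp.data = np.abs(X_sp.data)
        elif sign == "negative":
            X_sp.data = -np.abs(X_sp.data)
        else:  # shift the (0, 1) values so both signs appear
            X_sp.data = X_sp.data - 0.5
        # Centering is not supported on sparse input, so compare scale_ only
        scale_sp = RobustScaler(with_centering=False).fit(X_sp).scale_
        scale_de = RobustScaler(with_centering=False).fit(X_sp.toarray()).scale_
        results.append(np.allclose(scale_sp, scale_de))
print(all(results))
```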
This pull request introduces 1 alert when merging f884532 into e555893 - view on LGTM.com. (Comment posted by LGTM.com)
for feature_idx in range(X.shape[1]):
    if sparse.issparse(X):
        column_nnz_data = X.data[X.indptr[feature_idx]:
                                 X.indptr[feature_idx + 1]]
I assume you realise that there should be a more asymptotically efficient way to handle the sparse case, as it should be easy to work out whether a percentile is zero, positive or negative, then adjust the quantile parameter...
But this is fine in the first instance.
To be honest, I don't know about it.
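A rough sketch of the idea jnothman alludes to (my reconstruction, not scikit-learn code): the sorted full column is just [sorted negatives, zeros, sorted positives], so the percentile's rank can be located among the three regions without ever materializing the zeros.

```python
import numpy as np

def sparse_percentile(nnz_data, n_rows, q):
    """Percentile (method='lower') of a column with n_rows entries of
    which only nnz_data are nonzero, without materializing the zeros."""
    k = int(np.floor(q / 100.0 * (n_rows - 1)))  # rank of the percentile
    neg = np.sort(nnz_data[nnz_data < 0])
    pos = np.sort(nnz_data[nnz_data > 0])
    n_zeros = n_rows - neg.size - pos.size
    if k < neg.size:
        return neg[k]                      # rank falls among the negatives
    if k < neg.size + n_zeros:
        return 0.0                         # rank falls among the zeros
    return pos[k - neg.size - n_zeros]     # rank falls among the positives

# Cross-check against the dense computation
rng = np.random.RandomState(0)
dense = np.zeros(1000)
dense[:700] = rng.randn(700)
nnz = dense[dense != 0]
for q in (25, 50, 75):
    assert sparse_percentile(nnz, dense.size, q) == np.percentile(
        dense, q, method="lower")
```

This is O(nnz log nnz) per column instead of O(n log n), which is where the asymptotic saving comes from.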
assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)
@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
# Check the equivalence of the fitting with dense and sparse matrices
X_sparse = sparse.rand(1000, 5, density=density).tocsc()
if strictly_signed == 'positif':
    X_sparse.data += X_sparse.min()
This won't make it positive. Subtracting the min will. So will abs.
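A tiny deterministic example of the bug (the matrix values are made up for illustration): when the minimum is negative, adding it shifts the stored values further down, while subtracting it, or taking the absolute value, yields nonnegative values.

```python
import numpy as np
from scipy import sparse

# A small matrix with mixed signs and implicit zeros
X = sparse.csc_matrix(np.array([[0.0, -0.3],
                                [0.5,  0.0],
                                [0.0,  0.2]]))

add_min = X.data + X.min()   # X.min() is -0.3, so this shifts values *down*
sub_min = X.data - X.min()   # shifts every stored value up to >= 0
abs_val = np.abs(X.data)     # also nonnegative

print(add_min.min(), sub_min.min() >= 0, abs_val.min() >= 0)
```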
@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
@pytest.mark.parametrize("strictly_signed",
                         ['positif', 'negatif', 'zeros', None])
This pull request introduces 1 alert when merging 7f22b3f into 93382cc - view on LGTM.com. (Comment posted by LGTM.com)
This pull request introduces 1 alert when merging 05a23f6 into eec7649 - view on LGTM.com. (Comment posted by LGTM.com)
doc/whats_new/v0.20.rst
Outdated
:func:`preprocessing.power_transform` ignore and pass-through NaN values.
:issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`preprocessing.RobustScaler` and :func:`preprocessing.robust_scale`
  ignore and pass-through NaN values. In addition, the scaler can now be fitted
I think the second comment is a separate enhancement and should be reported separately. I would like to see the "ignore and pass-through NaN values" what's new entries merged into one (not necessarily in this PR).
sklearn/preprocessing/data.py
Outdated
self.scale_ = (q[1] - q[0])
quantiles.append(
    nanpercentile(column_data, self.quantile_range)
    if column_data.size else [0, 0])
I think you no longer need this if column_data.size check?
Nope, because it returns NaN otherwise (which is consistent with the numpy function).
In [1]: import numpy as np
In [2]: from sklearn.utils.fixes import nanpercentile
In [3]: nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]: nan
In [4]: np.nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[4]: nan
But when would you encounter 0 samples after check_array?
With sparse matrices with all zero column:
In [1]: from scipy import sparse
In [3]: from sklearn.utils import check_array
In [4]: X = sparse.rand(10, 2, density=0).tocsr()
In [5]: check_array(X, accept_sparse=True)
Out[5]:
<10x2 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [6]: X.data
Out[6]: array([], dtype=float64)
Yes, but you're now only ever passing in something with shape X.shape[0]
Oh sorry, I see now
column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)
I missed this and was thinking that we were passing the column's nonzero data, which can be an empty array.
Making the changes then.
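The resolved logic roughly amounts to the following sketch (a reconstruction for illustration, not the library code): because the buffer always has `X.shape[0]` entries, `nanpercentile` never sees an empty array, so the `if column_data.size` guard is indeed unnecessary.

```python
import numpy as np
from scipy import sparse

def column_quantiles(X, quantile_range=(25.0, 75.0)):
    """Per-column quantiles; for sparse input the implicit zeros are
    materialized into a full-length buffer before taking percentiles."""
    quantiles = []
    for j in range(X.shape[1]):
        if sparse.issparse(X):
            nnz = X.data[X.indptr[j]:X.indptr[j + 1]]
            # Full-length buffer: every column has X.shape[0] entries
            column_data = np.zeros(X.shape[0], dtype=X.dtype)
            column_data[:nnz.size] = nnz
        else:
            column_data = X[:, j]
        quantiles.append(np.nanpercentile(column_data, quantile_range))
    return np.transpose(quantiles)

X = sparse.rand(100, 3, density=0.1, random_state=0).tocsc()
q_sparse = column_quantiles(X)
q_dense = column_quantiles(X.toarray())
print(q_sparse.shape, np.allclose(q_sparse, q_dense))
```

Percentiles are order-invariant, so packing the nonzeros at the front of the buffer gives the same result as the dense column.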
sklearn/preprocessing/data.py
Outdated
quantiles = np.transpose(quantiles)
self.scale_ = (quantiles[1] - quantiles[0])
Force-push: 05a23f6 to eecb39a
This pull request introduces 1 alert when merging eecb39a into 3b5abf7 - view on LGTM.com. (Comment posted by LGTM.com)
This pull request introduces 1 alert when merging 86cf707 into 3b5abf7 - view on LGTM.com. (Comment posted by LGTM.com)
Have you checked that unused import?
I don't agree with LGTM: this is an import in fixes. Sent from my phone, sorry to be brief and possibly misspell.
@ogrisel, good to merge?
Merged! Thanks, @glemaitre!
check_is_fitted(self, 'center_', 'scale_')
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                estimator=self, dtype=FLOAT_DTYPES,
                force_all_finite='allow-nan')
FYI this affected dask-ml. Previously the logic in this transformer was equally applicable to numpy and dask arrays. Now it auto-converts dask arrays to numpy arrays.
Reference Issues/PRs
Toward #10404.
Merge #11011 first, since it provides better coverage in the common tests.
What does this implement/fix? Explain your changes.
TODO:
Any other comments?