[MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler#11235
Conversation
|
@jnothman I would just correct the inconsistencies for the moment. It allows me to go further in the ignoring NaNs PR. |
|
|
||
|
|
||
| def _check_attributes_scalers(scaler_1, scaler_2): | ||
| assert scaler_1.mean_ == scaler_2.mean_ |
There was a problem hiding this comment.
since that this is for the case where mean_ will be None, I don't think so.
There was a problem hiding this comment.
I would rather make the check explicit in this case:
assert scaler_1.mean_ is scaler_2.mean_ is None(I did not know that such ternary identity assertions would be valid python but it seem to be the case :)
doc/whats_new/v0.20.rst
Outdated
| when returning a sparse matrix output. :issue:`11042` by :user:`Daniel | ||
| Morales <DanielMorales9>`. | ||
|
|
||
| - Fix inconsistency between sparse and dense case in |
There was a problem hiding this comment.
Is this merely an API inconsistency, or will it affect users?
There was a problem hiding this comment.
When running twice fit with sparse matrix and with_mean=False and with_std=False, it was crashing. That's why I thought it would be a bug fix.
Do you prefer to move it to API change summary (or not document it in what's new?)
There was a problem hiding this comment.
and in the dense case we change mean_ from returning an array to returning None which seems more logic and what is already happening for var_.
There was a problem hiding this comment.
I'm just trying to work out how to make clear to users how this change will affect them. I don't think your message makes that clear at all.
There was a problem hiding this comment.
Oh ok, is this formulation better
Fix inconsistencies in :class:`preprocessing.StandardScaler` with `with_mean=False`
and `with_std=False`. ``mean_`` will be set to ``None`` with both sparse and dense
inputs. ``n_samples_seen_`` will be also reported for both input types.
|
I suppose, but you also said there was a crash in the sparse case with
multiple calls...???
…On Tue, 12 Jun 2018 at 22:40, Guillaume Lemaitre ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/whats_new/v0.20.rst
<#11235 (comment)>
:
> @@ -504,6 +504,10 @@ Preprocessing
when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
Morales <DanielMorales9>`.
+- Fix inconsistency between sparse and dense case in
Oh ok, is this formulation better
Fix inconsistencies in :class:`preprocessing.StandardScaler` with `with_mean=False`
and `with_std=False`. ``mean_`` will be set to ``None`` with both sparse and dense
inputs. ``n_samples_seen_`` will be also reported for both input types.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11235 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz67iAwl2LbLUevLLnwc3epCUIIw83ks5t77bTgaJpZM4UioOU>
.
|
|
Because ``n_samples_seen_` was not computed. |
|
So saying that you fixed partial_fit in StandardScaler would be much more
practically informative.
|
Agreed. I change the entry. |
|
Thanks :)
…On Tue, 12 Jun 2018 at 22:58, Guillaume Lemaitre ***@***.***> wrote:
So saying that you fixed partial_fit in StandardScaler would be much more
practically informative.
Agreed. I change the entry.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11235 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6zvbfdZ5Upt5wZn3uQ7Ujf4Q9oxyks5t77rxgaJpZM4UioOU>
.
|
|
@ogrisel if you are around, I'd like some feedback :) |
ogrisel
left a comment
There was a problem hiding this comment.
Here are some comments on the tests. The change itself looks fine to me.
|
|
||
|
|
||
| def _check_attributes_scalers(scaler_1, scaler_2): | ||
| assert scaler_1.mean_ == scaler_2.mean_ |
There was a problem hiding this comment.
I would rather make the check explicit in this case:
assert scaler_1.mean_ is scaler_2.mean_ is None(I did not know that such ternary identity assertions would be valid python but it seem to be the case :)
| X_trans_csc = transformer_csc.fit_transform(X_csc) | ||
|
|
||
| assert_array_almost_equal(X_trans_csr.A, X_csr.A) | ||
| assert_array_almost_equal(X_trans_csc.A, X_csc.A) |
There was a problem hiding this comment.
What is this .A attributed on sparse matrices? It's not documented in the docstring. I would rather use the more explicit: X_trans_csr.toarray(), X_csr.toarray() instead.
Also, didn't we decide to favor assert_allclose instead of assert_array_almost_equal?
Also here X_dense has an integer dtype. I think We should rather make this test on np.float32 or np.float64.
If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating point values on the output, even with with_mean=False, with_std=False to keep consistency. But if you disagree, I can probably be convinced otherwise.
There was a problem hiding this comment.
If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating point values on the output, even with with_mean=False, with_std=False to keep consistency. But if you disagree, I can probably be convinced otherwise.
This is a good point to keep in mind. However, I would delegate this part to the issues/PRs which try to preserve the dtype: #11000
| assert_array_almost_equal(X_csc_scaled_back.toarray(), X) | ||
|
|
||
|
|
||
| def _check_attributes_scalers(scaler_1, scaler_2): |
There was a problem hiding this comment.
nitpick: This sounds French to me. _check_scalers_attributes is more natural I think.
There was a problem hiding this comment.
Also if the function is only valid for the identity case, we should probably reflect that in the function name:
_check_identity_scalers_attributes
doc/whats_new/v0.20.rst
Outdated
| Morales <DanielMorales9>`. | ||
|
|
||
| - Fix ``fit`` and ``partial_fit`` in :class:`preprocessing.StandardScaler` with | ||
| `with_mean=False` and `with_std=False` which was crashing by calling ``fit`` |
There was a problem hiding this comment.
Maybe note that this is a rare case so readers don't worry about the change
Reference Issues/PRs
closes #11234
What does this implement/fix? Explain your changes.
Ensure that the
mean_attributes andn_samples_seen_are the same in the sparse and dense cases withStandardScaler(with_mean=False, with_std=False)Any other comments?