[MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler by glemaitre · Pull Request #11235 · scikit-learn/scikit-learn

glemaitre · 2018-06-11T12:45:38Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Ensure that the mean_ and n_samples_seen_ attributes are the same in the sparse and dense cases with StandardScaler(with_mean=False, with_std=False).
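As a rough illustration of the behavior described above (not code from this PR), the sketch below fits the same data as a dense array and as a CSR matrix. The toy data and the variable names are made up, and the sketch assumes a scikit-learn release that includes this fix.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): 3 samples, 2 features, no missing values.
X_dense = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
X_sparse = sparse.csr_matrix(X_dense)

# With both options disabled the scaler acts as an identity transformer.
scaler_dense = StandardScaler(with_mean=False, with_std=False).fit(X_dense)
scaler_sparse = StandardScaler(with_mean=False, with_std=False).fit(X_sparse)

# Both input types should expose the same fitted attributes.
print(scaler_dense.mean_, scaler_sparse.mean_)
print(scaler_dense.n_samples_seen_, scaler_sparse.n_samples_seen_)
```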

Any other comments?

glemaitre · 2018-06-11T12:55:06Z

@jnothman I would just correct the inconsistencies for now. That lets me move forward with the PR on ignoring NaNs.

jnothman · 2018-06-11T23:54:28Z

sklearn/preprocessing/tests/test_data.py



+def _check_attributes_scalers(scaler_1, scaler_2):
+    assert scaler_1.mean_ == scaler_2.mean_


don't we need array equality?

Since this is for the case where mean_ will be None, I don't think so.

I would rather make the check explicit in this case:

assert scaler_1.mean_ is scaler_2.mean_ is None

(I did not know that such ternary identity assertions were valid Python, but it seems to be the case :)
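A small sketch of the two points raised in this thread, with made-up variable names. Python evaluates `a is b is None` as `(a is b) and (b is None)`, while a bare `assert` on `==` between multi-element NumPy arrays cannot be reduced to a single truth value.

```python
import numpy as np

# Chained identity comparison: equivalent to
# (mean_1 is mean_2) and (mean_2 is None).
mean_1 = None
mean_2 = None
assert mean_1 is mean_2 is None

# With multi-element arrays, `==` is element-wise, so a bare assert on it
# raises ValueError instead of returning a single boolean.
arr_1 = np.array([0.5, 1.5])
arr_2 = np.array([0.5, 1.5])
try:
    assert arr_1 == arr_2
    ambiguous = False
except ValueError:
    ambiguous = True

# For real arrays, an explicit array comparison is the usual choice.
np.testing.assert_array_equal(arr_1, arr_2)


def both_none(a, b):
    """Return True only when both arguments are None (hypothetical helper)."""
    return a is b is None
```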

jnothman · 2018-06-11T23:54:35Z

doc/whats_new/v0.20.rst

  when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
  Morales <DanielMorales9>`.

+- Fix inconsistency between sparse and dense case in


Is this merely an API inconsistency, or will it affect users?

Calling fit twice on a sparse matrix with with_mean=False and with_std=False was crashing. That's why I considered it a bug fix.

Would you prefer to move it to the API change summary, or not document it in what's new at all?

Also, in the dense case mean_ changes from an array to None, which seems more logical and matches what already happens for var_.
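The scenario discussed here (repeated fitting on sparse input with both options disabled) can be sketched as follows. This is an illustration with made-up data, assuming a scikit-learn release that includes the fix. It is not the test added in this PR.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse input (illustrative only): 3 samples, 2 features.
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))

scaler = StandardScaler(with_mean=False, with_std=False)

# Two incremental passes: the sample count should accumulate.
scaler.partial_fit(X)
scaler.partial_fit(X)
n_after_two_partial_fits = scaler.n_samples_seen_

# A plain fit resets the state before counting again.
scaler.fit(X)
n_after_refit = scaler.n_samples_seen_

print(n_after_two_partial_fits, n_after_refit)
```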

I'm just trying to work out how to make clear to users how this change will affect them. I don't think your message makes that clear at all.

Oh OK, is this formulation better?

Fix inconsistencies in :class:`preprocessing.StandardScaler` with ``with_mean=False`` and ``with_std=False``. ``mean_`` will be set to ``None`` with both sparse and dense inputs. ``n_samples_seen_`` will also be reported for both input types.

jnothman · 2018-06-12T12:42:46Z

I suppose, but you also said there was a crash in the sparse case with multiple calls...???

glemaitre · 2018-06-12T12:48:37Z

Because ``n_samples_seen_`` was not computed.

jnothman · 2018-06-12T12:51:08Z

So saying that you fixed partial_fit in StandardScaler would be much more practically informative.

glemaitre · 2018-06-12T12:58:22Z

Agreed. I'll change the entry.

jnothman · 2018-06-12T12:59:03Z

Thanks :)

glemaitre · 2018-06-12T13:27:52Z

@ogrisel if you are around, I'd like some feedback :)

ogrisel

Here are some comments on the tests. The change itself looks fine to me.

ogrisel · 2018-06-12T13:41:45Z

sklearn/preprocessing/tests/test_data.py



+def _check_attributes_scalers(scaler_1, scaler_2):
+    assert scaler_1.mean_ == scaler_2.mean_


I would rather make the check explicit in this case:

assert scaler_1.mean_ is scaler_2.mean_ is None

(I did not know that such ternary identity assertions were valid Python, but it seems to be the case :)

ogrisel · 2018-06-12T13:51:17Z

sklearn/preprocessing/tests/test_data.py

+    X_trans_csc = transformer_csc.fit_transform(X_csc)
+
+    assert_array_almost_equal(X_trans_csr.A, X_csr.A)
+    assert_array_almost_equal(X_trans_csc.A, X_csc.A)


What is this .A attribute on sparse matrices? It's not documented in the docstring. I would rather use the more explicit X_trans_csr.toarray(), X_csr.toarray() instead.

Also, didn't we decide to favor assert_allclose instead of assert_array_almost_equal?

Also, X_dense has an integer dtype here. I think we should rather run this test on np.float32 or np.float64.

If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating point values on the output, even with with_mean=False, with_std=False to keep consistency. But if you disagree, I can probably be convinced otherwise.

This is a good point to keep in mind. However, I would delegate this part to the issues/PRs which try to preserve the dtype: #11000
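The suggestions in this thread (explicit `toarray()`, `assert_allclose`, and floating-point input) could look roughly like the sketch below. The data and variable names are illustrative and this is not the test as merged.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Floating-point toy input (illustrative only), stored in two sparse formats.
X_dense = np.array(
    [[0.0, 1.0, 3.0], [5.0, 0.0, 0.0], [0.0, 2.0, 4.0]], dtype=np.float64
)
X_csr = sparse.csr_matrix(X_dense)
X_csc = sparse.csc_matrix(X_dense)

# With both options disabled, the transform should leave the values unchanged.
X_trans_csr = StandardScaler(with_mean=False, with_std=False).fit_transform(X_csr)
X_trans_csc = StandardScaler(with_mean=False, with_std=False).fit_transform(X_csc)

# Explicit conversion to dense arrays before comparing.
assert_allclose(X_trans_csr.toarray(), X_csr.toarray())
assert_allclose(X_trans_csc.toarray(), X_csc.toarray())
```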

ogrisel · 2018-06-12T13:52:27Z

sklearn/preprocessing/tests/test_data.py

    assert_array_almost_equal(X_csc_scaled_back.toarray(), X)


+def _check_attributes_scalers(scaler_1, scaler_2):


nitpick: This sounds French to me. _check_scalers_attributes is more natural I think.

Also if the function is only valid for the identity case, we should probably reflect that in the function name:

_check_identity_scalers_attributes
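A helper along these lines might look like the following sketch. The body is an assumption pieced together from the attributes discussed in this review (mean_, var_, n_samples_seen_), not necessarily the function as merged, and the toy data is made up.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler


def check_identity_scalers_attributes(scaler_1, scaler_2):
    """Check two identity scalers expose the same fitted attributes (sketch)."""
    # Only valid for with_mean=False, with_std=False, where no statistics
    # are computed and both attributes are expected to be None.
    assert scaler_1.mean_ is scaler_2.mean_ is None
    assert scaler_1.var_ is scaler_2.var_ is None
    assert scaler_1.n_samples_seen_ == scaler_2.n_samples_seen_


# Toy data (illustrative only): 4 samples, 2 features, no missing values.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]])
inputs = {
    "dense": X,
    "csr": sparse.csr_matrix(X),
    "csc": sparse.csc_matrix(X),
}
fitted = {
    name: StandardScaler(with_mean=False, with_std=False).fit(data)
    for name, data in inputs.items()
}

# Compare the dense scaler against each sparse one.
check_identity_scalers_attributes(fitted["dense"], fitted["csr"])
check_identity_scalers_attributes(fitted["dense"], fitted["csc"])
```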

ogrisel

LGTM.

jnothman · 2018-06-13T23:49:07Z

doc/whats_new/v0.20.rst

  Morales <DanielMorales9>`.

+- Fix ``fit`` and ``partial_fit`` in :class:`preprocessing.StandardScaler` with
+  `with_mean=False` and `with_std=False` which was crashing by calling ``fit``


Maybe note that this is a rare case so readers don't worry about the change

glemaitre · 2018-06-14T08:19:25Z

@jnothman @ogrisel Last comments addressed. Good to be merged :)

glemaitre added 3 commits June 11, 2018 14:42

FIX enforce consistency between dense and sparse cases in StandardScaler

d943e97

DOC whats new

da6cc85

TST check partial_fit and reset through fit

1c1ef03

glemaitre changed the title ~~FIX enforce consistency between dense and sparse cases in StandardScaler~~ [MRG] FIX enforce consistency between dense and sparse cases in StandardScaler Jun 11, 2018

glemaitre added 2 commits June 11, 2018 15:06

FIX do not use random for older scipy version

fc0d867

iter

46f3f49

glemaitre mentioned this pull request Jun 11, 2018

[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206

Merged

9 tasks

jnothman reviewed Jun 11, 2018

glemaitre added 2 commits June 12, 2018 14:52

DOC update whats new

daf8d93

DOC update whats new

16fd126

glemaitre changed the title ~~[MRG] FIX enforce consistency between dense and sparse cases in StandardScaler~~ [MRG] FIX: count n_samples_seen in fit and partial_fit in StandardScaler Jun 12, 2018

ogrisel reviewed Jun 12, 2018

glemaitre changed the title ~~[MRG] FIX: count n_samples_seen in fit and partial_fit in StandardScaler~~ [MRG] FIX: enforce consistency between dense and sparse cases in StandardScaler Jun 12, 2018

address ogrisel comments

fe0b08d

ogrisel approved these changes Jun 13, 2018

ogrisel changed the title ~~[MRG] FIX: enforce consistency between dense and sparse cases in StandardScaler~~ [MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler Jun 13, 2018

ogrisel added this to the 0.20 milestone Jun 13, 2018

jnothman approved these changes Jun 13, 2018

DOC do not worry to much

023492e

jnothman merged commit a4f8e3d into scikit-learn:master Jun 14, 2018



3 participants