[MRG+1] SPCA centering and fixing scaling issue #11585
glemaitre merged 10 commits into scikit-learn:master
Conversation
@agramfort Ready for first review (if tests pass)
sklearn/decomposition/sparse_pca.py
Outdated
| If None, the random number generator is the RandomState instance used |
| by `np.random`. |
| normalize_components : boolean |
I would show in the docstring that `normalize_components` is optional and what its default is, like here:
scikit-learn/sklearn/ensemble/bagging.py
Line 477 in bcd6ff3
sklearn/decomposition/sparse_pca.py
Outdated
| normalize_components : boolean |
| False : Use a version of Sparse PCA without components normalization |
| and without data centering. This is likely a bug and even though it's |
| the default for backward compatibility, this should not be used |
If normalize_components=False should not be used, but it's the default for backward compatibility, the usual deprecation cycle should be added.
My fault. You already have the deprecation warning. You should state in the docstring when this was added and when it will be removed.
class SparsePCA(BaseEstimator, TransformerMixin):
"""Sparse Principal Components Analysis (SparsePCA)
...
Parameters
----------
...
normalize_components : boolean, optional (default=False)
False : Use a version of Sparse PCA without ...
.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for backward compatibility. It will be set to ``True`` from 0.22 onwards.
Now I'm not sure if we do this, so feel free to contradict me.
Yes, technically we add the parameter, then change its value, then remove it. It's long!
If it is a bug fix, shouldn't we just change the behavior?
massich
left a comment
Would normalize_components be removed in version 0.22 and kept set to True?
Or would that be a second deprecation cycle, with removal in 0.24?
If removed in 0.22, it should be stated in the deprecation messages.
| random_state=rng, normalize_components=True) |
| results_train = spca_lars.fit_transform(Y) |
| results_test = spca_lars.transform(Y[:10]) |
| assert_array_equal(results_train[0], results_test[0]) |
This is probably too stringent (and it's failing on travis). You probably want to use assert_array_almost_equal.
sklearn/decomposition/sparse_pca.py
Outdated
| self.mean_ = np.mean(X, axis=0) |
| X = X - self.mean_ |
| else: |
| warnings.warn("normalize_components should be set to True. " |
I would be more explicit: "normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. This compatibility mode will be removed in 0.22."
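The centering-plus-warning logic under discussion can be sketched like this (the function and variable names here are illustrative, not the actual sklearn implementation; the warning text is the one suggested above):

```python
import warnings

import numpy as np


def fit_center(X, normalize_components):
    """Sketch of the fit-time centering path (illustrative names)."""
    if normalize_components:
        mean_ = X.mean(axis=0)  # train mean, reused later in transform
        return X - mean_, mean_
    # Backward-compatible path: no centering, warn as suggested above.
    warnings.warn(
        "normalize_components=False is a backward-compatible setting that "
        "implements a non-standard definition of sparse PCA. This "
        "compatibility mode will be removed in 0.22.",
        DeprecationWarning,
    )
    return X, None
```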
GaelVaroquaux
left a comment
Aside from the small comments I made (including fixing travis), there needs to be an entry added to whats_new. This is an important change.
I am confused here. Do we actually want to keep the buggy behaviour?

> I am confused here. Do we actually want to keep the buggy behaviour?

Just in case people were relying on it.
I can be talked out of this opinion.
sklearn/decomposition/sparse_pca.py
Outdated
| by `np.random`. |
| normalize_components : boolean |
| False : Use a version of Sparse PCA without components normalization |
Use indentation:
* If False, ...
* If True, ...
sklearn/decomposition/sparse_pca.py
Outdated
| and without data centering. This is likely a bug and even though it's |
| the default for backward compatibility, this should not be used |
| True : Use a version of Sparse PCA with components normalization |
| and data centering |
sklearn/decomposition/sparse_pca.py
Outdated
| mean_ : array, shape (n_features,) |
| Per-feature empirical mean, estimated from the training set. |
sklearn/decomposition/sparse_pca.py
Outdated
| mean_ : array, shape (n_features,) |
| Per-feature empirical mean, estimated from the training set. |
| Equal to `X.mean(axis=0)`. |
All right. I'm off for tonight but I'll take care of it tomorrow evening if that's all right. I'll take all the comments into account and update the what's new. Are we still OK with the double deprecation path we chose initially (0.20: deprecate the default option False; 0.22: deprecate the param and change the default to True; 0.24: delete the param)?
@GaelVaroquaux @glemaitre @massich Is travis supposed to fail on deprecation warnings? Because, except for some last changes, I think that's the last step.
Yes, travis is supposed to fail on uncaught deprecation warnings. You should catch them in the test, using warnings.simplefilter or @pytest.mark.filterwarnings.
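As a sketch, catching the warning in a test can look like this (the stand-in function below is hypothetical; in the real tests the warning comes from fitting the estimator):

```python
import warnings


def deprecated_fit():
    # Hypothetical stand-in for SparsePCA(...).fit(Y), which emits the
    # deprecation warning when normalize_components is left at False.
    warnings.warn("normalize_components should be set to True.",
                  DeprecationWarning)


# Silence the warning locally so the test runner does not flag it:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    deprecated_fit()

# Alternatively, with pytest, decorate the test instead:
# @pytest.mark.filterwarnings("ignore:normalize_components:DeprecationWarning")
```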
The maths are correct (I think). @FollowKenny, you now need to make sure that normalize_components is set to True in all examples and documentation pages so that no deprecation warning pops up.
I used git grep "SparsePCA(" and it only showed examples/decomposition/plot_faces_decomposition.py. Am I missing something, or is it safe to assume this is the only doc update?
| Y, _, _ = generate_toy_data(3, 10, (8, 8), random_state=rng) |
| with pytest.warns(DeprecationWarning, match="normalize_components"): |
| spca(normalize_components=False).fit(Y) |
| warn_message = "normalize_components" |
Why not X? Y is a bit confusing.
Indeed, but all the other tests use Y (and are equally confusing), so I kept the local convention. Should I change this?
glemaitre
left a comment
Couple of nitpicks before merging
sklearn/decomposition/sparse_pca.py
Outdated
| by `np.random`. |
| normalize_components : boolean, optional (default=False) |
| - if False, Use a version of Sparse PCA without components |
if False, Use -> If False, use
sklearn/decomposition/sparse_pca.py
Outdated
| normalization and without data centering. This is likely a bug and |
| even though it's the default for backward compatibility, |
| this should not be used. |
| - if True, Use a version of Sparse PCA with components normalization |
| and data centering. |
| .. versionadded:: 0.20 |
| .. deprecated:: 0.22 |
I would add a line in between the version added and deprecated.
sklearn/decomposition/sparse_pca.py
Outdated
| .. versionadded:: 0.20 |
| .. deprecated:: 0.22 |
| ``normalize_components`` was added and set to ``False`` for |
You should start the line under the "d" of ``deprecated`` on the previous line.
Refer to: http://www.sphinx-doc.org/en/stable/markup/para.html#directive-deprecated for an example
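For reference, the alignment Sphinx expects looks roughly like this (docstring contents shortened; the wording of the deprecation note is the one proposed earlier in this thread):

```rst
normalize_components : boolean, optional (default=False)
    ...

    .. versionadded:: 0.20

    .. deprecated:: 0.22
       ``normalize_components`` was added and set to ``False`` for
       backward compatibility. It will be set to ``True`` from 0.22
       onwards.
```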
| from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA, PCA |
| from sklearn.utils import check_random_state |
| import pytest |
Put this import in line 5 above import numpy as np
sklearn/decomposition/sparse_pca.py
Outdated
| X = check_array(X) |
| if self.normalize_components: |
| self.mean_ = np.mean(X, axis=0) |
I would almost rather call X.mean(axis=0), but it is true that we don't support sparse matrices.
| random_state=0).fit(Y).transform(Y) |
| random_state=0, |
| normalize_components=norm_comp)\ |
| .fit(Y).transform(Y) |
avoid that \. Just create an instance and then call fit_transform on a second line. It is more readable.
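A sketch of the suggested style, with placeholder data (parameter values are illustrative, and `normalize_components` is omitted here since it only exists in the branch under review):

```python
import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA

rng = np.random.RandomState(0)
Y = rng.randn(20, 8)

# Build the estimator on one line, then fit and transform on the next,
# instead of a backslash-continued .fit(Y).transform(Y) chain.
spca = MiniBatchSparsePCA(n_components=3, alpha=1, random_state=0)
U2 = spca.fit_transform(Y)
```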
| U2 = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha, |
| random_state=0).fit(Y).transform(Y) |
| random_state=0, |
| normalize_components=norm_comp)\ |
| model.fit(rng.randn(5, 4)) |
| assert_array_equal(model.components_, V_init) |
| if norm_comp: |
| assert_array_equal(model.components_, |
It seems to be a float comparison. Shall we use assert_allclose?
All right, but in the else branch I'm leaving assert_array_equal, as this comes from the previous test and the arrays should really be equal since nothing is done to the array in this case.
| assert_array_equal(model.components_, |
| V_init / np.linalg.norm(V_init, axis=1)[:, None]) |
| else: |
| assert_array_equal(model.components_, V_init) |
It seems to be a float comparison. Shall we use assert_allclose?
Ah, I did not see that you suggested this change too, hence my previous comment...
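For illustration, the row-wise normalization being asserted, and the tolerance-based comparison, can be sketched in plain NumPy (the array values here are made up):

```python
import numpy as np
from numpy.testing import assert_allclose

V_init = np.array([[3.0, 4.0],
                   [1.0, 1.0]])

# Row-wise normalization, as applied to components_ when
# normalize_components=True:
V_norm = V_init / np.linalg.norm(V_init, axis=1)[:, None]

# Floating-point results should be compared with a tolerance rather than
# exact equality:
assert_allclose(np.linalg.norm(V_norm, axis=1), 1.0)
```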
Waiting for the CI to be green.
@FollowKenny Thanks a lot!!!!
Awesome! @agramfort @GaelVaroquaux @glemaitre @massich Thanks for your help!
| - :class:`decomposition.SparsePCA` now exposes ``normalize_components``. When |
| set to True, the train and test data are centered with the train mean |
| respectively during the fit phase and the transform phase. This fixes the |
I'm surprised we do a backward-compatibility deprecation for a bug. We haven't really done that in the past and I'm -0 on it. @jnothman do you have an opinion?
Sure, if you want to remove the deprecation, it does count as talking me out of it. It's really a change in behavior, but the previous behavior had no statistical meaning whatsoever.
@FollowKenny: thanks a lot for this work. It was hard and important.
Reference Issues/PRs
fixes #9394
What does this implement/fix? Explain your changes.
@Andrewww reported a scaling issue with SPCA: the transform scaling depended on the number of samples. Investigating this issue, we found several disturbing things:
This PR aims to fix all of that. With all the fixes implemented, the scaling issue disappears, and the transform from PCA and SPCA(alpha=0, ridge_alpha=0) gives exactly the same results.
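To illustrate the reported issue, here is a sketch (not the actual sklearn code) of the kind of per-batch rescaling that made the transform's scale depend on the number of samples:

```python
import numpy as np

rng = np.random.RandomState(0)
component = np.array([[1.0, 0.0, 0.0]])  # one fixed component (illustrative)


def old_style_transform(X):
    # Project, then rescale each column by its norm over the whole batch.
    # Because that norm grows like sqrt(n_samples), the output scale
    # shrinks as more samples are transformed together.
    U = X @ component.T
    return U / np.sqrt((U ** 2).sum(axis=0))


scale_small = np.abs(old_style_transform(rng.randn(100, 3))).mean()
scale_large = np.abs(old_style_transform(rng.randn(10000, 3))).mean()
# scale_small / scale_large comes out around sqrt(10000 / 100) = 10,
# i.e. the "same" data transforms to different scales depending on batch size.
```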
Any other comments?
TODO: