[MRG+1] SPCA centering and fixing scaling issue #11585
glemaitre merged 10 commits into scikit-learn:master
Conversation
@agramfort Ready for first review (if tests pass)
sklearn/decomposition/sparse_pca.py
Outdated
| If None, the random number generator is the RandomState instance used |
| by `np.random`. |
| normalize_components : boolean |
I would show in the docstring that `normalize_components` is optional and what its default is, like here:
scikit-learn/sklearn/ensemble/bagging.py
Line 477 in bcd6ff3
sklearn/decomposition/sparse_pca.py
Outdated
| normalize_components : boolean |
| False : Use a version of Sparse PCA without components normalization |
| and without data centering. This is likely a bug and even though it's |
| the default for backward compatibility, this should not be used |
If normalize_components=False should not be used, but it's the default for backward compatibility, the usual deprecation cycle should be added.
My fault. You already have the deprecation warning. You should state in the docstring when this was added and when it will be removed.
class SparsePCA(BaseEstimator, TransformerMixin):
"""Sparse Principal Components Analysis (SparsePCA)
...
Parameters
----------
...
normalize_components : boolean, optional (default=False)
False : Use a version of Sparse PCA without ...
.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for backward compatibility. It will be set to ``True`` from 0.22 onwards.
Now I'm not sure if we do this, so feel free to contradict me.
Yes, technically we add the parameter, then change its value, then remove it. It's long!
If it is a bug fix, shouldn't we just change the behavior?
massich
left a comment
Would normalize_components be removed in version 0.22 and kept set to True?
Or would that be a second deprecation cycle, with removal in 0.24?
If removed in 0.22, it should be stated in the deprecation messages.
| random_state=rng, normalize_components=True) |
| results_train = spca_lars.fit_transform(Y) |
| results_test = spca_lars.transform(Y[:10]) |
| assert_array_equal(results_train[0], results_test[0]) |
This is probably too stringent (and it's failing on travis). You probably want to use assert_array_almost_equal.
sklearn/decomposition/sparse_pca.py
Outdated
| self.mean_ = np.mean(X, axis=0) |
| X = X - self.mean_ |
| else: |
| warnings.warn("normalize_components should be set to True. " |
I would be more explicit: "normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. This compatibility mode will be removed in 0.22."
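The centering-plus-warning logic under discussion can be sketched like this (the function and variable names here are illustrative, not the actual sklearn implementation; the warning text is the one suggested above):

```python
import warnings

import numpy as np


def fit_center(X, normalize_components):
    """Sketch of the fit-time centering path (illustrative names)."""
    if normalize_components:
        mean_ = X.mean(axis=0)  # train mean, reused later in transform
        return X - mean_, mean_
    # Backward-compatible path: no centering, warn as suggested above.
    warnings.warn(
        "normalize_components=False is a backward-compatible setting that "
        "implements a non-standard definition of sparse PCA. This "
        "compatibility mode will be removed in 0.22.",
        DeprecationWarning,
    )
    return X, None
```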
GaelVaroquaux
left a comment
Aside from the small comments I made (including fixing travis), there needs to be an entry added to whats_new. This is an important change.
I am confused here. Do we actually want to keep the buggy behaviour?

> I am confused here. Do we actually want to keep the buggy behaviour?

Just in case people were relying on it.
I can be talked out of this opinion.
sklearn/decomposition/sparse_pca.py
Outdated
| by `np.random`. |
| normalize_components : boolean |
| False : Use a version of Sparse PCA without components normalization |
Use indentation:
* If False, ...
* If True, ...
sklearn/decomposition/sparse_pca.py
Outdated
| and without data centering. This is likely a bug and even though it's |
| the default for backward compatibility, this should not be used |
| True : Use a version of Sparse PCA with components normalization |
| and data centering |
sklearn/decomposition/sparse_pca.py
Outdated
| mean_ : array, shape (n_features,) |
| Per-feature empirical mean, estimated from the training set. |
sklearn/decomposition/sparse_pca.py
Outdated
| mean_ : array, shape (n_features,) |
| Per-feature empirical mean, estimated from the training set. |
| Equal to `X.mean(axis=0)`. |
All right. I'm off for tonight but I'll take care of it tomorrow evening if that's all right. I'll take all the comments into account and update the what's new. Are we still OK with the double deprecation path we chose initially (0.20: deprecate the default option False; 0.22: deprecate the param and change the default to True; 0.24: delete the param)?
@GaelVaroquaux @glemaitre @massich Is travis supposed to fail on deprecation warnings? Because, except for some last changes, I think that's the last step.
Yes, travis is supposed to fail on uncaught deprecation warnings. You should catch them in the test, using warnings.simplefilter or @pytest.mark.filterwarnings.
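As a sketch, catching the warning in a test can look like this (the stand-in function below is hypothetical; in the real tests the warning comes from fitting the estimator):

```python
import warnings


def deprecated_fit():
    # Hypothetical stand-in for SparsePCA(...).fit(Y), which emits the
    # deprecation warning when normalize_components is left at False.
    warnings.warn("normalize_components should be set to True.",
                  DeprecationWarning)


# Silence the warning locally so the test runner does not flag it:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    deprecated_fit()

# Alternatively, with pytest, decorate the test instead:
# @pytest.mark.filterwarnings("ignore:normalize_components:DeprecationWarning")
```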
The maths are correct (I think). @FollowKenny, you now need to make sure that normalize_components is set to True in all examples and documentation pages so that no deprecation warning pops up.
I used git grep "SparsePCA(" and it only showed examples/decomposition/plot_faces_decomposition.py. Am I missing something, or is it safe to assume this is the only doc update?
| Y, _, _ = generate_toy_data(3, 10, (8, 8), random_state=rng) |
| with pytest.warns(DeprecationWarning, match="normalize_components"): |
| spca(normalize_components=False).fit(Y) |
| warn_message = "normalize_components" |
Why not X? Y is a bit confusing.
Indeed, but all the other tests use Y (and are equally confusing), so I kept the local convention. Should I change this?
glemaitre
left a comment
Couple of nitpicks before merging
sklearn/decomposition/sparse_pca.py
Outdated
| by `np.random`. |
| normalize_components : boolean, optional (default=False) |
| - if False, Use a version of Sparse PCA without components |
if False, Use -> If False, use
sklearn/decomposition/sparse_pca.py
Outdated
| normalization and without data centering. This is likely a bug and |
| even though it's the default for backward compatibility, |
| this should not be used. |
| - if True, Use a version of Sparse PCA with components normalization |
| and data centering. |
| .. versionadded:: 0.20 |
| .. deprecated:: 0.22 |
I would add a line in between the version added and deprecated.
sklearn/decomposition/sparse_pca.py
Outdated
| .. versionadded:: 0.20 |
| .. deprecated:: 0.22 |
| ``normalize_components`` was added and set to ``False`` for |
You should start the line under the "d" of ``deprecated`` on the previous line.
Refer to: http://www.sphinx-doc.org/en/stable/markup/para.html#directive-deprecated for an example
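For reference, the alignment Sphinx expects looks roughly like this (docstring contents shortened; the wording of the deprecation note is the one proposed earlier in this thread):

```rst
normalize_components : boolean, optional (default=False)
    ...

    .. versionadded:: 0.20

    .. deprecated:: 0.22
       ``normalize_components`` was added and set to ``False`` for
       backward compatibility. It will be set to ``True`` from 0.22
       onwards.
```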
| from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA, PCA |
| from sklearn.utils import check_random_state |
| import pytest |
Put this import in line 5 above import numpy as np
sklearn/decomposition/sparse_pca.py
Outdated
| X = check_array(X) |
| if self.normalize_components: |
| self.mean_ = np.mean(X, axis=0) |
I would almost rather call X.mean(axis=0), but it is true that we don't support sparse matrices.
| random_state=0).fit(Y).transform(Y) |
| random_state=0, |
| normalize_components=norm_comp)\ |
| .fit(Y).transform(Y) |
avoid that \. Just create an instance and then call fit_transform on a second line. It is more readable.
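A sketch of the suggested style, with placeholder data (parameter values are illustrative, and `normalize_components` is omitted here since it only exists in the branch under review):

```python
import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA

rng = np.random.RandomState(0)
Y = rng.randn(20, 8)

# Build the estimator on one line, then fit and transform on the next,
# instead of a backslash-continued .fit(Y).transform(Y) chain.
spca = MiniBatchSparsePCA(n_components=3, alpha=1, random_state=0)
U2 = spca.fit_transform(Y)
```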
| U2 = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha, |
| random_state=0).fit(Y).transform(Y) |
| random_state=0, |
| normalize_components=norm_comp)\ |
| model.fit(rng.randn(5, 4)) |
| assert_array_equal(model.components_, V_init) |
| if norm_comp: |
| assert_array_equal(model.components_, |
It seems to be a float comparison. Shall we use assert_allclose?
All right, but in the else branch I'm leaving assert_array_equal, as this comes from the previous test and the arrays should really be equal since nothing is done to the array in this case.
| assert_array_equal(model.components_, |
| V_init / np.linalg.norm(V_init, axis=1)[:, None]) |
| else: |
| assert_array_equal(model.components_, V_init) |
It seems to be a float comparison. Shall we use assert_allclose?
Ah, I did not see that you suggested this change too, hence my previous comment...
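For illustration, the row-wise normalization being asserted, and the tolerance-based comparison, can be sketched in plain NumPy (the array values here are made up):

```python
import numpy as np
from numpy.testing import assert_allclose

V_init = np.array([[3.0, 4.0],
                   [1.0, 1.0]])

# Row-wise normalization, as applied to components_ when
# normalize_components=True:
V_norm = V_init / np.linalg.norm(V_init, axis=1)[:, None]

# Floating-point results should be compared with a tolerance rather than
# exact equality:
assert_allclose(np.linalg.norm(V_norm, axis=1), 1.0)
```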
Waiting for the CI to be green.
@FollowKenny Thanks a lot!!!!
Awesome! @agramfort @GaelVaroquaux @glemaitre @massich Thanks for your help!
| - :class:`decomposition.SparsePCA` now exposes ``normalize_components``. When |
| set to True, the train and test data are centered with the train mean |
| respectively during the fit phase and the transform phase. This fixes the |
I'm surprised we do a backward-compatibility deprecation for a bug. We haven't really done that in the past and I'm -0 on it. @jnothman do you have an opinion?
Sure, if you want to remove the deprecation, it does count as talking me out of it. It's really a change in behavior, but the previous behavior had no statistical meaning whatsoever.
@FollowKenny: thanks a lot for this work. It was hard and important.
Reference Issues/PRs
fixes #9394
What does this implement/fix? Explain your changes.
@Andrewww reported a scaling issue with SPCA: the transform scaling depended on the number of samples. Investigating this issue, we found several disturbing things:
This PR aims to fix all of that. With all the fixes implemented, the scaling issue disappears, and the transform from PCA and SPCA(alpha=0, ridge_alpha=0) gives exactly the same results.
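To illustrate the reported issue, here is a sketch (not the actual sklearn code) of the kind of per-batch rescaling that made the transform's scale depend on the number of samples:

```python
import numpy as np

rng = np.random.RandomState(0)
component = np.array([[1.0, 0.0, 0.0]])  # one fixed component (illustrative)


def old_style_transform(X):
    # Project, then rescale each column by its norm over the whole batch.
    # Because that norm grows like sqrt(n_samples), the output scale
    # shrinks as more samples are transformed together.
    U = X @ component.T
    return U / np.sqrt((U ** 2).sum(axis=0))


scale_small = np.abs(old_style_transform(rng.randn(100, 3))).mean()
scale_large = np.abs(old_style_transform(rng.randn(10000, 3))).mean()
# scale_small / scale_large comes out around sqrt(10000 / 100) = 10,
# i.e. the "same" data transforms to different scales depending on batch size.
```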
Any other comments?
TODO: