[MRG] DOC data centering in PCA by ilemhadri · Pull Request #9934 · scikit-learn/scikit-learn

ilemhadri · 2017-10-16T16:56:36Z

Following the discussion on the mailing list [subject: unclear help file for sklearn.decomposition.pca, author: Ismael Lemhadri, date: 16 Oct 2017], I modified the help file of "sklearn/decomposition/pca.py" to make it clear that the data matrix is centered (but not scaled) before the singular value decomposition is performed.

Reference Issue

What does this implement/fix? Explain your changes.

improve the PCA documentation on data centering

Any other comments?

rth

Thanks for the PR!

Please have a look at the Travis CI output below: the flake8 job failed because there are some formatting issues.

rth · 2017-10-16T19:09:19Z

sklearn/decomposition/pca.py

    SVD by the method of Halko et al. 2009, depending on the shape of the input
-    data and the number of components to extract.
+    data and the number of components to extract. It centers the data matrix
+    columnwise (but does not scale it) before performing Singular Value Decomposition.


I wouldn't mention the scaling at all; there is no reason why PCA (or any algorithm) would do any preprocessing unless otherwise stated. Maybe,

The input data is centered for each feature before applying the SVD.

or something similar?
The SVD abbreviation is used in the previous sentence, so there is no need to explain what it is.
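
A minimal sketch of what "centered but not scaled" means in practice, for readers who want to check the behaviour themselves. The random data, the seed, and the variable names are illustrative assumptions and are not part of the PR; `svd_solver="full"` is chosen only to make the comparison with a plain NumPy SVD direct.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Illustrative data: three features with different offsets and spreads.
X = rng.normal(loc=[10.0, -5.0, 3.0], scale=[1.0, 4.0, 0.5], size=(200, 3))

pca = PCA(n_components=2, svd_solver="full").fit(X)

# Manual reference: subtract the per-feature mean, then take the SVD.
X_centered = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Centering: the fitted mean is the column mean, and the components match
# the SVD of the centered data (up to a sign flip per component).
mean_matches = np.allclose(pca.mean_, X.mean(axis=0))
components_match = np.allclose(np.abs(pca.components_), np.abs(Vt[:2]))
singular_values_match = np.allclose(pca.singular_values_, s[:2])

# No scaling: standardizing the features first yields different components.
X_scaled = X_centered / X.std(axis=0)
pca_scaled = PCA(n_components=2, svd_solver="full").fit(X_scaled)
scaling_changes_result = not np.allclose(
    np.abs(pca.components_), np.abs(pca_scaled.components_)
)

print(mean_matches, components_match, singular_values_match,
      scaling_changes_result)
```

If the features should contribute on a comparable scale, they have to be standardized explicitly before fitting, as `X_scaled` does here.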

rth · 2017-10-16T19:16:29Z

Also please add "[MRG] DOC" to the title of this PR, for instance by renaming it to "[MRG] DOC data centering in PCA"

ilemhadri · 2017-10-16T22:30:04Z

@rth: I have edited both the docstring and the user guide to reflect your suggestions. I am not sure what needs to be done to correct the flake8 formatting issue though.

rth · 2017-10-16T22:45:29Z

@ilemhadri The flake8 errors are due to the trailing whitespace you added on L111 (see diff) and to a line of more than 80 characters in the previous version (this should be fine now).
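
The two kinds of problem mentioned here can be reproduced with a rough stand-alone check. This is only a sketch, not flake8 itself: the function name `find_style_issues` and the sample text are hypothetical, the codes W291, W293, and E501 are the pycodestyle codes for trailing whitespace, whitespace on a blank line, and an over-long line, and the limit of 79 characters is the flake8 default.

```python
def find_style_issues(source, max_line_length=79):
    """Return (line_number, code) pairs for two simple style checks.

    A simplified imitation of what flake8 reports; it does not call flake8.
    """
    issues = []
    for number, line in enumerate(source.splitlines(), start=1):
        if line != line.rstrip():
            # Blank lines containing only whitespace get their own code.
            code = "W293" if line.strip() == "" else "W291"
            issues.append((number, code))
        if len(line) > max_line_length:
            issues.append((number, "E501"))
    return issues


# Hypothetical sample: line 1 ends in spaces, line 3 is 85 characters long.
sample = "data and the number of components.   \n" + "x = 1\n" + "#" * 85 + "\n"
issues = find_style_issues(sample)
print(issues)
```

Running flake8 locally on the modified file before pushing reports the same problems that the CI job does.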

ilemhadri · 2017-10-17T02:28:42Z

@rth this should be fixed now.

amueller · 2017-10-17T20:10:08Z

sklearn/decomposition/pca.py

    It uses the LAPACK implementation of the full SVD or a randomized truncated
    SVD by the method of Halko et al. 2009, depending on the shape of the input
-    data and the number of components to extract.
+    data and the number of components to extract. The input data is centered


Maybe move that to the first paragraph above? Otherwise seems good.

rth · 2017-10-17T22:34:14Z

doc/modules/decomposition.rst

 clustering algorithm.

+Note: the :class:`PCA` object centers the input data for each feature before
+applying the SVD.


Maybe move this to the beginning of the previous paragraph (and remove "Note" and "class..")? E.g.,

PCA centers [...] before applying the SVD. The optional parameter whiten etc..

rth · 2017-10-17T22:35:27Z

Added a small comment above. Otherwise LGTM.

jnothman · 2017-10-18T00:42:23Z

It seems my email reply did not go through:
What is the centering behaviour in IncrementalPCA and SparsePCA, for comparison?

rth · 2017-10-18T09:20:05Z

IncrementalPCA also centers the data; I wasn't able to tell from looking at the SparsePCA code...

rth · 2017-10-18T09:22:40Z

I mean there is nothing that looks like centering in SparsePCA, unless the way ridge_regression or dict_learning is used has some centering effects. I'm not familiar with that code..
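
The IncrementalPCA half of this observation can be checked with a short sketch. The data, seed, and batch size are illustrative assumptions; the example says nothing about SparsePCA.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
# Illustrative data whose features have clearly non-zero means.
X = rng.normal(loc=[10.0, -5.0, 3.0], size=(100, 3))

ipca = IncrementalPCA(n_components=2, batch_size=20).fit(X)

# The running mean accumulated over the batches equals the column mean.
mean_matches = np.allclose(ipca.mean_, X.mean(axis=0))

# transform() subtracts that mean, so the mean sample maps to the origin.
projected_mean = ipca.transform(X.mean(axis=0, keepdims=True))
mean_maps_to_origin = np.allclose(projected_mean, 0.0)

print(mean_matches, mean_maps_to_origin)
```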

jnothman · 2017-10-18T09:31:32Z

so I'd suggest that we make parallel changes to IPCA at least

ilemhadri · 2017-10-19T01:14:34Z

@rth @amueller I have now taken your remarks into account in my latest push.

ilemhadri · 2017-10-25T02:40:31Z

@rth @amueller is there anything else I need to do before these changes are approved?

rth · 2017-10-26T12:11:06Z

@ilemhadri I don't have any other comments, it looks good to me. Maybe just address @jnothman's comment about IPCA please. Then the PR would need to be accepted by a core developer.

Thanks for your contribution!

glemaitre · 2017-11-28T13:40:15Z

@ilemhadri I agree with @jnothman that you should also make an addition in IncrementalPCA.
Regarding SparsePCA, I checked briefly but I don't think that the data can be centered (it would destroy the sparsity).
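
The remark about sparsity can be illustrated with a small sketch, assuming it refers to zeros in the input matrix being lost once the column means are subtracted. The toy matrix is a hypothetical example, and the code does not use SparsePCA at all.

```python
import numpy as np

# Toy matrix: 18 entries, only 3 of them non-zero.
X = np.zeros((6, 3))
X[0, 0] = 1.0
X[2, 1] = 2.0
X[5, 2] = 3.0

# Subtracting the (non-zero) column means shifts every zero entry.
X_centered = X - X.mean(axis=0)

nonzero_before = np.count_nonzero(X)
nonzero_after = np.count_nonzero(X_centered)

print(nonzero_before, nonzero_after)
```

Every column here has a non-zero mean, so no entry of the centered matrix remains zero.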

amueller · 2017-11-28T22:36:59Z

Should we say "centered but not scaled" to be super explicit?

glemaitre · 2017-11-28T22:51:55Z

Should we say "centered but not scaled" to be super explicit?

Looks like good phrasing to me.

abenbihi · 2019-02-25T11:05:18Z

I am adding the explicit wording "centered but not scaled" for PCA, and making the same modifications for IPCA.

GaelVaroquaux · 2019-02-25T12:54:34Z

Closing in favor of the continuation #13242

rth reviewed Oct 16, 2017

ilemhadri changed the title ~~improve the documentation of sklearn.decomposition.pca on data centering~~ [MRG] DOC data centering in PCA Oct 16, 2017

modified the docstring and the user guide following the comment of rth

5a5bf30

stats207 added 2 commits October 16, 2017 17:39

[MRG] DOC on data centering in PCA

6652b23

removed trailing space from pca.py

58e30e0

amueller reviewed Oct 17, 2017

rth reviewed Oct 17, 2017

moved it the beginning of the previous paragraph

e4d6005

TomDLT added the Documentation label Oct 20, 2017

rth added the Stalled label Jul 22, 2018

amueller added the Easy, good first issue, and help wanted labels Aug 21, 2018

abenbihi mentioned this pull request Feb 25, 2019

[MRG+2] DOC data centering in PCA #13242

Merged

GaelVaroquaux closed this Feb 25, 2019


10 participants
