[MRG] DOC data centering in PCA#9934
Conversation
…ile for sklearn.decomposition.pca, author:Ismael Lemhadri, date:16 Oct 2017], modified the help file of "sklearn/decomposition/pca.py" to make it clear that the data matrix is centered (but not scaled) before performing singular value decomposition.
sklearn/decomposition/pca.py
Outdated
| SVD by the method of Halko et al. 2009, depending on the shape of the input | ||
| data and the number of components to extract. | ||
| data and the number of components to extract. It centers the data matrix | ||
| columnwise (but does not scale it) before performing Singular Value Decomposition. |
There was a problem hiding this comment.
I wouldn't mention the scaling at all; there is no reason why PCA (or any algorithm) would do a preprocessing unless otherwise stated. Maybe,
The input data is centered for each feature before applying the SVD.
or something similar?
The SVD abbreviation is used the previous sentence, so there is no need to explain what it is.
|
Also please add "[MRG] DOC" to the title of this PR, for instance by renaming it to "[MRG] DOC data centering in PCA" |
|
@rth : I have edited both the docstring and the userguide to reflect your suggestions. I am not sure what needs to be done to correct the flake8 formatting issue though. |
|
@ilemhadri The flake8 are due to the trailing whitespace you added L111 (see diff) and since the previous version had a line with more than 80 char (this should be fine now). |
|
@rth this should be fixed now. |
sklearn/decomposition/pca.py
Outdated
| It uses the LAPACK implementation of the full SVD or a randomized truncated | ||
| SVD by the method of Halko et al. 2009, depending on the shape of the input | ||
| data and the number of components to extract. | ||
| data and the number of components to extract. The input data is centered |
There was a problem hiding this comment.
Maybe move that to the first paragraph above? Otherwise seems good.
doc/modules/decomposition.rst
Outdated
| clustering algorithm. | ||
|
|
||
| Note: the :class:`PCA` object centers the input data for each feature before | ||
| applying the SVD. |
There was a problem hiding this comment.
Maybe move this to the beginning of the previous paragraph (and remove "Note" and "class..") ? .e.g
PCA centers [...] before applying the SVD. The optional parameter whiten etc..
|
Added a small comment above. Otherwise LGTM. |
|
It seems my email reply did not go through: |
|
IncrementalPCA also centers the data; wasn't able to tell from looking the SparcePCA code... |
|
I mean there is nothing that looks like centering in SparsePCA, unless the way |
|
so I'd suggest that we make parallel changes to IPCA at least
|
|
@ilemhadri I don't have any other comments, it looks good to me. Maybe just address @jnothman 's comment about IPCA please. Then the PR would need to be accepted by a core developer. Thanks for your contribution! |
|
@ilemhadri I agree with @jnothman that you should also make an addition in the IncrementalICA. |
|
Should we say "centered but not scaled" to be super explicit? |
Looks good phrasing to me. |
|
I am adding the explicitation "centered but not scaled" for PCA. And making the same modifications for IPCA. |
|
Closing in favor of the continuation #13242 |
Following the discussion on the mailing list [subject: unclear help file for sklearn.decomposition.pca, author:Ismael Lemhadri, date:16 Oct 2017], modified the help file of "sklearn/decomposition/pca.py" to make it clear that the data matrix is centered (but not scaled) before performing singular value decomposition.
Reference Issue
What does this implement/fix? Explain your changes.
improve the PCA documentation on data centering
Any other comments?