Skip to content

[MRG] DOC data centering in PCA#9934

Closed
ilemhadri wants to merge 5 commits intoscikit-learn:masterfrom
ilemhadri:pca-help
Closed

[MRG] DOC data centering in PCA#9934
ilemhadri wants to merge 5 commits intoscikit-learn:masterfrom
ilemhadri:pca-help

Conversation

@ilemhadri
Copy link
Copy Markdown

@ilemhadri ilemhadri commented Oct 16, 2017

Following the discussion on the mailing list [subject: unclear help file for sklearn.decomposition.pca, author:Ismael Lemhadri, date:16 Oct 2017], modified the help file of "sklearn/decomposition/pca.py" to make it clear that the data matrix is centered (but not scaled) before performing singular value decomposition.

Reference Issue

What does this implement/fix? Explain your changes.

improve the PCA documentation on data centering

Any other comments?

…ile for sklearn.decomposition.pca, author:Ismael Lemhadri, date:16 Oct 2017], modified the help file of "sklearn/decomposition/pca.py" to make it clear that the data matrix is centered (but not scaled) before performing singular value decomposition.
Copy link
Copy Markdown
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Please have a look at the Travis CI output below: the flake8 job failed because there are some formatting issues.

SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.
data and the number of components to extract. It centers the data matrix
columnwise (but does not scale it) before performing Singular Value Decomposition.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't mention the scaling at all; there is no reason why PCA (or any algorithm) would do a preprocessing unless otherwise stated. Maybe,

The input data is centered for each feature before applying the SVD. 

or something similar?
The SVD abbreviation is used the previous sentence, so there is no need to explain what it is.

@rth
Copy link
Copy Markdown
Member

rth commented Oct 16, 2017

Also please add "[MRG] DOC" to the title of this PR, for instance by renaming it to "[MRG] DOC data centering in PCA"

@ilemhadri ilemhadri changed the title improve the documentation of sklearn.decomposition.pca on data centering [MRG] DOC data centering in PCA Oct 16, 2017
@ilemhadri
Copy link
Copy Markdown
Author

ilemhadri commented Oct 16, 2017

@rth : I have edited both the docstring and the userguide to reflect your suggestions. I am not sure what needs to be done to correct the flake8 formatting issue though.

@rth
Copy link
Copy Markdown
Member

rth commented Oct 16, 2017

@ilemhadri The flake8 are due to the trailing whitespace you added L111 (see diff) and since the previous version had a line with more than 80 char (this should be fine now).

@ilemhadri
Copy link
Copy Markdown
Author

@rth this should be fixed now.

It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.
data and the number of components to extract. The input data is centered
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move that to the first paragraph above? Otherwise seems good.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

clustering algorithm.

Note: the :class:`PCA` object centers the input data for each feature before
applying the SVD.
Copy link
Copy Markdown
Member

@rth rth Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this to the beginning of the previous paragraph (and remove "Note" and "class..") ? .e.g

PCA centers [...] before applying the SVD. The optional parameter whiten etc..

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rth
Copy link
Copy Markdown
Member

rth commented Oct 17, 2017

Added a small comment above. Otherwise LGTM.

@jnothman
Copy link
Copy Markdown
Member

It seems my email reply did not go through:
What is the centering behaviour in IncrementalPCA and SparsePCA, for comparison?

@rth
Copy link
Copy Markdown
Member

rth commented Oct 18, 2017

IncrementalPCA also centers the data; wasn't able to tell from looking the SparcePCA code...

@rth
Copy link
Copy Markdown
Member

rth commented Oct 18, 2017

I mean there is nothing that looks like centering in SparsePCA, unless the way ridge_regression or dict_learning is used has some centering effects. I'm not familiar with that code..

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Oct 18, 2017 via email

@ilemhadri
Copy link
Copy Markdown
Author

@rth @amueller I have now taken into accounts your remarks in my latest push.

@ilemhadri
Copy link
Copy Markdown
Author

@rth @amueller is there anything else I need to do before these changes are approved?

@rth
Copy link
Copy Markdown
Member

rth commented Oct 26, 2017

@ilemhadri I don't have any other comments, it looks good to me. Maybe just address @jnothman 's comment about IPCA please. Then the PR would need to be accepted by a core developer.

Thanks for your contribution!

@glemaitre
Copy link
Copy Markdown
Member

@ilemhadri I agree with @jnothman that you should also make an addition in the IncrementalICA.
Regarding SparsePCA, I checked briefly but I don't think that the data can be aligned (it will destroyed the sparsity) .

@amueller
Copy link
Copy Markdown
Member

Should we say "centered but not scaled" to be super explicit?

@glemaitre
Copy link
Copy Markdown
Member

Should we say "centered but not scaled" to be super explicit?

Looks good phrasing to me.

@rth rth added the Stalled label Jul 22, 2018
@amueller amueller added Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted labels Aug 21, 2018
@abenbihi
Copy link
Copy Markdown
Contributor

I am adding the explicitation "centered but not scaled" for PCA. And making the same modifications for IPCA.

@GaelVaroquaux
Copy link
Copy Markdown
Member

Closing in favor of the continuation #13242

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Documentation Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted Stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants