
FIX DOC MNT Big revamp of cross-decomposition module #17095

Merged
NicolasHug merged 33 commits into scikit-learn:master from NicolasHug:la_PLS_qui_porte_si_bien_son_nom
Aug 31, 2020

Conversation

Member

@NicolasHug NicolasHug commented Apr 30, 2020

Closes #4122
Closes #8392
Probably Fixes #4469, though I can't reproduce.
Fixes #11645
Closes #13521
Closes #16177

This PR is a rework of the cross-decomposition module which had been left for dead for years. Mainly, docs and tests were added, and code was simplified. This hopefully comes with a few bug fixes.

This is a big PR, but it's a lot of docs. You might want to ignore the diff and just review the files from scratch, given the amount of changes. The good news is that there are docs now, so you're off to a much better start than I was.

Other stuff:

  • use 1d shapes for vectors instead of 2d shapes, which makes the outer products more obvious.

  • for PLSSVD, CCA, and PLSCanonical, n_components now raises a FutureWarning if it's not in [1, min(n_samples, n_features, n_targets)]. This is only for backward compatibility; an error will be raised in 2 versions. n_components cannot be greater than the rank of the cross-covariance matrix X.T.dot(Y), which is bounded as above. For PLSRegression, the rank is only bounded by n_features. See comments in code.

  • PLSSVD, CCA and PLSCanonical:

    • deprecated the x_scores_ and y_scores_ attributes since they're redundant: they're just the transformed matrices of X and Y. For PLSRegression, the y_scores are different, so I didn't deprecate them there (though I doubt they are useful anyway).
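The rank bound behind the new n_components check is a plain linear-algebra fact and can be verified with numpy alone. This is an illustrative sketch, not code from the PR; the variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 20, 5, 3
X = rng.normal(size=(n_samples, n_features))
Y = rng.normal(size=(n_samples, n_targets))

# Center the columns, as the PLS estimators do internally.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# The cross-covariance matrix has shape (n_features, n_targets), and its
# rank can never exceed min(n_samples, n_features, n_targets) -- the upper
# bound now enforced on n_components for PLSSVD, CCA and PLSCanonical.
C = Xc.T @ Yc
rank = np.linalg.matrix_rank(C)
bound = min(n_samples, n_features, n_targets)
print(rank, bound)
```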

@NicolasHug NicolasHug changed the title from [WIP] Reworking of PLS to [MRG] Reworking of PLS May 4, 2020
@NicolasHug NicolasHug changed the title from [MRG] Reworking of PLS to [MRG] Reworking and docs for cross-decomposition module May 4, 2020
@NicolasHug
Member Author

CC @thomasjpfan

@amueller you have time for reviews now, right? Welcome back!! :p

Member

@TomDLT TomDLT left a comment


Nice cleaning!

Do we need a test for #11645?

Classes included in this module are :class:`PLSRegression`
Apart from CCA, the PLS estimators are particularly suited when the matrix of
predictors has more variables than observations, and when there is
multicollinearity among the features. By contrast, standard linear regression
Member


The fact that PLS handles the multicollinearity issue, while linear regression would fail unless regularized, seems an authoritative argument here. Do we have a ref or example that demonstrates this?

Member Author


When features are collinear, the covariance matrix is singular and thus non-invertible.

I don't have a ref at hand, but I'm assuming this is common knowledge (could be wrong?)

Member

@agramfort agramfort Aug 31, 2020


Are you referring to this https://github.com/scikit-learn/scikit-learn/pull/17095/files#diff-df97917f68917d3a110df30940d771dfR176 ? Is it a statement about PLS vs CCA rather than PLS vs linear regression? Maybe I am nitpicking here.

Member Author


It's mostly a statement about PLS vs non-regularized LR: LR is unstable when there is collinearity among features, PLS is not. However, CCA is unstable too (as described in the link you mentioned).

Member

@agramfort agramfort left a comment


thx @NicolasHug for clarifying

@NicolasHug
Member Author

Thanks @TomDLT and @agramfort for the reviews!

Let me push a what's new entry, and I'll merge when green

@NicolasHug NicolasHug changed the title from [MRG] Reworking and docs for cross-decomposition module to FIX DOC MNT Big revamp of cross-decomposition module Aug 31, 2020
@NicolasHug NicolasHug merged commit 8061aac into scikit-learn:master Aug 31, 2020
@NicolasHug NicolasHug deleted the la_PLS_qui_porte_si_bien_son_nom branch August 31, 2020 16:38
@NicolasHug
Member Author

Docs look good, merging.
Thanks for the reviews!

@hendriklohse

@NicolasHug I have a question about the documentation. What exactly does the attribute x_loadings_ represent? The correlation, or the coefficients of the linear combinations used to transform X? (See also https://stackoverflow.com/questions/78725061/what-does-the-attribute-x-loadings-represent?noredirect=1#comment138800857_78725061)



6 participants