changed pls_.py to properly scale data, modified documents#6077
OlaPaw wants to merge 2 commits into scikit-learn:main from OlaPaw:my_PLS
Conversation
and added changes to canonical regression
sklearn/cross_decomposition/pls_.py
Outdated
-        self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
-                      self.y_std_)
+        # self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
+        #               self.y_std_)
I am not familiar with PLS but if this is no longer needed please delete those commented out lines.
Some tests are failing. They might need to be updated or relaxed a bit, I am not sure. @arthurmensch can probably have a look once the ICML deadline is over (tomorrow). Also please add one or more new tests to highlight the cases that have been fixed in this PR and that were not properly tested previously.
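For context on the lines being commented out above, here is a minimal, hypothetical numpy sketch (ordinary least squares stands in for the PLS fit, and all variable names are made up) of what that back-scaling accomplishes: a coefficient matrix learned on standardized data is mapped to original units as `coef = (1. / x_std) * coef_std * y_std`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([1.0, 2.0, 5.0, 0.5]) + 3.0
B = rng.normal(size=(4, 2))
Y = X @ B + 1.0

# Standardize both blocks, as PLS does internally when scale=True.
x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
y_mean, y_std = Y.mean(axis=0), Y.std(axis=0, ddof=1)
Xs = (X - x_mean) / x_std
Ys = (Y - y_mean) / y_std

# Fit in the standardized space (OLS stands in for the PLS solution here).
coef_std, *_ = np.linalg.lstsq(Xs, Ys, rcond=None)

# Back-scale to original units, mirroring the commented-out lines.
p = X.shape[1]
coef = (1. / x_std.reshape((p, 1))) * coef_std * y_std

# Predictions in original units: centered X times coef, plus the Y mean.
Ypred = (X - x_mean) @ coef + y_mean
```

On this toy problem, `coef` recovers `B` and `Ypred` recovers `Y`, which is the whole point of the back-scaling step.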
sklearn/cross_decomposition/pls_.py
Outdated
         eps = np.finfo(X.dtype).eps
-        # Inner loop of the Wold algo.
+        # Inner loop of the Wold algo. Very similar to chemometrics::pls2_nipals
+        # in R. Tolarence looks at changes in the X matrix rather than Y
Tolerance. chemometrics is not a widely known reference; either reference it explicitly or I think it should be removed.
Quoting @rlgc79:
I think your changes make the code more explicit but I'd like to have another opinion on this.
         # 2.3 Update y_score: the Y latent scores
-        y_score = np.dot(Y, y_weights) / (np.dot(y_weights.T, y_weights) + eps)
+        # y_score = np.dot(Y, y_weights) / np.dot(y_score.T, y_score) ## BUG
+        y_score = np.dot(Y, y_weights)
Not quite, but the dot product becomes unity, so it is not necessary.
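To illustrate that point, here is a small numpy sketch (hypothetical variable names, with the `eps` guard omitted): once `y_weights` is normalized to unit length, `np.dot(y_weights.T, y_weights)` equals one, so dividing by it changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))
x_score = rng.normal(size=(20, 1))

# NIPALS-style update of the Y weights, followed by normalization to unit norm.
y_weights = np.dot(Y.T, x_score) / np.dot(x_score.T, x_score)
y_weights /= np.linalg.norm(y_weights)

# The denominator is now exactly one (up to floating point) ...
denom = float(np.dot(y_weights.T, y_weights))

# ... so dividing by it leaves the Y latent scores unchanged.
y_score_with_div = np.dot(Y, y_weights) / denom
y_score_plain = np.dot(Y, y_weights)
```

Here `y_score_with_div` and `y_score_plain` coincide, which is why the division can be dropped once the weights are normalized.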
@arthurmensch if @OlaPaw does not reply, please feel free to go ahead and extract the bugfix in a separate new PR, and please also enable the tests on appveyor as done in #6274 to check whether it fixes #6279.
Hi @arthurmensch and @ogrisel, yes, I think you have the entire gist of the bug here. If you don't mind, I'll take a look at this over the weekend and test with all my separate test cases. I'm really new to github and debugging code, so thank you for your patience. I just found a lot of the PLS code somewhat cryptic as it doesn't seem to follow the typical conventions I've seen in literature. Finally, scikit-learn already has a pre-scaling function and I'm wondering whether we should even scale / center anything in this function.
no need for separate check if scale is True, as self_std is set to ones if set to False
Fixed re-scaling and re-centering issues reported previously
Ok, I've removed the cosmetic changes, and just fixed the issues with improper scaling and centering. I tried to leave the rest of the code as pure as possible.

Can you please add a non-regression test and update the existing tests (there are some failures reported by travis).

Unfortunately it does not, I just checked.

@OlaPaw are you still working on this? It would be good to have.

I'm now using PLSRegression. Pls let me know if I can help complete this PR. I got here after noticing my outputs of predict were out of scale.

@jabenitez feel free to help with this by adding the appropriate tests. You can create a new PR based on this one.

@Gscorreia89 as a non PLS expert, it is hard to figure out what remains to be done on this one now that #7819 has been merged. Would you be kind enough to elaborate a bit?
- The `scale` setting was ignored: data was always centered, and always scaled. The data is now scaled appropriately when `scale=True` is set, and not scaled when it's `False`.
- The `predict()` function was always de-centered, but never de-scaled, so that we got `Ypred = Ypred + self.y_mean_` for all cases instead of `Ypred = (Ypred * self.y_std_) + self.y_mean_` when `scale=True`. `coef_` is calculated differently depending on those two cases.
- The calculated `coef_` was also scaled unnecessarily by the standard deviations of X and Y: this is now commented out. Only matrices W and C are scaled by their norms in the inner loop. The outer loop scales the P, Q, and C (canonical and regression) matrices by `T.T * T` or `U.T * U`.
- No explicit calls to `sklearn.preprocessing.scale` are included, though that could be a useful step for most users.
- If data are pre-processed with `sklearn.preprocessing.scale` (or column mean centered and column variance scaled), the pls_ functions all return the SAME results when `scale=True` and when `scale=False`; this was not the case in the previous version of the code.
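The `predict()` point above can be sketched with plain numpy (hypothetical names mirroring the `y_mean_` / `y_std_` attributes; a model that reproduces the standardized targets exactly stands in for the fitted PLS):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(loc=5.0, scale=3.0, size=(50, 2))

# Centering/scaling statistics stored at fit time when scale=True.
y_mean_ = Y.mean(axis=0)
y_std_ = Y.std(axis=0, ddof=1)
Yk = (Y - y_mean_) / y_std_  # targets in the internal, standardized space

# Pretend the model predicts the standardized targets exactly.
Ypred_buggy = Yk + y_mean_             # old behavior: de-centered only
Ypred_fixed = (Yk * y_std_) + y_mean_  # fix: de-scaled, then de-centered
```

With a nontrivial `y_std_`, only `Ypred_fixed` lands back on the original `Y`; the de-centered-only version is out of scale, which is exactly the symptom reported in this thread.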