_PLS y_rotations and discrepancies in U scores #8392
Description
In the _PLS object, the U scores (the latent variable scores for the Y matrix/vector) are calculated correctly during fitting, for both the NIPALS and SVD algorithms, and stored immediately. In the last steps of the fit method, the x and y rotations are then calculated and stored. These rotations are not used in .fit to obtain any of the model parameters, but the .transform method uses them later, since they provide a much more direct way to calculate T (X scores) and U (Y scores) for new samples. For X all is fine, but when y is a single vector the y rotation is not calculated and is simply set to 1.
Snippet from the original code:
if Y.shape[1] > 1:
    self.y_rotations_ = np.dot(
        self.y_weights_,
        linalg.pinv2(np.dot(self.y_loadings_.T, self.y_weights_),
                     **pinv2_args))
else:
    self.y_rotations_ = np.ones(1)
Because of this, the formula U = Y * y_rotations_ used in the transform method will not give the same scores as the ones obtained during the initial model fitting process.
I suggest removing the dimension check here and calculating the rotations the same way in both cases. This will fix the inconsistency between the U scores returned by the .transform method and the U scores obtained during .fit.
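As a rough sketch of what the suggestion amounts to, the rotation formula from the snippet above can be applied without the shape check. The helper name `y_rotations` and the numeric values are made up for illustration, and `np.linalg.pinv` stands in for `linalg.pinv2`; this is not the actual patch.

```python
import numpy as np

def y_rotations(y_weights, y_loadings):
    """Hypothetical helper: C (Q^T C)^+ with no special case for single-column Y."""
    return y_weights @ np.linalg.pinv(y_loadings.T @ y_weights)

# Illustrative values for one target and one component (both arrays are 1 x 1).
c = np.array([[1.0]])   # y weight (assumed value)
q = np.array([[0.25]])  # y loading (assumed value)

r = y_rotations(c, q)   # 1 / q in the 1 x 1 case, not a constant 1
```

With a 1 x 1 loading the pseudo-inverse reduces to a plain reciprocal, so the rotation carries the scale of the loading instead of being fixed at 1.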
Steps/Code to Reproduce
from sklearn.cross_decomposition import PLSRegression
import numpy as np
# some random data
x = np.random.randn(100, 100)
y = np.random.randn(100)
# Mean center
xc = x - np.mean(x, 0)
yc = y - np.mean(y)
# Fit
pls = PLSRegression(1, scale=False)
pls.fit(xc, yc)
# The "correct" U_scores obtained during the fitting process
u_scores = pls.y_scores_
# U scores from the transform method
transform_uscores = pls.transform(X=xc, Y=yc)[1]
# Calculate the y_rotations manually, following the formula sklearn uses for multi-column Y: C* = pinv(CQ')C
y_rotations = np.dot(np.linalg.pinv(np.dot(pls.y_weights_, pls.y_loadings_.T)), pls.y_weights_)
# U = YC*, using the re-calculated rotations directly
re_u = np.dot(yc.reshape(-1, 1), y_rotations)
# largest absolute difference: not zero...
np.max(np.abs(u_scores - transform_uscores))
# largest absolute difference: close to zero
np.max(np.abs(u_scores - re_u))
Expected Results
The U scores returned by the .transform method should be the same as the y_scores_ attribute of the fitted _PLS object when both are computed on the same data.
Actual Results
When Y is a single vector, the y_rotations_ attribute is incorrectly set to 1, so the U scores computed by the transform method are not correct (they differ by a scaling factor). Everything is OK when running PLS algorithms with a Y matrix.
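The scaling difference can be seen in a simplified one-component, single-target PLS written in plain NumPy. This is a sketch under my own assumptions (regression-style y weights, unit-norm x weights, arbitrary seed and data shape), not scikit-learn's code, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal((50, 1))
X -= X.mean(axis=0)  # mean center
y -= y.mean(axis=0)

# One-component fit for a single target (one NIPALS pass is enough here).
w = X.T @ y
w /= np.linalg.norm(w)        # x weights, unit norm
t = X @ w                     # x scores
c = (y.T @ t) / (t.T @ t)     # y weight (1 x 1)
u_fit = y @ c / (c.T @ c)     # y scores as computed during fitting
q = (y.T @ t) / (t.T @ t)     # y loading (equals c in this setting)

# Rotation computed consistently versus hard-coded to 1.
rot = c @ np.linalg.pinv(q.T @ c)
u_rot = y @ rot               # reproduces u_fit
u_ones = y @ np.ones((1, 1))  # off by the factor c

print(np.max(np.abs(u_fit - u_rot)), np.max(np.abs(u_fit - u_ones)))
```

In this sketch the scores obtained with a rotation of 1 equal the fitted scores multiplied by the y weight `c`, which is the scaling difference described above.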
Versions
Linux-4.4.0-59-generic-x86_64-with-debian-stretch-sid
Python 3.5.1 |Anaconda custom (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.0
SciPy 0.18.1
Scikit-Learn 0.18.1