_PLS y_rotations and discrepancies in U scores  #8392

@Gscorreia89

Description

In the _PLS object, the U scores (latent-variable scores for the Y matrix/vector) are calculated correctly during fitting, for both the NIPALS and SVD algorithms, and stored immediately. In the last steps of the fit method, the x and y rotations are computed and stored. These are not used by .fit to obtain any of the model parameters, but they are used later by .transform, since they provide a much more straightforward way to calculate T (X scores) and U (Y scores) for new samples. For X everything is fine, but for the y rotations, when Y is a single vector the rotation is not calculated and is simply set to 1.

Snippet from the original code:

        if Y.shape[1] > 1:
            self.y_rotations_ = np.dot(
                self.y_weights_,
                linalg.pinv2(np.dot(self.y_loadings_.T, self.y_weights_),
                             **pinv2_args))
        else:
            self.y_rotations_ = np.ones(1)

Because of this, the formula U = Y * y_rotations_ used in the transform method does not give the same scores as those obtained during the initial model fitting.

I suggest removing the dimension check here and calculating the rotations consistently. This fixes the inconsistency between the U scores obtained with the .transform method and the U scores obtained during the .fit method.
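A minimal sketch of the suggested change (the helper name `compute_y_rotations` is mine, not sklearn's): apply the multi-Y formula y_rotations_ = W_y * pinv(Q_y' * W_y) regardless of how many columns Y has, instead of hard-coding np.ones(1):

```python
import numpy as np

def compute_y_rotations(y_weights, y_loadings):
    """Sketch of the proposed fix: use the multi-Y rotation formula
    y_rotations = W_y @ pinv(Q_y.T @ W_y) even when Y has a single
    column, rather than short-circuiting to np.ones(1)."""
    return np.dot(y_weights,
                  np.linalg.pinv(np.dot(y_loadings.T, y_weights)))

# With a single y column and one component all shapes are (1, 1), and
# the result is a genuine rotation rather than the constant 1:
w = np.array([[2.0]])   # y_weights_, shape (n_targets, n_components)
q = np.array([[4.0]])   # y_loadings_, same shape
print(compute_y_rotations(w, q))  # prints [[0.25]], i.e. 2 * pinv(8)
```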

Steps/Code to Reproduce

from sklearn.cross_decomposition import PLSRegression
import numpy as np

# Some random data (seeded for reproducibility)
rng = np.random.RandomState(42)
x = rng.randn(100, 100)
y = rng.randn(100)

# Mean center
xc = x - np.mean(x, 0)
yc = y - np.mean(y)

# Fit
pls = PLSRegression(n_components=1, scale=False)

pls.fit(xc, yc)

# The "correct" U_scores obtained during the fitting process
u_scores = pls.y_scores_

# U scores from the transform method
transform_uscores = pls.transform(X=xc, Y=yc)[1]

# Calculate the y_rotations manually, using the same formula as sklearn
# uses for multi-column Y: C* = pinv(C Q') C
y_rotations = np.dot(np.linalg.pinv(np.dot(pls.y_weights_, pls.y_loadings_.T)), pls.y_weights_)

# U = YC*, using the re-calculated rotations directly
re_u = np.dot(yc.reshape(-1, 1), y_rotations)

# Not zero: the transform U scores disagree with the fitted ones
print(np.max(np.abs(u_scores - transform_uscores)))

# Close to zero: the recalculated rotations reproduce the fitted U scores
print(np.max(np.abs(u_scores - re_u)))

Expected Results

The U scores (or y_scores_) calculated with the .transform method should match the y_scores_ attribute of the fitted _PLS object.

Actual Results

When Y is a single vector, the y_rotations_ attribute is incorrectly set to 1, so the U scores returned by the transform method are wrong (they differ by a scaling factor). Everything is fine when running the PLS algorithms with a multi-column Y matrix.

Versions

Linux-4.4.0-59-generic-x86_64-with-debian-stretch-sid
Python 3.5.1 |Anaconda custom (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.0
SciPy 0.18.1
Scikit-Learn 0.18.1
