Skip to content

FIX ensure to return 1-D array with single target in PLS* and CCA#20355

Closed
ghost wants to merge 15 commits intomainfrom
unknown repository
Closed

FIX ensure to return 1-D array with single target in PLS* and CCA#20355
ghost wants to merge 15 commits intomainfrom
unknown repository

Conversation

@ghost
Copy link
Copy Markdown

@ghost ghost commented Jun 25, 2021

Reference Issues/PRs

Superseded #19409
Fixes #19352
Fixes #19409

What does this implement/fix? Explain your changes.

Fixes #19352 so IterativeImputer works with PLSRegression as estimator.

Any other comments?

  • Added test case in existing shape test.
  • test_impute.py coverage with 97%
  • Passed all imputation tests.

Copy link
Copy Markdown
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR @polocorona , this looks good but instead of changing the imputer, I think we should be updating the PLSRegression estimator as suggested in #19352 (comment)

@ghost
Copy link
Copy Markdown
Author

ghost commented Jun 25, 2021

@NicolasHug Done. Let me know if I understood your request correctly. 😄

@ghost ghost requested a review from NicolasHug June 25, 2021 20:04
Copy link
Copy Markdown
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @polocorona , this looks great!

I made some minor comments, and we could also add a test in test_pls.py to make sure that ndim == 1 for single-target regression.

Pinging @ogrisel @glemaitre @jjerphan as you reviewed #19409, I think this one is more correct.

@NicolasHug
Copy link
Copy Markdown
Member

NicolasHug commented Jun 28, 2021

BTW this doesn't affect just PLSRegression but all the estimators that inherit from _PLS so we should test and mention these ones as well

@ghost ghost changed the title [MRG] Fix IterativeImputer to work with PLSRegression as estimator [MRG] Fix _PLS class to properly return 1-d prediction array for single-target prediction Jun 28, 2021
@ghost ghost requested a review from NicolasHug June 29, 2021 03:53
Copy link
Copy Markdown
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @polocorona ! This LGTM with some minor comments

Copy link
Copy Markdown
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @polocorona for your contribution!

I just have one comment before approving.

@ghost ghost requested a review from jjerphan June 29, 2021 16:19
Copy link
Copy Markdown
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @polocorona!

@ghost
Copy link
Copy Markdown
Author

ghost commented Jun 30, 2021

@NicolasHug should I wait for another review for the merge?

@jjerphan
Copy link
Copy Markdown
Member

@polocorona: yes, two core-devs must give their approvals.

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I am not sure many people use PLS and CCA with 1d targets, I am worried about backward compat. See below:

X_test = np.random.randn(10, 3)
Y_pred = pls.predict(X_test)

assert Y_pred.shape == (10,)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried this is a breaking change for users who already have code running with 1d Y shaped as (n_samples, 1). Would it work for the imputation meta estimator if we made the estimator record the shape of Y at fit time to reuse it at predict time?

    X = np.random.randn(10, 3)
    Y = np.random.randn(10, 1)

    pls = Estimator(n_components=Y.shape[1])
    pls.fit(X, Y)

    X_test = np.random.randn(10, 3)
    Y_pred = pls.predict(X_test)

    assert Y_pred.shape == (10, 1)

    pls = Estimator(n_components=Y.shape[1])
    pls.fit(X,  np.ravel(Y))

    X_test = np.random.randn(10, 3)
    Y_pred = pls.predict(X_test)

    assert Y_pred.shape == (10,)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that'd be fine.

We might not need to explicitly record y.shape as it should be embedded already in self.y_weights_ and self.y_loadings_

But for what it's worth, we're not consistent at all w.r.t. the output shape when passing y with y.shape == (n_samples, 1). Here are a few estimators:

Estimator                     , output shape   , 'supports multioutput'
HistGradientBoostingClassifier, (100,)         , False
HistGradientBoostingRegressor , (100,)         , False
RandomForestClassifier        , (100,)         , True
RandomForestRegressor         , (100,)         , True
LogisticRegression            , (100,)         , False
LinearRegression              , (100, 1)       , True
SGDClassifier                 , (100,)         , False
SGDRegressor                  , (100,)         , False
SVC                           , (100,)         , False
PLSRegression                 , (100, 1)       , True

Those that don't support multioutput will warn with "DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."
It's expected that these ones will output (n_samples,), but for example the RandomForest estimators support multioutput and still ravel the predictions, unlike PLSRegression and LinearRegression.

For ref, the our regressors checks ensure that y_pred.shape == y.shape and our classifiers checks ensure that y_pred.shape == (n_samples,), but because of the data that these checks use, neither of them can detect such discrepancies.

@glemaitre glemaitre changed the title [MRG] Fix _PLS class to properly return 1-d prediction array for single-target prediction FIX ensure to return 1-D array with single target in PLS* and CCA Jul 22, 2021
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @NicolasHug is right. We should however open a new issue and make sure to correct the common test as well.

leopoloc0 and others added 2 commits July 22, 2021 19:45
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ghost ghost requested a review from glemaitre July 23, 2021 21:27
@glemaitre
Copy link
Copy Markdown
Member

Uhm so apparently, we have some 1D array that is coming before the squeeze. So it seems that we will need to have if Ypred.ndim == 2 condition then

@cmarmo cmarmo modified the milestone: 1.0.2 Nov 18, 2021
@cmarmo
Copy link
Copy Markdown
Contributor

cmarmo commented Nov 18, 2021

Hi @polocorona, thanks for your patience! Do you mind synchronizing with upstream to rerun the checks? Also please, could you move the changelog entry to 1.0.2?
In fact you already have two approvals now. @glemaitre is this mergeable once synchronized? Thanks!

@glemaitre
Copy link
Copy Markdown
Member

It is not straightforward indeed. It all depends if we agree to introduce a backwards-incompatible change.
I opened #20603 to check how many of these regressors we have issues with. Since that there are a couple of them, it might be advisable to have a deprecation cycle and make the IterativeImputer work with both cases meanwhile. I think that we should add this item to be discussed in the next developer meeting.

@cmarmo cmarmo added the Needs Decision Requires decision label Nov 22, 2021
@glemaitre glemaitre removed their request for review December 16, 2021 17:01
@ghost ghost closed this by deleting the head repository Aug 14, 2023
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interactive Imputer cannot accept PLSRegression() as an estimator due to "shape mismatch"

6 participants