ENH: Display the number and names of output features#31937

Open
DeaMariaLeon wants to merge 198 commits into scikit-learn:main from DeaMariaLeon:features2

Conversation

@DeaMariaLeon
Member

@DeaMariaLeon DeaMariaLeon commented Aug 13, 2025

Reference Issues/PRs

Towards #26595

Any other comments?

Example
Screenshot 2025-08-20 at 18 14 43

@github-actions

github-actions bot commented Aug 13, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2005a4e. Link to the linter CI: here

@DeaMariaLeon DeaMariaLeon changed the title WIP: Display the shape of outgoing data structures WIP: Display the number of outgoing data structures Aug 18, 2025
@DeaMariaLeon DeaMariaLeon changed the title WIP: Display the number of outgoing data structures WIP: Display the number of output features Aug 18, 2025
@DeaMariaLeon DeaMariaLeon marked this pull request as ready for review August 21, 2025 09:03
@DeaMariaLeon
Member Author

I wonder if I can have feedback before I add/fix more tests.
@glemaitre

@DeaMariaLeon DeaMariaLeon changed the title WIP: Display the number of output features ENH: Display the number of output features Aug 21, 2025
@glemaitre glemaitre self-requested a review August 22, 2025 09:03
@glemaitre
Member

I see that we have to handle one specific case:

image

We have an internal PassThrough transformer that forwards the input features as-is, which means we should indicate that the output features are the same as the input features (n_features_in_).

@jeremiedbb
Member

I find the block a bit big; it takes as much space as the estimator itself. I was also thinking that showing the input features would be nice, but then it really starts to take a lot of space around the estimator. So I wondered if the features could be intermediate blocks in the diagram, representing both the output features of the previous estimator and the input features of the next estimator. Something like this:
[image: output_features]

This way the diagram alternates estimator blocks and data blocks.
Then the text would obviously be different, like "16 features", or even the full shape?

In addition, in this PR or in a follow-up one, the data block could be unfolded to show the feature names if available.

@glemaitre
Member

One piece of feedback from @ogrisel IRL is to show the feature names directly, using the same pattern as "Parameters".

I personally agree with @jeremiedbb's feedback: I would like something smaller. Also, right now we have to write "output features" instead of simply "features" because of the input/output ambiguity when the block is attached to the estimator. So the proposal to make the features standalone blocks is nice, I think, because there is no ambiguity anymore.

@DeaMariaLeon
Member Author

I'll work on this, thanks for the feedback. Just:

One piece of feedback from @ogrisel IRL is to show the feature names directly, using the same pattern as "Parameters".

Should I add the feature names on this PR? I remember @glemaitre saying that they should be added on a separate PR.

@glemaitre
Member

Should I add the feature names on this PR?

I wanted to keep it separate at first, but since we are going to create a new block, it might be better to include the feature names directly as well.

@DeaMariaLeon
Member Author

DeaMariaLeon commented Mar 18, 2026

Hi @DeaMariaLeon, here is first pass of review (I haven't looked at the tests yet).

I don't see any changes made to plot_column_transformer_mixed_types. Could you try set_output(transform="pandas") as suggested here #31937 (comment) and as you have already done for plot_cyclical_feature_engineering ?

I thought that his comment was just an explanation on the question you had. I didn't understand I should actually make the change. I'll do it.

EDIT: Do you know how to do that? I either get an error, or keep getting just the "x0, x4, x5" etc. @antoinebaker

@DeaMariaLeon
Member Author

I did reply to #31937 (comment), but it's only visible on github's "Files changed".

@DeaMariaLeon
Member Author

DeaMariaLeon commented Mar 19, 2026

In this comment: #31937 (comment)
I wrote an "EDIT" that may be difficult to see. So I'll add that here just in case:

Hi @DeaMariaLeon, here is first pass of review (I haven't looked at the tests yet).

I don't see any changes made to plot_column_transformer_mixed_types. Could you try set_output(transform="pandas") as suggested here #31937 (comment) and as you have already done for plot_cyclical_feature_engineering ?

Me:

I thought that his comment was just an explanation on the question you had. I didn't understand I should actually make the change. I'll do it.

Me again:
Do you know how to do that? I either get an error, or keep getting just the "x0, x4, x5" etc. @antoinebaker

@antoinebaker
Contributor

antoinebaker commented Mar 19, 2026

Do you know how to do that? I either get an error, or keep getting just the "x0, x4, x5" etc. @antoinebaker

Well, if set_output(transform="pandas") does not work, don't bother :) Could you instead create a new issue with a screenshot for this example? (after this PR is merged)

@DeaMariaLeon
Member Author

Well, if set_output(transform="pandas") does not work, don't bother :) Could you instead create a new issue with a screenshot for this example? (after this PR is merged)

Looking at this again, I fail to see the issue. In that particular example (plot_column_transformer_mixed_types), the inputs of SelectPercentile are not the original features. Its inputs are already transformed by OneHotEncoder, so how can it give the names of the original columns? I think that what it shows is correct, but I may be missing something.

@antoinebaker
Copy link
Copy Markdown
Contributor

antoinebaker commented Mar 20, 2026

Its inputs are already transformed by OneHotEncoder, so how can it give the names of the original columns?

Not the original names but the names output by the OneHotEncoder:

sex, pclass -> OneHotEncoder -> sex_female, sex_male, pclass_1, pclass_2, ... -> SelectPercentile -> sex_female, sex_male, pclass_1, pclass_3

which makes the preprocessing much easier to follow than "anonymous" column names such as x0, x1, ...

Capture d’écran 2026-03-20 à 10 12 00

@DeaMariaLeon
Member Author

You are right, it would be easier.

@DeaMariaLeon
Member Author

But it works:
Screenshot 2026-03-20 at 12 35 56

@DeaMariaLeon
Member Author

DeaMariaLeon commented Mar 20, 2026

I mean: it works with a small change to the example.

EDIT (note for myself): There was a known issue with OneHotEncoder, and I needed to set sparse_output=False on it in the example. That way one can use set_output(transform="pandas") without breaking the example.

@DeaMariaLeon
Member Author

DeaMariaLeon commented Mar 20, 2026

So far, I think I have addressed all the feedback from @antoinebaker except #31937 (comment) because of the circular import.

EDIT: Imported ColumnTransformer as suggested.

Contributor

@antoinebaker antoinebaker left a comment

Thanks for the PR @DeaMariaLeon! LGTM.

@DeaMariaLeon
Member Author

Thanks @antoinebaker!
