Confusing pretty print repr for nested Pipeline #13372

@jorisvandenbossche

Taking the example from the docs (https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), which involves pipelines nested inside a ColumnTransformer, itself inside a Pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

The repr that you get for this pipeline:

In [8]: clf
Out[8]: 
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipe...cept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))])

which I found very confusing: the outer pipeline seems to have only 1 step (the 'preprocessor', as the 'classifier' disappeared in the ...).
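
One possible explanation for the missing step name, sketched here under the assumption that the repr is shortened by a purely character-based cut that keeps the head and tail of the string and drops the middle. The issue does not confirm how the repr is actually computed; `truncate_middle` and the toy strings below are hypothetical and not part of scikit-learn.

```python
def truncate_middle(text, max_chars):
    """Keep the first and last max_chars // 2 characters, joined by '...'."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "..." + text[-half:]


# Toy stand-in for a long nested repr; this is not real scikit-learn output.
full = (
    "Pipeline(steps=[('preprocessor', ColumnTransformer(" + "x" * 200 + ")), "
    "('classifier', LogisticRegression(" + "y" * 200 + "))])"
)

short = truncate_middle(full, 80)
print(short)
# The second step name falls inside the dropped middle section.
print("'classifier'" in short)
```

A cut like this knows nothing about the nesting, so whichever step name happens to sit in the middle of the string is lost, while the tail still shows parameters that visually appear to belong to the first step.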

It is certainly not easy to get a good repr in all cases, and the old behaviour was even worse (it would show the first 'imputer' step of the pipeline inside the column transformer as if it were the second step of the outer pipeline). I am just opening this issue as a data point for possible improvements.

Without knowing how the current repr is determined: ideally, if the full repr is too long, I would expect us to first trim it step by step for the outer pipeline, so that the structure of that outer pipeline stays visible. But that is easier to write than to code :)
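
The step-wise trimming idea could be sketched roughly as follows. This is only an illustration under simplifying assumptions: each step is given as a `(name, repr_text)` pair of plain strings, the per-step budget `per_step_chars` is an arbitrary choice, and `trim_steps` is a hypothetical helper rather than anything in scikit-learn.

```python
def trim_steps(steps, per_step_chars=40):
    """Render (name, repr_text) pairs, shortening each step's repr separately."""
    rendered = []
    for name, repr_text in steps:
        if len(repr_text) > per_step_chars:
            # Cut inside this step only, so the step name is never dropped.
            repr_text = repr_text[:per_step_chars] + "..."
        rendered.append(f"({name!r}, {repr_text})")
    # Align continuation lines under the first step.
    separator = ",\n" + " " * len("Pipeline(steps=[")
    return "Pipeline(steps=[" + separator.join(rendered) + "])"


# Toy repr strings standing in for the real nested estimators.
steps = [
    ("preprocessor",
     "ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3)"),
    ("classifier",
     "LogisticRegression(solver='lbfgs', max_iter=100, penalty='l2')"),
]

out = trim_steps(steps)
print(out)
```

Because the budget is applied per step rather than to the whole string, both 'preprocessor' and 'classifier' remain visible however long the nested reprs are. A real implementation would also need to handle nesting inside each step and keep brackets balanced, which this sketch does not attempt.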

cc @NicolasHug
