-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
Confusing pretty print repr for nested Pipeline #13372
Description
Taking the examples from the docs (https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py) that involves some nested pipelines in columntransformer in pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])The repr that you get for this pipeline:
In [8]: clf
Out[8]:
Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipe...cept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))])which I found very confusing: the outer pipeline seems to have only 1 step (the 'preprocessor', as the 'classifier' disappeared in the ...).
It's probably certainly not easy to get a good repr in all cases, and for sure the old behaviour was even worse (it would show the first 'imputer' step of the pipeline inside the column transformer as if it was the second step of the outer pipeline ..). But just opening this issue as a data point for possible improvements.
Without knowing how the current repr is determined: ideally I would expect that, if the full repr is too long, we first try to trim it step per step of the outer pipeline, so that the structure of that outer pipeline is still visible. But that is easier to write than to code .. :)
cc @NicolasHug