
DOC Improve plot_target_encoder_cross_val.py example#26677

Merged
thomasjpfan merged 15 commits into scikit-learn:main from lucyleeow:doc_te_cv_example
Jul 27, 2023

Conversation

@lucyleeow
Member

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes some typos, changes some formatting and wording, and adds titles and axis labels to all graphs.

Any other comments?

cc @thomasjpfan

@github-actions

github-actions bot commented Jun 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6cfaeb5. Link to the linter CI: here

Member

@ArturoAmorQ ArturoAmorQ left a comment


Thanks for the PR, @lucyleeow! Here are a couple of comments.

_ = coefs_cv.plot(kind="barh")
ax = coefs_cv.plot(kind="barh")
_ = ax.set(
title="Target encoded with cross validation",
Member


Here (and later in this PR) we need to use the term cross-fitting and not cross-validation. The difference is that no validation score is computed in the target encoder (see e.g. https://arxiv.org/pdf/2007.02852.pdf).

Maybe we can explain the term earlier in the example, and even add an entry to the glossary and link it from here.
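The distinction can be sketched in plain Python. This is a toy illustration of cross-fitting for target encoding, not scikit-learn's actual implementation: the encoding applied to each training sample is computed from the complementary folds only, so a sample's own target never leaks into its encoded value.

```python
from collections import defaultdict


def target_encode_cross_fit(categories, targets, n_folds=2):
    """Encode each category by the mean target computed on the *other* folds.

    Toy sketch of cross-fitting (hypothetical helper, not scikit-learn's
    TargetEncoder): no score is computed anywhere, which is why the term
    "cross-validation" does not really apply.
    """
    n = len(categories)
    # Simple interleaved fold assignment: fold k holds indices k, k+n_folds, ...
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    encoded = [0.0] * n
    global_mean = sum(targets) / n
    for fold in folds:
        in_fold = set(fold)
        # Per-category target means computed on the complementary subset only.
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if i not in in_fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in fold:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded
```

With two samples per category and two folds, each sample is encoded by the other sample's target, never its own: `target_encode_cross_fit(["a", "a", "b", "b"], [1, 0, 1, 0])` gives `[0.0, 1.0, 0.0, 1.0]`, whereas a naive full-data encoding would give `0.5` everywhere.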

Member Author

@lucyleeow lucyleeow Jun 24, 2023


Good idea. Will have a read then add to glossary

Member Author


Done

Member

@ogrisel ogrisel left a comment


Thanks for this PR. Cross-linking to a new entry in the glossary is indeed a very good idea.

doc/glossary.rst Outdated
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
Member


Maybe it would be useful to cross-link to typical (meta-)estimators that implement this strategy (e.g. StackingRegressor/Classifier and TargetEncoder, CalibratedClassifierCV...).

Note: the current implementation of CalibratedClassifierCV is a bit different from the others because it does not attempt to concatenate the predictions of the first stage to fit the second stage (but instead fits one second stage estimator per cross-fitting iteration and then averages the predictions of the second stage model). Still, I think those are morally all examples of a cross-fitting strategy.

Another example, not yet implemented in scikit-learn, is honest trees and forests, where the values of the leaves are estimated on different training samples than those used to learn the decision splits of the trees.
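The meta-estimators listed above all share the same core: out-of-fold first-stage predictions feed the second stage. A generic sketch of that core, using hypothetical `fit`/`predict` callables standing in for any first-stage estimator (this is not the actual code of any of these estimators):

```python
def cross_fit_first_stage(X, y, fit, predict, n_folds=2):
    """Compute out-of-fold first-stage predictions.

    `fit(X, y)` returns a fitted first-stage model; `predict(model, x)`
    returns its prediction for one sample. The second stage (stacker,
    calibrator, ...) would then be trained against `oof` rather than
    against in-sample predictions, avoiding the overfitting bias.
    """
    n = len(X)
    oof = [None] * n
    for k in range(n_folds):
        # Train the first stage on all folds except fold k ...
        train_idx = [i for i in range(n) if i % n_folds != k]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # ... and predict only on the held-out fold k.
        for i in range(k, n, n_folds):
            oof[i] = predict(model, X[i])
    return oof
```

For instance, with a first stage that simply predicts its training-target mean, each sample's out-of-fold prediction is the mean of the complementary fold's targets, not of its own.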

Member Author


Good idea, I meant to do this but forgot!

doc/glossary.rst Outdated
Comment on lines +210 to +214
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
Member


I don't find the "train" and "estimation" names very clear because one could argue that training and estimating are similar. Maybe it would be helpful to speak in terms of first and second stage as follows:

Suggested change
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
A resampling method that iteratively partitions data into mutually exclusive
subsets to fit a two stage estimator. The second stage is fit on predictions
of the first stage computed on a subset of data not seen when training the
first stage. The objective is to avoid having any overfitting in the first
stage introduce a bias in the training data distribution of the second stage.

Member Author


'mutually exclusive' is much clearer and more explicit than 'complementary', have used this in the cross validation entry too, thanks!
Is TargetEncoder technically a 2 stage estimator? There isn't really a second stage; it just outputs the transformed X values for use by another estimator...? I have thus tried to make it more general and re-worded. Not entirely happy though, and open to changes. Also I may have misunderstood some aspects!

Member


TargetEncoder is the first stage and the HistGradientBoosting model (or anything else) you train on the output of TargetEncoder is the second stage.

For stacking and calibration we use a meta-estimator, whereas for TargetEncoder we use a transformer pipeline and implement the cross-fitting only in the fit_transform of the first-stage transformer.
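The asymmetry described here can be sketched as a toy transformer (a hypothetical, much-simplified stand-in for scikit-learn's TargetEncoder): `fit`/`transform` use full-data category means, which is appropriate for unseen test data, while `fit_transform` cross-fits the training encodings.

```python
from collections import defaultdict


class ToyTargetEncoder:
    """Toy first-stage transformer; illustrates the API shape only."""

    def fit(self, X, y):
        # Full-data per-category target means, used to encode unseen data.
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(X, y):
            sums[c] += t
            counts[c] += 1
        self.means_ = {c: sums[c] / counts[c] for c in counts}
        self.default_ = sum(y) / len(y)
        return self

    def transform(self, X):
        return [self.means_.get(c, self.default_) for c in X]

    def fit_transform(self, X, y, n_folds=2):
        # Cross-fitting: each training sample is encoded with means learned
        # on the complementary folds, so the second-stage model is never
        # shown an encoding that contains that sample's own target.
        self.fit(X, y)  # keep full-data means for later transform() calls
        out = [self.default_] * len(X)
        for k in range(n_folds):
            sub = ToyTargetEncoder().fit(
                [c for i, c in enumerate(X) if i % n_folds != k],
                [t for i, t in enumerate(y) if i % n_folds != k],
            )
            for i in range(k, len(X), n_folds):
                out[i] = sub.means_.get(X[i], sub.default_)
        return out
```

A second-stage model trained on the output of `fit_transform` then scores test data via plain `transform`, mirroring the pipeline pattern described above.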

Member Author

@lucyleeow lucyleeow Jul 3, 2023


Thanks, yes that is my understanding too.

Now I realise my misunderstanding was over what 'two stage estimator' means. Initially I took it to mean a single estimator that performs both stages, which none of the estimators do (CalibratedClassifierCV technically takes an estimator argument which does the first stage). Whereas you meant an estimator that is designed to perform a two stage process. What about changing the wording to something like:
"to fit a two stage estimator chain" or "to fit an estimator in a two stage process/model"?

(Just to clarify to prevent the uninitiated, like me, from mis-understanding?)

Member Author


ping @ogrisel , would either suggestion be suitable?

# We evaluate the model that did not use :term:`cross fitting` when encoding and
# see that it overfits:
print(
"Model without CV on training set: ",
Member


Maybe even here:

Suggested change
"Model without CV on training set: ",
"Model trained without cross-fitting: ",

Member


and similar change in other print statements.

# see that it overfits:
print(
"Model without CV on training set: ",
model_no_cv.score(X_train_no_cv_encoding, y_train),
Member


Not sure if it would help to also rename the variables (e.g. model_without_crossfitting / X_train_no_cf_encoding).

As you wish.

doc/glossary.rst Outdated
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
Member


Suggested change
See :ref:`User Guide <_cross_validation>` for more details.
See the :ref:`user guide <cross_validation>` for more details.

encode="ordinal",
strategy="uniform",
random_state=rng,
subsample=None,
Member


Why is subsampling disabled here?

Member Author

@lucyleeow lucyleeow Jun 27, 2023


Sorry, should have explained. This was just to avoid the FutureWarning:

/home/circleci/project/sklearn/preprocessing/_discretization.py:239: FutureWarning:

In version 1.5 onwards, subsample=200_000 will be used by default. Set subsample explicitly to silence this warning in the mean time. Set subsample=None to disable subsampling explicitly.

I was not sure of the original intent. I think the current default (what is currently used in the example) is None, so I have used this, but happy to change to the new default 200_000?

Member

@ArturoAmorQ ArturoAmorQ left a comment


Just a couple of comments. Otherwise LGTM.

doc/glossary.rst Outdated
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
Member


The CI is failing due to the first underscore.

Suggested change
See :ref:`User Guide <_cross_validation>` for more details.
See the :ref:`User Guide <cross_validation>` for more details.

doc/glossary.rst Outdated
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
Member


Suggested change
each iteration are combined and used in a downstream prcess, generally
each iteration are combined and used in a downstream process, generally

model_no_cv.coef_, index=model_no_cv.feature_names_in_
).sort_values()
_ = coefs_no_cv.plot(kind="barh")
ax = coefs_cv.plot(kind="barh")
Member


Suggested change
ax = coefs_cv.plot(kind="barh")
ax = coefs_no_cv.plot(kind="barh")

Member Author


Thanks, good pick up!

Member

@thomasjpfan thomasjpfan left a comment


Although I have not seen it in ML literature, I like the introduction of the term "cross fitting". LGTM

@thomasjpfan
Member

I think the comments from @ogrisel are addressed. Specifically, the updated wording for "cross-fitting" addresses the open comment in #26677 (comment).

With that I am going to merge, I think the rewording in this PR is really useful and we can iterate from here.

@thomasjpfan thomasjpfan merged commit 8f63882 into scikit-learn:main Jul 27, 2023
@lucyleeow lucyleeow deleted the doc_te_cv_example branch July 28, 2023 00:18
punndcoder28 pushed a commit to punndcoder28/scikit-learn that referenced this pull request Jul 29, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
