DOC Improve plot_target_encoder_cross_val.py example #26677

thomasjpfan merged 15 commits into scikit-learn:main

Conversation
ArturoAmorQ
left a comment
Thanks for the PR, @lucyleeow! Here are a couple of comments.
```diff
-_ = coefs_cv.plot(kind="barh")
+ax = coefs_cv.plot(kind="barh")
+_ = ax.set(
+    title="Target encoded with cross validation",
```
Here (and later in this PR) we need to use the term cross-fitting rather than cross-validation, the difference being that no validation (i.e. computation of a score) is performed in the target encoder (see e.g. https://arxiv.org/pdf/2007.02852.pdf).
Maybe we can explain the term earlier in the example, and even add an entry to the glossary and link it from here.
Good idea. Will have a read and then add it to the glossary.
ogrisel
left a comment
Thanks for this PR. Cross-linking to a new entry in the glossary is indeed a very good idea.
doc/glossary.rst
Outdated
```
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
```
Maybe it would be useful to cross-link to typical (meta-)estimators that implement this strategy (e.g. StackingRegressor/Classifier and TargetEncoder, CalibratedClassifierCV...).
Note: the current implementation of CalibratedClassifierCV is a bit different from the others because it does not attempt to concatenate the predictions of the first stage to fit the second stage (but instead fits one second stage estimator per cross-fitting iteration and then averages the predictions of the second stage model). Still, I think those are morally all examples of a cross-fitting strategy.
Another example, not yet implemented in scikit-learn, is honest trees and forests, where the values of the leaves are estimated on different training samples than those used to learn the decision splits of the trees.
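The generic cross-fitting pattern these (meta-)estimators share can be sketched by hand with `cross_val_predict` (a simplified illustration, not how any particular estimator is implemented): the second stage is fit on out-of-fold first-stage predictions, so no training sample's first-stage prediction comes from a model that saw it.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

first_stage = LinearRegression()
# Out-of-fold predictions: each sample is predicted by a model
# fit on folds that did not contain it.
oof_pred = cross_val_predict(first_stage, X, y, cv=5)

# The second stage fits on out-of-fold first-stage predictions,
# avoiding the optimistic bias of in-sample predictions.
second_stage = Ridge().fit(oof_pred.reshape(-1, 1), y)
```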
Good idea, I meant to do this but forgot!
doc/glossary.rst
Outdated
```
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
```
I don't find the "train" and "estimation" names very clear because one could argue that training and estimating are similar. Maybe it would be helpful to speak in terms of first and second stage as follows:
```diff
-A resampling method that iteratively partitions data into complementary
-'train' and 'estimation' subsets. The 'estimation' subset may be used
-to estimate a parameter or predict a target. The outputs from
-each iteration are combined and used in a downstream prcess, generally
-to avoid bias.
+A resampling method that iteratively partitions data into mutually exclusive
+subsets to fit a two stage estimator. The second stage is fit on predictions
+of the first stage computed on a subset of data not seen when training the
+first stage. The objective is to avoid having any overfitting in the first
+stage introduce a bias in the training data distribution of the second stage.
```
'mutually exclusive' is much clearer and more explicit than 'complementary', have used this in the cross validation entry too, thanks!
Is TargetEncoder technically a two stage estimator? There isn't really a second stage; it just outputs the transformed X values for use by another estimator...? I have thus tried to make it more general and re-worded it. Not entirely happy though, and open to changes. Also I may have misunderstood some aspects!
TargetEncoder is the first stage and the HistGradientBoosting model (or anything else) you train on the output of TargetEncoder is the second stage.
For stacking and calibration we use a meta-estimator, whereas for TargetEncoder we use a transformer pipeline and implement the cross-fitting only in the fit_transform of the first stage transformer.
Thanks, yes that is my understanding too.
Now I realise my misunderstanding is over what 'two stage estimator' means. Initially I took it to mean a single estimator that performs both stages, which none of the estimators do (CalibratedClassifierCV technically takes an estimator argument which does the first stage), whereas you meant an estimator that is designed to perform a two stage process. What about changing the wording to something like:
"to fit a two stage estimator chain" or "to fit an estimator in a two stage process/model" ??
(Just to clarify, to prevent the uninitiated, like me, from misunderstanding?)
ping @ogrisel , would either suggestion be suitable?
```
# We evaluate the model that did not use :term:`cross fitting` when encoding and
# see that it overfits:
print(
    "Model without CV on training set: ",
```
Maybe even here:
```diff
-    "Model without CV on training set: ",
+    "Model trained without cross-fitting: ",
```
and a similar change in the other print statements.
```
# see that it overfits:
print(
    "Model without CV on training set: ",
    model_no_cv.score(X_train_no_cv_encoding, y_train),
```
Not sure if it would help to also rename the variables (e.g. model_without_crossfitting / X_train_no_cf_encoding).
As you wish.
doc/glossary.rst
Outdated
```
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
```
```diff
-See :ref:`User Guide <_cross_validation>` for more details.
+See the :ref:`user guide <cross_validation>` for more details.
```
```
    encode="ordinal",
    strategy="uniform",
    random_state=rng,
    subsample=None,
```
Why disable subsampling here?
Sorry, should have explained. This was just to avoid the FutureWarning:
```
/home/circleci/project/sklearn/preprocessing/_discretization.py:239: FutureWarning:
In version 1.5 onwards, subsample=200_000 will be used by default. Set subsample explicitly to silence this warning in the mean time. Set subsample=None to disable subsampling explicitly.
```
I was not sure of the original intent. I think the current default (what is currently used in the example) is None, so I have used this, but happy to change to the new default 200_000?
ArturoAmorQ
left a comment
Just a couple of comments. Otherwise LGTM.
doc/glossary.rst
Outdated
```
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
```
The CI is failing due to the first underscore.
```diff
-See :ref:`User Guide <_cross_validation>` for more details.
+See the :ref:`User Guide <cross_validation>` for more details.
```
doc/glossary.rst
Outdated
```
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
```
```diff
-each iteration are combined and used in a downstream prcess, generally
+each iteration are combined and used in a downstream process, generally
```
```
    model_no_cv.coef_, index=model_no_cv.feature_names_in_
).sort_values()
_ = coefs_no_cv.plot(kind="barh")
ax = coefs_cv.plot(kind="barh")
```
```diff
-ax = coefs_cv.plot(kind="barh")
+ax = coefs_no_cv.plot(kind="barh")
```
thomasjpfan
left a comment
Although I have not seen it in ML literature, I like the introduction of the term "cross fitting". LGTM
I think the comments from @ogrisel are addressed. Specifically, the updated wording for "cross-fitting" addresses the open comment in #26677 (comment). With that I am going to merge; I think the rewording in this PR is really useful and we can iterate from here.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Fixes some typos, changes some formatting and wording and adds titles and axis labels to all graphs.
Any other comments?
cc @thomasjpfan