DOC Improve plot_target_encoder_cross_val.py example #26677

thomasjpfan merged 15 commits into scikit-learn:main

Conversation
ArturoAmorQ
left a comment
Thanks for the PR, @lucyleeow! Here are a couple of comments.
```diff
-_ = coefs_cv.plot(kind="barh")
+ax = coefs_cv.plot(kind="barh")
+_ = ax.set(
+    title="Target encoded with cross validation",
```
Here (and later in this PR) we need to use the term cross-fitting rather than cross-validation, the difference being that no validation (i.e. computation of a score) is performed in the target encoder (see e.g. https://arxiv.org/pdf/2007.02852.pdf).
Maybe we can explain the term earlier in the example, and even add an entry to the glossary and link it from here.
Good idea. Will have a read and then add it to the glossary.
ogrisel
left a comment
Thanks for this PR. Cross-linking to a new entry in the glossary is indeed a very good idea.
doc/glossary.rst
Outdated
```
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
```
Maybe it would be useful to cross-link to typical (meta-)estimators that implement this strategy (e.g. StackingRegressor/Classifier and TargetEncoder, CalibratedClassifierCV...).
Note: the current implementation of CalibratedClassifierCV is a bit different from the others because it does not attempt to concatenate the predictions of the first stage to fit the second stage (but instead fits one second stage estimator per cross-fitting iteration and then averages the predictions of the second stage model). Still, I think those are morally all examples of a cross-fitting strategy.
Another example, not yet implemented in scikit-learn, is honest trees and forests, where the values of the leaves are estimated on different training samples than those used to learn the decision splits of the trees.
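The generic cross-fitting pattern these (meta-)estimators share can be sketched by hand with `cross_val_predict` (a simplified illustration, not how any particular estimator is implemented): the second stage is fit on out-of-fold first-stage predictions, so no training sample's first-stage prediction comes from a model that saw it.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

first_stage = LinearRegression()
# Out-of-fold predictions: each sample is predicted by a model
# fit on folds that did not contain it.
oof_pred = cross_val_predict(first_stage, X, y, cv=5)

# The second stage fits on out-of-fold first-stage predictions,
# avoiding the optimistic bias of in-sample predictions.
second_stage = Ridge().fit(oof_pred.reshape(-1, 1), y)
```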
Good idea, I meant to do this but forgot!
doc/glossary.rst
Outdated
```
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
to avoid bias.
```
I don't find the "train" and "estimation" names very clear because one could argue that training and estimating are similar. Maybe it would be helpful to speak in terms of first and second stage as follows:
```diff
-A resampling method that iteratively partitions data into complementary
-'train' and 'estimation' subsets. The 'estimation' subset may be used
-to estimate a parameter or predict a target. The outputs from
-each iteration are combined and used in a downstream prcess, generally
-to avoid bias.
+A resampling method that iteratively partitions data into mutually exclusive
+subsets to fit a two stage estimator. The second stage is fit on predictions
+of the first stage computed on a subset of data not seen when training the
+first stage. The objective is to avoid having any overfitting in the first
+stage introduce a bias in the training data distribution of the second stage.
```
'mutually exclusive' is much clearer and more explicit than 'complementary', have used this in the cross validation entry too, thanks!
Is TargetEncoder technically a two stage estimator? There isn't really a second stage; it just outputs the transformed X values for use by another estimator...? I have thus tried to make it more general and re-worded it. Not entirely happy though, and open to changes. Also I may have misunderstood some aspects!
TargetEncoder is the first stage and the HistGradientBoosting model (or anything else) you train on the output of TargetEncoder is the second stage.
For stacking and calibration we use a meta-estimator, whereas for TargetEncoder we use a transformer pipeline and implement the cross-fitting only in the fit_transform of the first stage transformer.
Thanks, yes that is my understanding too.
Now I realise my misunderstanding is over what 'two stage estimator' means. Initially I took it to mean a single estimator that performs both stages, which none of the estimators do (CalibratedClassifierCV technically takes an estimator argument which does the first stage), whereas you meant an estimator that is designed to perform a two stage process. What about changing the wording to something like:
"to fit a two stage estimator chain" or "to fit an estimator in a two stage process/model" ??
(Just to clarify, to prevent the uninitiated, like me, from misunderstanding?)
ping @ogrisel , would either suggestion be suitable?
```
# We evaluate the model that did not use :term:`cross fitting` when encoding and
# see that it overfits:
print(
    "Model without CV on training set: ",
```
Maybe even here:
```diff
-    "Model without CV on training set: ",
+    "Model trained without cross-fitting: ",
```
and a similar change in the other print statements.
```
# see that it overfits:
print(
    "Model without CV on training set: ",
    model_no_cv.score(X_train_no_cv_encoding, y_train),
```
Not sure if it would help to also rename the variables (e.g. model_without_crossfitting / X_train_no_cf_encoding).
As you wish.
doc/glossary.rst
Outdated
```
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
```
```diff
-See :ref:`User Guide <_cross_validation>` for more details.
+See the :ref:`user guide <cross_validation>` for more details.
```
```
    encode="ordinal",
    strategy="uniform",
    random_state=rng,
    subsample=None,
```
Why disable subsampling here?
Sorry, should have explained. This was just to avoid the FutureWarning:
```
/home/circleci/project/sklearn/preprocessing/_discretization.py:239: FutureWarning:
In version 1.5 onwards, subsample=200_000 will be used by default. Set subsample explicitly to silence this warning in the mean time. Set subsample=None to disable subsampling explicitly.
```
I was not sure of the original intent. I think the current default (what is currently used in the example) is None, so I have used this, but happy to change to the new default 200_000?
ArturoAmorQ
left a comment
Just a couple of comments. Otherwise LGTM.
doc/glossary.rst
Outdated
```
unseen data. This conserves data as avoids the need to hold out a
'validation' dataset and accounts for variability as multiple rounds of
cross validation are genreally performed.
See :ref:`User Guide <_cross_validation>` for more details.
```
The CI is failing due to the first underscore.
```diff
-See :ref:`User Guide <_cross_validation>` for more details.
+See the :ref:`User Guide <cross_validation>` for more details.
```
doc/glossary.rst
Outdated
```
A resampling method that iteratively partitions data into complementary
'train' and 'estimation' subsets. The 'estimation' subset may be used
to estimate a parameter or predict a target. The outputs from
each iteration are combined and used in a downstream prcess, generally
```
```diff
-each iteration are combined and used in a downstream prcess, generally
+each iteration are combined and used in a downstream process, generally
```
```
    model_no_cv.coef_, index=model_no_cv.feature_names_in_
).sort_values()
_ = coefs_no_cv.plot(kind="barh")
ax = coefs_cv.plot(kind="barh")
```
```diff
-ax = coefs_cv.plot(kind="barh")
+ax = coefs_no_cv.plot(kind="barh")
```
thomasjpfan
left a comment
Although I have not seen it in ML literature, I like the introduction of the term "cross fitting". LGTM
I think the comments from @ogrisel are addressed. Specifically, the updated wording for "cross-fitting" addresses the open comment in #26677 (comment). With that I am going to merge; I think the rewording in this PR is really useful and we can iterate from here.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Fixes some typos, changes some formatting and wording and adds titles and axis labels to all graphs.
Any other comments?
cc @thomasjpfan