
DOC Rework Decision boundary of semi-supervised methods example #32024

Merged
StefanieSenger merged 13 commits into scikit-learn:main from ArturoAmorQ:rework_semi_supervised on Oct 5, 2025

Conversation

Member

@ArturoAmorQ ArturoAmorQ commented Aug 27, 2025

Reference Issues/PRs

See also #31625.

What does this implement/fix? Explain your changes.

Related to the series of examples I am reworking, this PR:

  • General clean-up (some inline comments no longer held true);
  • Implements notebook-tutorial style;
  • Uses DecisionBoundaryDisplay instead of hard-coding the decision boundary;
  • Plots predict_proba instead of hard predictions;
  • Changes the proportions of labeled data to better demonstrate that a few labeled points suffice;
  • Adds interpretation to those plots;
  • Adds section to explain how predict_proba works for both methods.

Any other comments?

I am aware that removing this example was suggested in #31499 (comment), but I think it can still provide value in terms of visualizing probabilities, which is something that cannot be done in Semi-supervised Classification on a Text Dataset, as argued in said discussion.


github-actions bot commented Aug 27, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5dc9216. Link to the linter CI: here

Member

@StefanieSenger StefanieSenger left a comment


Thank you for this awesome enhancement of the example, @ArturoAmorQ! 💟

I really learned a lot while going through it, and especially the section on "predict_proba in LabelSpreading" clicked much better with me than what we have in the user guide.

Maybe in the last paragraph, where LabelSpreading and SelfTrainingClassifier are compared, we could also mention that in LabelSpreading, predictions (including predict_proba) depend on the training set (which keeps being stored), whereas with SelfTrainingClassifier (and SVC as a base estimator) the model learns a decision rule that exists independently of the training data after fitting (and is thus more abstract / generalizes better?).
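The transductive-versus-inductive distinction raised here can be illustrated with a small sketch (synthetic data, my own settings, not code from the PR): LabelSpreading keeps the training points and predicts new samples via the kernel against that stored set, while SelfTrainingClassifier fits an SVC whose decision rule stands on its own after fitting.

```python
# Hypothetical illustration of the comparison: LabelSpreading is
# transductive, SelfTrainingClassifier (with an SVC base) is inductive.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Hide most labels: -1 marks "unlabeled" for the semi-supervised API.
labeled = np.zeros(len(y), dtype=bool)
labeled[np.where(y == 0)[0][:15]] = True
labeled[np.where(y == 1)[0][:15]] = True
y_semi = np.where(labeled, y, -1)

# Transductive: predictions for new points are computed from the kernel
# against the training set that the fitted model keeps around.
ls = LabelSpreading(kernel="rbf", gamma=20).fit(X, y_semi)

# Inductive: self-training fits an SVC (probability=True so the loop can
# threshold predict_proba); afterwards the SVC's rule is self-contained.
st = SelfTrainingClassifier(SVC(probability=True, random_state=0)).fit(X, y_semi)

X_new = np.array([[0.0, 0.5], [2.0, -0.5]])
print("LabelSpreading:", ls.predict(X_new))
print("Self-training: ", st.predict(X_new))
```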

Apart from that I have only commented on some nits.

plt.title(title)

plt.suptitle("Unlabeled points are colored white", y=0.1)
rbf_svc = (base_classifier.fit(X, y), y, "SVC with rbf kernel (100% labeled data)")
Member


Or maybe:

Suggested change
rbf_svc = (base_classifier.fit(X, y), y, "SVC with rbf kernel (100% labeled data)")
rbf_svc = (base_classifier.fit(X, y), y, "Self-training with 100% labeled data (equivalent to SVC with rbf kernel)")

Member Author


It's the other way around, right?
"SVC with rbf kernel and 100% labeled data (equivalent to self-training with no unlabeled points left)"

Member


Both are correct. In terms of guiding the user's attention, I would mention self-training first, because this way it's easier to see that we're not breaking the pattern in the 3x2 table.
If you choose to stay with your suggestion: maybe don't mention the 100% labeled data for the SVC, because it cannot use less than that, and it's clearer to mention the 100% with the self-training only.

Member Author


How about the wording in 035d443?

Member


Sure, that's fine. :)

@StefanieSenger StefanieSenger added the Waiting for Second Reviewer First reviewer is done, need a second one! label Sep 15, 2025
Member

@virchan virchan left a comment


Thanks for the PR, @ArturoAmorQ!

I have a few minor suggestions about using the :class: role more consistently throughout the example.

Otherwise, LGTM and this is ready to merge!

Co-authored-by: Virgil Chan <virchan.math@gmail.com>
Member

virchan commented Oct 3, 2025

@StefanieSenger, could we merge this?

Member

StefanieSenger commented Oct 4, 2025

I went through the changes again and I think it's all fine, except I found a problem with the legend. In the rendered docs it appeared outside of the figure:

[Screenshot: legend rendered outside the figure]

I fixed it and now it shows up.

[Screenshot: legend rendered inside the figure]

I could not make a suggestion that spans several lines on the GitHub files tab (strange bug), so I just committed the change directly. If it looks fine on the CI, this can be merged.
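For readers hitting the same clipping problem, here is a generic matplotlib sketch of the usual fix (not the PR's actual change, whose details are in the commit): give the figure a constrained layout and place the legend explicitly so it is not drawn off-canvas.

```python
# Generic sketch of keeping a legend inside the rendered figure.
# The plot content is illustrative, not the example's real data.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots(layout="constrained")
ax.scatter([0, 1], [0, 1], label="labeled points")
ax.scatter([0.5], [0.5], c="white", edgecolor="k", label="unlabeled points")
# An explicit in-axes location avoids the legend being clipped; with
# layout="constrained", a legend placed outside the axes also gets room.
legend = ax.legend(loc="lower right")
fig.canvas.draw()  # render so the legend has its final position
```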

@StefanieSenger StefanieSenger merged commit 05b031c into scikit-learn:main Oct 5, 2025
36 checks passed
@ArturoAmorQ ArturoAmorQ deleted the rework_semi_supervised branch October 6, 2025 07:40

Labels

Documentation Waiting for Second Reviewer First reviewer is done, need a second one!


3 participants