DOC Rework Decision boundary of semi-supervised methods example #32024
StefanieSenger merged 13 commits into scikit-learn:main
Conversation
StefanieSenger
left a comment
Thank you for this awesome enhancement of the example, @ArturoAmorQ! 💟
I really learned a lot while going through it, and especially the section on "predict_proba in LabelSpreading" clicked much better with me than what we have in the user guide.
Maybe in the last paragraph, where LabelSpreading and SelfTrainingClassifier are compared, we could also mention that in LabelSpreading, predictions (including predict_proba) depend on the training set (which keeps being stored), whereas with SelfTrainingClassifier (and SVC as a base estimator) the model learns a decision rule that exists independently of the training data after fitting (and is thus more abstract / better generalisable?).
Apart from that I have only commented on some nits.
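A minimal sketch of the contrast above (not code from the PR; the 50% unlabeled split and the hyperparameters are my own illustrative choices): both estimators accept partially labeled data with -1 marking unlabeled samples, and both expose predict_proba.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.5] = -1  # -1 marks unlabeled samples

# LabelSpreading keeps the training set around: predictions are based on
# similarity to the stored (labeled and unlabeled) training points.
ls = LabelSpreading().fit(X, y_partial)

# SelfTrainingClassifier delegates to its fitted base estimator, which
# holds a decision rule independent of the training data after fitting.
st = SelfTrainingClassifier(SVC(probability=True)).fit(X, y_partial)

print(ls.predict_proba(X[:2]).shape)
print(st.predict_proba(X[:2]).shape)
```

Both calls return one probability row per sample over the three iris classes; the difference is only in where those probabilities come from.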
examples/semi_supervised/plot_semi_supervised_versus_svm_iris.py (outdated, resolved review comments)
```python
plt.title(title)

plt.suptitle("Unlabeled points are colored white", y=0.1)
rbf_svc = (base_classifier.fit(X, y), y, "SVC with rbf kernel (100% labeled data)")
```
Or maybe:

```diff
- rbf_svc = (base_classifier.fit(X, y), y, "SVC with rbf kernel (100% labeled data)")
+ rbf_svc = (base_classifier.fit(X, y), y, "Self-training with 100% labeled data (equivalent to SVC with rbf kernel)")
```
It's the other way around, right?
"SVC with rbf kernel and 100% labeled data (equivalent to self-training with no unlabeled points left)"
Both are correct. In terms of guiding the user's attention, I would mention self-training first, because this way it's easier to see that we're not breaking the pattern in the 3x2 table.
If you choose to stay with your suggestion: maybe don't mention the 100% labeled data for the SVC, because it cannot use less than that, and it's clearer to mention the 100% with the self-training only.
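The equivalence being discussed can be checked directly (a hedged sketch, not code from the example; the dataset and hyperparameters are illustrative): with 100% labeled data, SelfTrainingClassifier has no unlabeled points to pseudo-label, so it reduces to a single fit of its base SVC.

```python
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # fully labeled: no -1 entries

svc = SVC(kernel="rbf", probability=True).fit(X, y)
self_training = SelfTrainingClassifier(SVC(kernel="rbf", probability=True)).fit(X, y)

# With nothing left to pseudo-label, self-training performs one fit of
# the base SVC on the same data, so the hard predictions coincide.
assert (svc.predict(X) == self_training.predict(X)).all()
```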
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
virchan
left a comment
Thanks for the PR, @ArturoAmorQ!
I have a few minor suggestions about using the :class: role more consistently throughout the example.
Otherwise, LGTM and this is ready to merge!
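For reference, consistent use of the :class: role in the example's rST comments might look like this (an illustrative snippet, not taken from the actual diff):

```rst
We compare :class:`~sklearn.svm.SVC` against
:class:`~sklearn.semi_supervised.SelfTrainingClassifier` and
:class:`~sklearn.semi_supervised.LabelSpreading`.
```

The `~` prefix renders only the class name (e.g. "LabelSpreading") while still linking to the full API page.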
Co-authored-by: Virgil Chan <virchan.math@gmail.com>
@StefanieSenger, could we merge this?
Reference Issues/PRs
See also #31625.
What does this implement/fix? Explain your changes.
Related to the series of examples I am reworking, this PR:
- uses DecisionBoundaryDisplay instead of hard-coding the decision boundary;
- uses predict_proba instead of hard predictions;
- shows how predict_proba works for both methods.

Any other comments?
I am aware that removing this example was suggested in #31499 (comment), but I think it can still provide value in terms of visualizing probabilities, which is something that cannot be done in Semi-supervised Classification on a Text Dataset, as argued in said discussion.