Skip to content

DOC improved plot_semi_supervised_newsgroups.py example#31104

Merged
ArturoAmorQ merged 8 commits intoscikit-learn:mainfrom
elhambbi:cont03
Aug 21, 2025
Merged

DOC improved plot_semi_supervised_newsgroups.py example#31104
ArturoAmorQ merged 8 commits intoscikit-learn:mainfrom
elhambbi:cont03

Conversation

@elhambbi
Copy link
Copy Markdown
Contributor

Reference Issues/PRs

Towards #30621 and with reference to PR #30882, the code for plot_semi_supervised_newsgroups.py is improved.

What does this implement/fix? Explain your changes.

  • The example and the methods used are described more
  • Old code is updated e.g. f-strings, prints etc
  • A bar plot is added to compare the F1 score of different approaches

Any other comments?

LabelSpreading, as a semi-supervised method, is not better than the fully supervised models in terms of F1 score. Should we keep it or is SelfTraining enough to demonstrate the superiority of semi-supervised methods when having limited data?

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 29, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 69d21de. Link to the linter CI: here

@elhambbi
Copy link
Copy Markdown
Contributor Author

Hi @StefanieSenger . Could you please review the modifications I've done to this example and let me know if we need to change something?
Thank you.

Copy link
Copy Markdown
Member

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your work, @elhambbi. It's going to upgrade the example a lot!

I especially like the part in the beginning that explains the intend of the example and summarises the approaches that get compared to each other.

I have a few little comments and also it would be nice to make the code more notebook style like and introduce some sub-sections.

Would you mind doing that?

Comment on lines +10 to +13
1. Supervised learning using 100% of labeled data (baseline)
- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bullet points don't get rendered as intended. Can you please fix this?
You can see the rendering by building the documentation locally or by clicking "check the rendered docs" in the CI.

ls_pipeline, X_train, y_train_semi, X_test, y_test
)

# Create the plot
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that instead of having comments like # Create the plot, # Define colors for each bar, # Customize plot, or # Add value labels on bars, it would be nice to add 1-2 sentences on top of this part explaining what will happen in this next section which would adhere more to the notebook style that we thrive for with the examples. This is a good example to look at, if you need some inspiration, @elhambbi: #26956

@StefanieSenger
Copy link
Copy Markdown
Member

Hi @elhambbi, did you have some time to look into this?

@elhambbi
Copy link
Copy Markdown
Contributor Author

elhambbi commented May 6, 2025

Hi @StefanieSenger. I apologize, I've been too busy lately. I'll work on it over the weekend

@StefanieSenger
Copy link
Copy Markdown
Member

Hi @elhambbi, no rush. I just wanted to know whether this PR is still active. Take your time.

@elhambbi
Copy link
Copy Markdown
Contributor Author

Hi @StefanieSenger. Sorry, I've been quite busy lately. I have modified the code. Please let me know if it needs further improvement. Thank you

@StefanieSenger StefanieSenger self-requested a review May 27, 2025 08:38
Copy link
Copy Markdown
Member

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @elhambbi!

I have only got one little suggestion on the notebook style and some nit picks. Overall it looks pretty good to me.

Feel free to do more of the notebook style-like changes, but from my side with having these two blocks I'm pretty happy and would then forward this to a maintainer to have a look.

@elhambbi elhambbi reopened this Jun 1, 2025
@elhambbi
Copy link
Copy Markdown
Contributor Author

elhambbi commented Jun 1, 2025

Hi @StefanieSenger . I made the changes, and the file is updated now. I closed the PR by mistake and then reopened it. The commit history is gone but the code is correct with all the changes we discussed. Thank you

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Jun 2, 2025

@elhambbi thanks for the PR. Could you please push a fix to add the missing new line as reported by the linter?

https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/67920/workflows/b45c787a-c741-4361-9085-15c44eb03ecd/jobs/306979

This would help get the Continuous Integration to proceed with building the HTML preview of example edited by the PR to make it easier to review.


You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.
1. Supervised learning using 100% of labeled data (baseline)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In wouldn't call this a baseline: if we do not have access to 100% labeled data, it will probably not be possible to reach the performance of this model. I would rather call it a "best case scenario".

- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision

2. Supervised learning using only 20% of labeled data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This on the other hand could be called a "baseline" to compare semi-supervised learning methods to.


# select a mask of 20% of the train dataset
y_mask = np.random.rand(len(y_train)) < 0.2
# Evaluate supervised model with 100% of training data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably split each call to eval_and_get_f1 in its own cell by using # %% delimiters. Then the comments such as # Evaluate supervised model with 100% of training data should be slightly rephrased as grammatically correct sentences or paragraphs instead of just short inline code comments as @StefanieSenger suggested above.

@StefanieSenger
Copy link
Copy Markdown
Member

Thanks for your work, @elhambbi! I see this really taking shape and that a core developer is now reviewing this (not just me as a team member) shows the progress of this PR. (Please: for expectation management, what I wrote in the What comes next? section of #30621 applies here: a PR like that is expected to be thoroughly reviewed and the result will be something a new contributor can be very proud of once it is finished.)

I will review more later. Just a little comment on the commit history: it has disappeared because of force-pushing, which we try to avoid at all costs. PRs are easier to review when we can see the commit history and as someone coming back to this PR regularily, I can rely on only looking at the newest changes instead of going through everything from the beginning. (Because I only have so much mental capacity and I jump a lot between very different topics and PRs.)

@elhambbi
Copy link
Copy Markdown
Contributor Author

Hi @StefanieSenger. I improved the code based on what @ogrisel suggested.
I also added the link to this example in doc/modules/semi_supervised.rst as @adrinjalali mentioned in #30882 and closed it.
Please let me know your feedback. I'll improve it further, if needed. Thank you.

Copy link
Copy Markdown
Member

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @elhambbi,

thanks for your further work! 😃

In the file, I have left a few minor comments. It would be a nice service to users, to turn some of the mentions of classes or functions into cross-references. I left a few examples on the file, please do add some more where you find them suitable.

And a few would-like-to-have's, that I think would be great, but not required:

  • Add a short explanation why we use the micro-averaged f1_score here.
  • Link this example also in the docstring of f1_score, right in the average param, where the micro-argument is explained.

print("Supervised SGDClassifier on 100% of the data:")
eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)
# %%
# 1. Evaluate a supervised SGDClassifier trained on 100% of the labeled data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# 1. Evaluate a supervised SGDClassifier trained on 100% of the labeled data.
# 1. Evaluate a supervised SGDClassifier trained on 100% of the data.

Same comment as above.

loader or setting them to `None` to get all 20 of them.
1. Supervised learning using 100% of labeled data (best-case scenario)

- Uses SGDClassifier with TF-IDF features
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Uses SGDClassifier with TF-IDF features
- Uses :class:`~sklearn.linear_model.SGDClassifier` with :class:`TF-IDF <sklearn.feature_extraction.text.TfidfTransformer>` features

Adding Sphinx cross-references to link to other parts of our API docs. The second one I am actually not sure if it renders correctly (you would have to check).

For a function, you can use the :func: role in the beginning.

Here is a link to the sphinx docs on using cross-references, if you want to know more, but in general, it is enough to search for similar examples in scikit-learn and do it alike.

@StefanieSenger
Copy link
Copy Markdown
Member

Hey @elhambbi, just a quick note to know if you are still working on this PR? Best!

Copy link
Copy Markdown
Member

@ArturoAmorQ ArturoAmorQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @elhambbi thanks for the PR and sorry for jumping into the discussion so late! This is a big improvement to the example. Here is a batch of suggestions for your consideration.

Also the current lines 90-91 can be erased as they are no longer necessary:

        # LabelSpreading does not support dense matrices
        ("toarray", FunctionTransformer(lambda x: x.toarray())),

as well as the corresponding FunctionTransformer import.

On a side comment, it is easier for reviewers to keep track of you applying the suggestions if you accept them (as a batch) using the github interface.

@ArturoAmorQ
Copy link
Copy Markdown
Member

Hi @elhambbi, if your bandwidth doesn't allow you finishing this PR, do you mind the maintainers pushing their suggestions? We are really close to a merge!

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
@elhambbi
Copy link
Copy Markdown
Contributor Author

Hi @StefanieSenger , @ArturoAmorQ . Sorry for the late reply, I was on summer break.
Thanks for your comments. I've added your suggestions.

I also placed the imports in the first cell they are called, and added an explanation and a link to the f1-score function. I'll link this example in the docstring of the f1-score function, as you @StefanieSenger suggested, in another PR.

Please let me know if any further improvement is needed. Thank you guys.

Copy link
Copy Markdown
Member

@ArturoAmorQ ArturoAmorQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @elhambbi thanks for pushing those changes. Here are a couple of extra suggestions.

Also you still have to remove lines 93-94 (as of current state of this PR), as well as the corresponding FunctionTransformer import:

        # LabelSpreading does not support dense matrices
        ("toarray", FunctionTransformer(lambda x: x.toarray())),

elhambbi and others added 2 commits August 20, 2025 21:01
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
@elhambbi
Copy link
Copy Markdown
Contributor Author

Hi @ArturoAmorQ. Thanks for the feedback. The changes are applied. Please let me know if we need any other modifications

Copy link
Copy Markdown
Member

@ArturoAmorQ ArturoAmorQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @elhambbi, merging!

@ArturoAmorQ ArturoAmorQ merged commit 6aa5a6f into scikit-learn:main Aug 21, 2025
36 checks passed
lucyleeow pushed a commit to lucyleeow/scikit-learn that referenced this pull request Aug 22, 2025
@jeremiedbb jeremiedbb mentioned this pull request Sep 3, 2025
13 tasks
@StefanieSenger
Copy link
Copy Markdown
Member

Awesome this PR got merged. Congratulations, @elhambbi! 🎉

Now you have a few merged PRs and some experience in documentation, would you like to get tagged on other doc issues and also on scoped coding tasks?

@elhambbi
Copy link
Copy Markdown
Contributor Author

elhambbi commented Sep 8, 2025

Hi @StefanieSenger. Thank you :)
Yes, of course. Please let me know if I can help with some other tasks.
Also, PR #31625 is still open, I think @ArturoAmorQ wants us to keep the example and improve it instead; let me know if I'm mistaken.

@StefanieSenger
Copy link
Copy Markdown
Member

Nice to hear that, @elhambbi. I will keep my eyes open for a possible next issue for you. Remember you don't need to take it on then, just drop a note in that case.

DeaMariaLeon pushed a commit to DeaMariaLeon/scikit-learn that referenced this pull request Sep 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants