Updated notes in documentation regarding macro-F1 in _classification.py #19589
Conversation
Added comment to resolve ambiguities of the macro F1 score, which is not a harmonic mean.
Interesting. I had never thought that combining macro P and R was the sensible thing to do: it doesn't tell you about the distribution of anything meaningful. While it might be useful to acknowledge this statistical intuition, I am a little hesitant about citing empirical results that have not been peer reviewed, since I would not want scikit-learn to imply any claim that your empirical framing of the question is sufficient and correct. I might prefer a simpler wording like "Note that macro F1 is not the harmonic mean of macro recall and precision, although some publications define it thus."
Yes, that is a more sensible wording. Thanks. Also feel free to edit my wording. It might also make sense to add a note somewhere in the user guide instead. In general, I think such a note would be valuable because it would address a lot of confusion across the internet about the two versions by providing a brief theoretical view of why one metric is clearly preferable.
I totally understand this. But I can say more on it, since I am one of the authors. The paper is a brief note and doesn't contain any sort of empirical results/framing, which is also why it is not intended to be submitted anywhere (there is a little visualization experiment which serves to outline some findings of the theoretical analysis). It just contains a mathematical analysis, mainly of the delta between the two metrics in question, which proves, e.g., that the alternatively used version of macro F1 can lead to very misleading (high) evaluation scores and is always >= the version that is, e.g., used in sklearn. If you or anyone else has any questions on the proofs, I am happy to help!
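(The ">=" claim follows from the concavity of the harmonic mean via Jensen's inequality. As a quick illustrative check, not part of the paper, one can compare the two formulas on randomly drawn per-class precision/recall values:)

```python
import random

def hmean(p, r):
    """Harmonic mean of precision and recall, i.e. the F1 score."""
    return 2 * p * r / (p + r)

random.seed(0)
for _ in range(1000):
    k = random.randint(2, 6)  # number of classes
    # Random per-class precision/recall, bounded away from 0.
    precisions = [random.uniform(0.05, 1.0) for _ in range(k)]
    recalls = [random.uniform(0.05, 1.0) for _ in range(k)]

    # sklearn-style macro F1: mean of per-class F1 scores.
    mean_of_f1 = sum(hmean(p, r) for p, r in zip(precisions, recalls)) / k
    # Alternative macro F1: harmonic mean of macro P and macro R.
    f1_of_means = hmean(sum(precisions) / k, sum(recalls) / k)

    # The alternative version is never smaller (concavity of hmean).
    assert f1_of_means >= mean_of_f1 - 1e-12
```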
Yes, an aside in the user guide could go a bit further to explain the dispute. Thanks for the summary of the paper's argument. I had not got to looking at it.
Thanks, if you choose to look into it, I think you’ll have no problems following the theoretical argument in the paper. Maybe in “defense” of the other macro F1 version [5], one could say that the “outer” function is indeed a harmonic mean, which may matter for the sole reason that the function name itself suggests it. I.e., if I understand you correctly, this is what you mean by “statistical intuition”. A user who is not (so) familiar with the sklearn documentation could assume that calling f1_score(..., average="macro") returns a harmonic mean as in [5], in the sense that some specified “inner” function (the averaging parameter) is a (precision-recall) macro average and the “outer” function is a true harmonic mean, as the function name (f1_score) itself could suggest: outer(inner(x)). Whereas what sklearn really does here is macro(f1_score), i.e. inner(outer(x)). Yes, I agree with you; an aside note, maybe similar to the notes on balanced accuracy, could also help to explain the dispute and also why it's good that sklearn uses this version over the other. For now I will update this pull request with a commit that includes your suggested wording, which is much better than mine.
@glemaitre any thoughts on adding an explanation in the user guide to clarify which formula scikit-learn uses?
@lucyleeow yes we could always improve the user guide documentation
Whoops, sorry, errant close. I realise I can't reopen it as the OP has deleted their account. Will work on continuing this PR.
Don't ping GitHub's ghost user, you'll be haunted forever :)
Added a comment to resolve ambiguities of the macro F1 score, where people are confused by two formulas. This issue has also popped up repeatedly on Stack Overflow etc., e.g. [1][2][3][4]. I added a pointer to our paper, which mathematically analyses the two formulas and shows that the version implemented in scikit-learn may be the preferable one.
[1] https://stats.stackexchange.com/questions/465157/f1-score-macro-average?noredirect=1&lq=1
[2] https://stackoverflow.com/questions/66392243/why-macro-f1-measure-cant-be-calculated-from-macro-precision-and-recall
[3] https://stats.stackexchange.com/questions/471770/multi-class-evaluation-found-different-macro-f1-scores-which-one-to-use
[4] https://towardsdatascience.com/a-tale-of-two-macro-f1s-8811ddcf8f04
Reference Issues/PRs
What does this implement/fix? Explain your changes.
A highly cited paper (3000+ citations) from 2009 [5] and other papers define the macro F1 score as the harmonic mean of average precision and average recall. Other papers define macro F1 differently (like sklearn!), as the average of the class-wise harmonic means of precision and recall.
[5] Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427–437.
I added a reference that provides some analysis of the two versions and indicates that the sklearn version may be preferable. I think this may prevent future confusion.
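To illustrate the difference, the two definitions can disagree on the same predictions. A minimal sketch using scikit-learn, with toy labels made up purely for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical imbalanced 3-class example, chosen only to show
# that the two macro-F1 definitions give different numbers.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 2, 0, 1]

# Definition used by scikit-learn: average of the per-class F1 scores.
macro_f1_sklearn = f1_score(y_true, y_pred, average="macro")

# Alternative definition from [5]: harmonic mean of macro precision
# and macro recall.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
macro_f1_alt = 2 * p * r / (p + r)

# The alternative version is larger here (and never smaller, per the paper).
print(f"sklearn macro F1:     {macro_f1_sklearn:.3f}")
print(f"alternative macro F1: {macro_f1_alt:.3f}")
```

On this example the two scores differ, with the alternative formula giving the higher (potentially misleading) value.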
Any other comments?