
Updated notes in documentation regarding macro-F1 in _classification.py #19589

Closed
ghost wants to merge 5 commits into main from unknown repository

Conversation

@ghost

@ghost ghost commented Mar 1, 2021

Added a comment to resolve ambiguities around the macro F1 score, where people are confused by two formulas. This issue has also popped up repeatedly on Stack Overflow etc., e.g. [1][2][3][4]. I added a pointer to our paper, which mathematically analyses the two formulas and shows that the version implemented in scikit-learn may be the preferable one.

[1] https://stats.stackexchange.com/questions/465157/f1-score-macro-average?noredirect=1&lq=1
[2] https://stackoverflow.com/questions/66392243/why-macro-f1-measure-cant-be-calculated-from-macro-precision-and-recall
[3] https://stats.stackexchange.com/questions/471770/multi-class-evaluation-found-different-macro-f1-scores-which-one-to-use
[4] https://towardsdatascience.com/a-tale-of-two-macro-f1s-8811ddcf8f04

Reference Issues/PRs

What does this implement/fix? Explain your changes.

A highly cited paper (3000+ citations) from 2009 [5], along with other papers, defines the macro F1 score as the harmonic mean of the average precision and average recall. Other papers (like sklearn!) define macro F1 differently, as the average of the class-wise harmonic means of precision and recall.

[5] Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.

I added a reference that provides some analysis of the two versions and indicates that the sklearn version may be preferable. I think this may prevent future confusion.
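The disagreement between the two definitions can be shown directly with scikit-learn. This is a minimal sketch; the toy label arrays are invented for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy multi-class labels, chosen only to make the two definitions disagree
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 2, 1, 1, 0]

# scikit-learn's macro F1: average of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")

# alternative definition [5]: harmonic mean of macro precision and macro recall
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
harmonic = 2 * p * r / (p + r)

print(macro_f1)  # 0.4
print(harmonic)  # ~0.4148 — larger, as the analysis predicts
```

On this example the alternative version exceeds scikit-learn's macro F1 (0.4148 vs 0.4), consistent with the claim that the harmonic-mean-of-averages version is always greater than or equal to the average-of-harmonic-means version.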

Any other comments?

Added a comment to resolve ambiguities of the macro F1 score, which is not a harmonic mean.
@ghost ghost changed the title Update _classification.py Updated documentation regarding macro-F1 of _classification.py Mar 1, 2021
@ghost ghost changed the title Updated documentation regarding macro-F1 of _classification.py Updated documentation regarding macro-F1 in _classification.py Mar 1, 2021
@ghost ghost changed the title Updated documentation regarding macro-F1 in _classification.py Updated notes in documentation regarding macro-F1 in _classification.py Mar 1, 2021
opitz and others added 2 commits March 1, 2021 22:58
@jnothman
Member

jnothman commented Mar 3, 2021

Interesting. I had never thought that combining macro P and R was the sensible thing to do: it doesn't tell you about the distribution of anything meaningful. While it might be useful to acknowledge this statistical intuition, I am a little hesitant about citing empirical results that have not been peer reviewed, since I would not want scikit-learn to imply any claim that your empirical framing of the question is sufficient and correct.

I might prefer a simpler wording like "Note that macro F1 is not the harmonic mean of macro recall and precision, although some publications define it thus."

@ghost
Author

ghost commented Mar 4, 2021

"Note that macro F1 is not the harmonic mean of macro recall and precision, although some publications define it thus."

Yes, that is a more sensible wording. Thanks. Also feel free to edit my wording. It might also be worth adding a note somewhere in the user guide instead. In general, I think such a note would be valuable because it would address a lot of the confusion across the internet about the two versions by providing a brief theoretical view of why one metric is clearly preferable.

I am a little hesitant about citing empirical results that have not been peer reviewed, since I would not want scikit-learn to imply any claim that your empirical framing of the question is sufficient and correct.

I can totally understand this. But I can say more on this, since I am one of the authors. The paper is a brief note and doesn't contain any sort of empirical results/framing, which is also why it is not intended to be submitted anywhere (there is a little visualization experiment that serves to outline some findings of the theoretical analysis). It just contains a mathematical analysis, mainly of the delta between the two metrics in question, that proves, e.g., that the alternative version of macro F1 can lead to very misleading (high) evaluation scores and is always >= the version that is used in, e.g., sklearn.

If you or anyone else has any questions on the proofs, I am happy to help!

@jnothman
Member

jnothman commented Mar 4, 2021 via email

@ghost
Author

ghost commented Mar 4, 2021

Thanks; if you choose to look into it, I think you'll have no problems following the theoretical argument in the paper.

Maybe in "defense" of the other macro F1 version [5], one could say that its "outer" function is indeed a harmonic mean, which at least matches what the function name itself suggests. If I understand you correctly, this is what you mean by "statistical intuition". A user who is not (so) familiar with the sklearn documentation could assume that calling f1_score(..., average="macro") returns a harmonic mean as in [5], in the sense that the "inner" function (selected by the parameter) is a (precision/recall) macro average and the "outer" is a true harmonic mean, as the function name (f1_score) itself could suggest: outer(inner(x)). Although what sklearn really does here is macro(f1_score), i.e. inner(outer(x)).
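The composition difference can be sketched in plain Python; the per-class precision/recall values below are invented purely for illustration:

```python
import numpy as np

def f1(p, r):
    """Harmonic mean of a precision/recall pair (the "outer" function)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# hypothetical per-class precision and recall values
precisions = np.array([0.5, 2 / 3, 0.9])
recalls = np.array([1 / 3, 1.0, 0.3])

# scikit-learn's macro F1, inner(outer(x)): average the per-class F1 scores
sklearn_style = np.mean([f1(p, r) for p, r in zip(precisions, recalls)])

# alternative [5], outer(inner(x)): F1 of the macro-averaged precision/recall
alternative = f1(precisions.mean(), recalls.mean())

print(sklearn_style)  # 0.55
print(alternative)    # ~0.608 — the alternative is again the larger one
```

Swapping the order of averaging and harmonic mean changes the result, and on this example the outer(inner(x)) version is again the larger one.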

Yes, I agree with you; an aside note, maybe similar to the notes for balanced accuracy, could also help to explain the dispute and why it's good that sklearn uses this version over the other.

For now I will update this pull request with a commit that includes your suggested wording, which is much better than mine.

hamlet-father-ghost added 2 commits March 4, 2021 16:48
@lucyleeow
Member

@glemaitre any thoughts on adding an explanation in the user guide to clarify which formula scikit-learn uses?

@glemaitre
Member

@lucyleeow Yes, we could always improve the user guide documentation.

@lucyleeow
Member

@ghost

@lucyleeow lucyleeow closed this Feb 2, 2024
@lucyleeow
Member

lucyleeow commented Feb 2, 2024

Whoops, sorry, errant close. I realise I can't reopen as the OP has deleted their account. Will work on continuing this PR.

@glemaitre
Member

Don't ping GitHub ghost, you'll get haunted forever :)

