Updated notes in documentation regarding macro-F1 in _classification.py #19589
Conversation
Added comment to resolve ambiguities of the macro F1 score, which is not a harmonic mean.
Interesting. I had never thought that combining macro P and R was the sensible thing to do: it doesn't tell you about the distribution of anything meaningful. While it might be useful to acknowledge this statistical intuition, I am a little hesitant about citing empirical results that have not been peer reviewed, since I would not want scikit-learn to imply any claim that your empirical framing of the question is sufficient and correct. I might prefer a simpler wording like "Note that macro F1 is not the harmonic mean of macro recall and precision, although some publications define it thus."
Yes, that is a more sensible wording. Thanks. Also feel free to edit my wording. It might also make sense to add a note somewhere in the user guide instead. In general, I think such a note would be valuable because it would address a lot of confusion across the internet about the two versions by providing a brief theoretical view of why one metric is clearly preferable.
I totally understand this. But I can say more on it, since I am one of the authors. The paper is a brief note and doesn't contain any sort of empirical results/framing, which is also why it is not intended to be submitted anywhere (there is a little visualization experiment which serves to outline some findings of the theoretical analysis). It just contains a mathematical analysis, mainly of the delta between the two metrics in question, which proves, e.g., that the alternatively used version of macro F1 can lead to very misleading (high) evaluation scores and is always >= the version that is, e.g., used in sklearn. If you or anyone else has any questions on the proofs, I am happy to help!
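(The ">=" claim follows from the concavity of the harmonic mean via Jensen's inequality. As a quick illustrative check, not part of the paper, one can compare the two formulas on randomly drawn per-class precision/recall values:)

```python
import random

def hmean(p, r):
    """Harmonic mean of precision and recall, i.e. the F1 score."""
    return 2 * p * r / (p + r)

random.seed(0)
for _ in range(1000):
    k = random.randint(2, 6)  # number of classes
    # Random per-class precision/recall, bounded away from 0.
    precisions = [random.uniform(0.05, 1.0) for _ in range(k)]
    recalls = [random.uniform(0.05, 1.0) for _ in range(k)]

    # sklearn-style macro F1: mean of per-class F1 scores.
    mean_of_f1 = sum(hmean(p, r) for p, r in zip(precisions, recalls)) / k
    # Alternative macro F1: harmonic mean of macro P and macro R.
    f1_of_means = hmean(sum(precisions) / k, sum(recalls) / k)

    # The alternative version is never smaller (concavity of hmean).
    assert f1_of_means >= mean_of_f1 - 1e-12
```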
Yes, an aside in the user guide could go a bit further to explain the dispute. Thanks for the summary of the paper's argument. I had not got to looking at it.
Thanks, if you choose to look into it, I think you’ll have no problems following the theoretical argument in the paper. Maybe in “defense” of the other macro F1 version [5], one could say that the “outer” function is indeed a harmonic mean, which may matter for the sole reason that the function name itself suggests it. I.e., if I understand you correctly, this is what you mean by “statistical intuition”. A user who is not (so) familiar with the sklearn documentation could assume that calling f1_score(..., average="macro") returns a harmonic mean as in [5], in the sense that some specified “inner” function (the averaging parameter) is a (precision-recall) macro average and the “outer” function is a true harmonic mean, as the function name (f1_score) itself could suggest: outer(inner(x)). Whereas what sklearn really does here is macro(f1_score), i.e. inner(outer(x)). Yes, I agree with you; an aside note, maybe similar to the notes on balanced accuracy, could also help to explain the dispute and also why it's good that sklearn uses this version over the other. For now I will update this pull request with a commit that includes your suggested wording, which is much better than mine.
@glemaitre any thoughts on adding an explanation in the user guide to clarify which formula scikit-learn uses?
@lucyleeow yes we could always improve the user guide documentation
Whoops, sorry, errant close. I realise I can't reopen it as the OP has deleted their account. Will work on continuing this PR.
Don't ping GitHub's ghost user, you'll be haunted forever :)
Added a comment to resolve ambiguities of the macro F1 score, where people are confused by two formulas. This issue has also popped up repeatedly on Stack Overflow etc., e.g. [1][2][3][4]. I added a pointer to our paper, which mathematically analyses the two formulas and shows that the version implemented in scikit-learn may be the preferable one.
[1] https://stats.stackexchange.com/questions/465157/f1-score-macro-average?noredirect=1&lq=1
[2] https://stackoverflow.com/questions/66392243/why-macro-f1-measure-cant-be-calculated-from-macro-precision-and-recall
[3] https://stats.stackexchange.com/questions/471770/multi-class-evaluation-found-different-macro-f1-scores-which-one-to-use
[4] https://towardsdatascience.com/a-tale-of-two-macro-f1s-8811ddcf8f04
Reference Issues/PRs
What does this implement/fix? Explain your changes.
A highly cited paper (3000+ citations) from 2009 [5] and other papers define the macro F1 score as the harmonic mean of average precision and average recall. Other papers define macro F1 differently (like sklearn!), as the average of the class-wise harmonic means of precision and recall.
[5] Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427–437.
I added a reference that provides some analysis of the two versions and indicates that the sklearn version may be preferable. I think this may prevent future confusion.
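To illustrate the difference, the two definitions can disagree on the same predictions. A minimal sketch using scikit-learn, with toy labels made up purely for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical imbalanced 3-class example, chosen only to show
# that the two macro-F1 definitions give different numbers.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 2, 0, 1]

# Definition used by scikit-learn: average of the per-class F1 scores.
macro_f1_sklearn = f1_score(y_true, y_pred, average="macro")

# Alternative definition from [5]: harmonic mean of macro precision
# and macro recall.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
macro_f1_alt = 2 * p * r / (p + r)

# The alternative version is larger here (and never smaller, per the paper).
print(f"sklearn macro F1:     {macro_f1_sklearn:.3f}")
print(f"alternative macro F1: {macro_f1_alt:.3f}")
```

On this example the two scores differ, with the alternative formula giving the higher (potentially misleading) value.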
Any other comments?