Classification metrics overhaul: stat scores (3/n) #4839
Merged
SkafteNicki merged 175 commits into Lightning-AI:release/1.2-dev from tadejsv:cls_metrics_stat_scores on Dec 30, 2020
Conversation
Hello @tadejsv! Thanks for updating this PR.
Comment last updated at 2020-12-30 18:58:06 UTC
Collaborator
@tadejsv @justusschock @SkafteNicki how is it going here? :]
Contributor
Author
@Borda @SkafteNicki @justusschock @teddykoker @rohitgr7 This is ready for (re)review :)
Borda
previously approved these changes
Dec 29, 2020
rohitgr7
approved these changes
Dec 29, 2020
SkafteNicki
approved these changes
Dec 30, 2020
Collaborator
SkafteNicki
left a comment
Great job as always :]
Borda
approved these changes
Dec 30, 2020
This PR is a spin-off from #4835, based on the new input formatting from #4837. It will provide a basis for future PRs for the recall, precision, fbeta, and iou metrics.
What does this PR do?
top_k parameter for input formatting now also works with multi-label inputs
This was done so that StatScores can also provide a basis for Recall@K and Precision@K later, because these two metrics always take multi-label inputs and count the top K highest-probability predictions as True. For multi-class inputs this parameter works as before.
This addition was made in the input formatting function. This means that multi-label inputs can now be binarized in two ways: through the threshold parameter, or through the top_k parameter. I have decided to give the top_k parameter preference if both are set (see the sketch below).
For top-k Accuracy, multi-label inputs don't make sense (or at least I have not seen them used), so I have updated the Accuracy metric to raise an error if top_k is used with multi-label inputs.
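For illustration, here is a minimal plain-PyTorch sketch (not the PR's implementation) of the two binarization routes for multi-label probability predictions; the function binarize_multilabel and its signature are hypothetical:

```python
import torch

def binarize_multilabel(preds: torch.Tensor, threshold: float = 0.5, top_k: int = None) -> torch.Tensor:
    # Illustrative only: turn per-label probabilities of shape (N, num_labels) into 0/1 predictions.
    # If top_k is set, it takes preference over threshold, mirroring the behaviour described above.
    if top_k is not None:
        binarized = torch.zeros_like(preds, dtype=torch.int)
        topk_idx = preds.topk(top_k, dim=1).indices
        binarized.scatter_(1, topk_idx, 1)  # mark the K highest-probability labels per sample
        return binarized
    return (preds >= threshold).int()

preds = torch.tensor([[0.9, 0.6, 0.2], [0.3, 0.4, 0.8]])
print(binarize_multilabel(preds, threshold=0.5))  # tensor([[1, 1, 0], [0, 0, 1]], dtype=torch.int32)
print(binarize_multilabel(preds, top_k=1))        # tensor([[1, 0, 0], [0, 0, 1]], dtype=torch.int32)
```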
New StatScores metric (and updated functional counterpart)
Computes the stat scores, i.e. true positives, false positives, true negatives, and false negatives. It is used as a base for many other metrics (recall, precision, fbeta, iou). It is made to work with all types of inputs and is very configurable. There are two main parameters here:
reduce: determines how the statistics are counted: globally (summing across all labels), by class, or by sample. The possible values (micro, macro, samples) correspond to the averaging names used for metrics such as precision. This is "inspired" by sklearn's averaging argument in such metrics.
mdmc_reduce: in the case of multi-dimensional multi-class (mdmc) inputs, how should the statistics be reduced? This applies on top of the reduce argument. The possible values are global (the extra dimensions are treated as sample dimensions) and samplewise (compute statistics for each sample, taking the extra dimensions as a sample-within-sample dimension).
Why? The reason for these two options (right now PL metrics implements the global option by default) is that for some "downstream" metrics, such as iou, it is, in my opinion, much more natural to compute the metric per sample and then average across samples, rather than join everything into one "blob" and compute the averages for that blob. For example, if you are doing image segmentation, it makes more sense to compute the metrics per image, as the model is trained on images, not on blobs :) Also, aggregating everything may disguise some unwanted behavior (such as an inability to predict a minority class) that would be evident if averaging were done per sample (samplewise); see the sketch below.
Also, this class metric (and its functional equivalent) now returns the stat scores concatenated in a single tensor, instead of returning a tuple. I did this because the standard metrics testing framework in PL does not support non-tensor returns, and the change should be minor for users.
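To make the difference between the global and samplewise reductions concrete, here is a small self-contained sketch in plain PyTorch (again, not the code from this PR; stat_scores_sketch is a hypothetical helper) that counts tp/fp/tn/fn for binary predictions either over everything at once or per sample:

```python
import torch

def stat_scores_sketch(preds: torch.Tensor, target: torch.Tensor, samplewise: bool = False) -> torch.Tensor:
    # Illustrative only: count [tp, fp, tn, fn] for 0/1 preds and target of shape (N, ...).
    # samplewise=False mimics the 'global' reduction (one blob); samplewise=True counts per sample.
    tp = (preds == 1) & (target == 1)
    fp = (preds == 1) & (target == 0)
    tn = (preds == 0) & (target == 0)
    fn = (preds == 0) & (target == 1)
    if samplewise:
        dims = tuple(range(1, preds.ndim))
        return torch.stack([t.sum(dim=dims) for t in (tp, fp, tn, fn)], dim=-1)  # shape (N, 4)
    return torch.stack([t.sum() for t in (tp, fp, tn, fn)])  # shape (4,)

preds  = torch.tensor([[1, 0, 1], [0, 0, 1]])
target = torch.tensor([[1, 1, 0], [0, 0, 1]])
print(stat_scores_sketch(preds, target))                   # tensor([2, 1, 2, 1])
print(stat_scores_sketch(preds, target, samplewise=True))  # tensor([[1, 1, 0, 1], [1, 0, 2, 0]])
```

Note how the per-sample view shows that the second sample has no false positives or false negatives at all, information the global blob hides.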
I have deprecated the stat_scores_multiple_classes metric, as stat_scores is now perfectly capable of handling multiple classes itself.
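As a hedged migration sketch, assuming the new class-based metric is exposed as pytorch_lightning.metrics.StatScores and accepts the reduce and num_classes arguments described above (the exact import path and defaults may differ), usage could look roughly like this:

```python
import torch
# Assumed import path for the metric added in this PR; check the released API before relying on it.
from pytorch_lightning.metrics import StatScores

preds  = torch.tensor([1, 0, 2, 1])
target = torch.tensor([1, 1, 2, 0])

# Per-class statistics accumulated over batches (reduce='macro'),
# replacing the deprecated stat_scores_multiple_classes functional.
metric = StatScores(reduce='macro', num_classes=3)
metric(preds, target)    # update with one batch
print(metric.compute())  # all stat scores concatenated in a single tensor, one row per class
```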
Documentation
The second part of the "Input types" section, with examples of the use of the is_multiclass parameter with StatScores, has been added.