Added gini coefficient to ranking and scorer #10084

Closed
tagomatech wants to merge 4 commits into scikit-learn:master from tagomatech:master

@tagomatech

Added a function at the end of sklearn\metrics\ranking.py to compute the Gini coefficient, which is used in some Kaggle competitions.

I added the corresponding import declaration in sklearn\metrics\__init__.py

Finally, I created a scorer à la sklearn in sklearn\metrics\scorer.py, so that the Gini coefficient can be used across sklearn validation/metrics functions, e.g. cross_val_score.
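For illustration, a comparable scorer can be sketched without modifying sklearn itself. This is a minimal sketch, not the code from this PR: it assumes the binary definition Gini = 2 * AUC - 1 discussed in this thread, `gini_scorer` is a hypothetical name, and the synthetic dataset and estimator are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score


def gini_scorer(estimator, X, y):
    """Hypothetical scorer callable: Gini assumed to be 2 * AUC - 1 (binary only)."""
    # Use the predicted probability of the positive class as the ranking score.
    y_score = estimator.predict_proba(X)[:, 1]
    return 2.0 * roc_auc_score(y, y_score) - 1.0


# Arbitrary synthetic binary problem, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Any callable with signature (estimator, X, y) can be passed as `scoring`.
gini_per_fold = cross_val_score(clf, X, y, cv=5, scoring=gini_scorer)
print(gini_per_fold)
```

Passing a plain callable avoids depending on how a given sklearn version builds scorers internally.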

The reference was taken from here, and results were checked against several entries on Kaggle and against the sklearn ROC AUC score (it is not rocket science, to be honest).
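The kind of check described above can be sketched in plain Python. This assumes the binary definition used in this PR, Gini = 2 * AUC - 1, with AUC computed as the fraction of (positive, negative) pairs ranked correctly and ties counted as one half. The function names are hypothetical and the toy data is made up for illustration.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(y_true, y_score):
    """Gini as assumed in this thread: a rescaling of AUC from [0, 1] to [-1, 1]."""
    return 2.0 * auc_pairwise(y_true, y_score) - 1.0


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_pairwise(y_true, y_score))   # 0.75 (3 of 4 pairs are ordered correctly)
print(gini_from_auc(y_true, y_score))  # 0.5
```

Under this definition a perfect ranking gives 1, a constant score gives 0, and a perfectly inverted ranking gives -1.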

Member

@jnothman jnothman left a comment

Please add this to metrics/tests/test_common.py and also add specific tests that this matches known scores on toy datasets.

return np.mean(scores)


def gini(y_true, y_score):
Member

Perhaps name this gini_score for consistency

----------
.. [1] David J. Hand and Robert J. Till (2001).
A Simple Generalisation of the Area Under the ROC Curve for
Multiple Class Classification Problems. In Machine Learning, 45,
Member

Your implementation does not currently extend to multiclass. You have merely implemented a chance-corrected binary ROC AUC.

@qinhanmin2014
Member

@tagomatech Could you please explain why we need the Gini coefficient when we already have roc_auc_score? It can almost be replaced by roc_auc_score, and it seems hard to find any reference for its definition and application in ML. I don't think the paper you provided is a good reference. It only states that the Gini index (Gini coefficient?) is equivalent to roc_auc_score, and the whole paper is based on roc_auc_score.
(Forgive me if I've got something wrong :) )

@tagomatech
Author

@qinhanmin2014
Adding this function is a small improvement, indeed. Personally, I find it useful when playing around with Kaggle competitions.
As for the sources, there is a lot of confusion among "Gini index", "Gini coefficient", and "Normalized Coefficient". The source I suggested has the virtue of being unambiguous, since it defines Gini in relation to AUC.

@qinhanmin2014
Member

@tagomatech Thanks.
I think we have reached consensus that:
(1) The metric can almost be replaced by roc_auc_score.
(2) It is difficult to find a reference for its definition and application in ML. Right?
So I might be -1 on the metric.
Also, from my perspective, Kaggle can be an application of our metrics, but it is hard for it to serve as the (main) origin of them, because in some cases its metrics are designed for special scenarios.
This is only my personal opinion, so feel free to fix the conflict, make the CI green, provide more persuasive literature, and wait for the opinion of the core devs.

@glemaitre
Member

I am -1 on merging, since the score can easily be computed from the ROC AUC.
I also think there could be some confusion between the Gini impurity used in decision trees and the Gini coefficient.

@qinhanmin2014
Member

@tagomatech Thanks a lot for your contribution. Sorry, but I'm going to close this one given the additional -1 above. I think the general consensus is that it can be replaced by roc_auc_score and that there's no clear definition.

@ogrisel
Member

ogrisel commented Oct 10, 2019

Actually, the Gini coefficient is defined in terms of the area under the Lorenz curve (for positive regression models), which is not the same as ROC AUC. I started an undocumented prototype implementation in #15176.
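To make the Lorenz-curve definition concrete, here is a minimal sketch of the textbook Gini coefficient for a sample of non-negative values, computed as 1 minus twice the area under the Lorenz curve. It is not the prototype from #15176, and the function names are hypothetical. How the curve would be built from a regression model's predictions is left out, since the thread does not specify it.

```python
def lorenz_curve(values):
    """Cumulative share of the total held by the smallest k values, for k = 0..n."""
    ordered = sorted(values)
    total = sum(ordered)
    shares = [0.0]
    running = 0.0
    for v in ordered:
        running += v
        shares.append(running / total)
    return shares


def gini_from_lorenz(values):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve), trapezoidal rule."""
    shares = lorenz_curve(values)
    n = len(values)
    # Each of the n segments has width 1 / n on the population axis.
    area = sum((shares[i] + shares[i + 1]) / 2.0 for i in range(n)) / n
    return 1.0 - 2.0 * area


print(gini_from_lorenz([1, 1, 1, 1]))  # perfectly equal values
print(gini_from_lorenz([0, 0, 0, 1]))  # one value holds the whole total
```

With equal values the Lorenz curve is the diagonal and the coefficient is 0; when a single value holds the whole total, the coefficient approaches 1 as the sample grows.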
