DOC: sklearn.metrics.auc_score should mention that using probabilities will give better scores #1393
Closed
Description
The documentation at: http://scikit-learn.org/dev/modules/generated/sklearn.metrics.auc_score.html#sklearn.metrics.auc_score
says that y_score can be either probability estimates of the positive class or binary decisions.
It should warn the reader that when binary decisions are passed, the AUC is computed as if the classifier only ever returned the scores 0 and 1, so the result can underestimate the "real" AUC that the probability estimates would give.
Here is an example:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import cross_validation
from sklearn import datasets
data = datasets.load_digits()
X, y = data.data, data.target
# make the classification problem binary
X = X[(y == 8) | (y == 6)]
y = y[(y == 8) | (y == 6)]
clf = LogisticRegression(C=0.001)
k_fold = cross_validation.KFold(len(y), k=10, indices=True, shuffle=True, random_state=18)
AUCs = []
AUCs_proba = []
for train, test in k_fold:
    clf.fit(X[train], y[train])
    AUCs.append(metrics.auc_score(y[test], clf.predict(X[test])))
    AUCs_proba.append(metrics.auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
print "AUCs: "
print AUCs
print "AUCs (with probabilities): "
print AUCs_proba

This is the output:
AUCs:
[1.0, 0.97222222222222221, 1.0, 0.97058823529411764, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
AUCs (with probabilities):
[1.0, 1.0, 1.0, 0.99673202614379086, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

I admit this is not a very good example, as the difference between AUCs and AUCs_proba could be a lot bigger in practice, but I wanted to use a built-in data set.
Note that in every fold the AUC computed from binary decisions is at most the AUC computed with probability estimates, never above it.
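The effect is easier to see on a tiny hand-made example than on the digits data. This sketch uses the current API, where auc_score has since been renamed roc_auc_score; the scores and threshold below are made up purely for illustration. Thresholding the scores at 0.5 collapses them to 0/1, turning correctly ordered pairs into ties and lowering the AUC:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and continuous scores (made up for illustration).
y_true = [0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.6, 0.55, 0.9]

# The same scores thresholded at 0.5, i.e. what predict() would return.
binary = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = roc_auc_score(y_true, proba)    # ~0.833
auc_binary = roc_auc_score(y_true, binary)  # ~0.583

print(auc_proba, auc_binary)
```

With continuous scores, 5 of the 6 (negative, positive) pairs are ranked correctly, giving AUC 5/6; after thresholding, several pairs become ties (counted as 1/2), dropping the AUC to 3.5/6.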