sklearn.metrics.consensus_score potentially gives wrong results #2445
Description
Hi!
sklearn.metrics.consensus_score() can give wrong scores when the two results being compared contain different numbers of biclusters. The cause is the final line of the function:
return np.trace(matrix[:, indices[:, 1]]) / max(n_a, n_b)
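This uses np.trace under the assumption that matrix (the similarity matrix) is square, so that after reordering the columns the matched pairs lie on its diagonal.
However, when matrix is non-square (i.e., n_b != n_a in the code), this assumption can break: the trace pairs row i with column indices[i, 1], which is only correct if the assigned rows indices[:, 0] happen to be 0, 1, 2, ... in order. I have an example dataset that shows such a case, deposited at https://www.dropbox.com/sh/plmsqof84xhtxry/7lIrdvX0mp . To reproduce, run: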
import numpy as np
import sklearn.metrics

# Row and column indicator arrays for the two sets of biclusters
a_rows = np.loadtxt("/home/tom/a_rows.txt")
a_cols = np.loadtxt("/home/tom/a_cols.txt")
b_rows = np.loadtxt("/home/tom/b_rows.txt")
b_cols = np.loadtxt("/home/tom/b_cols.txt")
print(sklearn.metrics.consensus_score((a_rows, a_cols), (b_rows, b_cols)))
This gives a consensus score of ~0.328, whereas the correct score should be ~0.529.
The bug can be fixed by replacing the last line of the function with:
return matrix[indices[:, 0], indices[:, 1]].sum() / max(n_a, n_b)
(I can send a pull request if necessary, but since it's just a single-line fix I'm not sure it's worth it.)
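For anyone who cannot download the dataset, here is a minimal sketch of the same effect on a toy case. Two things here are assumptions and not taken from the scikit-learn source: the 3x2 similarity matrix is made up, and scipy.optimize.linear_sum_assignment stands in for the assignment step inside consensus_score (it returns the assigned rows and columns as two separate arrays, not as one indices array).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up 3x2 similarity matrix: 3 biclusters in a, 2 in b.
matrix = np.array([[0.1, 0.1],
                   [0.9, 0.2],
                   [0.3, 0.8]])
n_a, n_b = matrix.shape

# Stand-in for the assignment step: maximise similarity by
# minimising the cost 1 - similarity.
rows, cols = linear_sum_assignment(1.0 - matrix)
# Best matching here: row 1 -> col 0 and row 2 -> col 1 (row 0 unmatched).

# Current last line: the trace pairs row i with column cols[i]
# and never looks at which rows were actually assigned.
trace_score = np.trace(matrix[:, cols]) / max(n_a, n_b)

# Proposed last line: index with both the assigned rows and columns.
fixed_score = matrix[rows, cols].sum() / max(n_a, n_b)

print(trace_score, fixed_score)
```

In this toy case the trace version adds matrix[0, 0] and matrix[1, 1], giving (0.1 + 0.2) / 3 = 0.1, while the proposed version adds the matched entries matrix[1, 0] and matrix[2, 1], giving (0.9 + 0.8) / 3 ≈ 0.567.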