sklearn.metrics.consensus_score potentially gives wrong results #2445

@untom

Hi!

sklearn.metrics.consensus_score() gives wrong scores if the two results to be compared contain different numbers of biclusters. This is because the function contains as its final line:

return np.trace(matrix[:, indices[:, 1]]) / max(n_a, n_b)

which uses np.trace under the assumption that matrix (the similarity matrix) is square, so that after reordering the columns the most similar pairs lie on its diagonal.

However, when matrix is non-square (i.e., n_b != n_a in the code), this fails. I have an example dataset that shows such a case, available at https://www.dropbox.com/sh/plmsqof84xhtxry/7lIrdvX0mp. To reproduce:

import numpy as np
import sklearn.metrics

# Row and column indicator matrices for the two sets of biclusters
a_rows = np.loadtxt("/home/tom/a_rows.txt")
a_cols = np.loadtxt("/home/tom/a_cols.txt")
b_rows = np.loadtxt("/home/tom/b_rows.txt")
b_cols = np.loadtxt("/home/tom/b_cols.txt")
print(sklearn.metrics.consensus_score((a_rows, a_cols), (b_rows, b_cols)))

This gives a consensus score of ~0.328; however, the correct score should be ~0.529.

The bug can be fixed by changing the last line of the function to:

return matrix[indices[:, 0], indices[:, 1]].sum() / max(n_a, n_b)
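To illustrate the difference between the two return lines, here is a minimal sketch on a made-up 3x2 similarity matrix. The matrix values and the brute-force `best_assignment` helper are hypothetical and only stand in for the assignment step inside consensus_score; the sketch assumes `indices` holds one (index in a, index in b) pair per row.

```python
import itertools
import numpy as np

# Hypothetical similarity matrix: 3 biclusters in a, 2 in b (non-square).
matrix = np.array([
    [0.1, 0.2],
    [0.9, 0.3],
    [0.4, 0.8],
])
n_a, n_b = matrix.shape


def best_assignment(sim):
    """Brute-force the one-to-one matching with maximal total similarity.

    Illustration only: exponential cost, suitable for tiny matrices.
    Returns an array whose rows are (row index, column index) pairs.
    """
    n_rows, n_cols = sim.shape
    k = min(n_rows, n_cols)
    best_total, best_pairs = -np.inf, None
    for rows in itertools.combinations(range(n_rows), k):
        for cols in itertools.permutations(range(n_cols), k):
            total = sim[list(rows), list(cols)].sum()
            if total > best_total:
                best_total = total
                best_pairs = np.column_stack([rows, cols])
    return best_pairs


indices = best_assignment(matrix)  # rows 1 and 2 of a are matched, row 0 is not

# Original return line: the trace reads rows 0 and 1, not the matched rows.
trace_score = np.trace(matrix[:, indices[:, 1]]) / max(n_a, n_b)

# Proposed return line: index the matched (row, column) pairs directly.
indexed_score = matrix[indices[:, 0], indices[:, 1]].sum() / max(n_a, n_b)

print(indices)
print(trace_score, indexed_score)
```

In this toy case the unmatched bicluster is row 0, so the trace sums entries that do not belong to the matching, while indexing with both columns of `indices` sums exactly the matched pairs.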

(I can send a pull request if necessary, but since it's just a single-line fix I'm not sure it's worth it.)
