FIX helper to check multilabel types #1985
jnothman wants to merge 39 commits into scikit-learn:master
Conversation
I'm not sure about all the terminology and codes returned.

Can you have a look at #1643? I would use something like

I don't mind removing the size checks, and I'm happy to change

And after looking at #1643, I'll adopt

I am thinking the same: it's a source of confusion and makes code more

Yes, this can be removed.
sklearn/utils/multiclass.py
Can you get rid of the magic *?
This is not necessary here.
This function doesn't return a boolean, so could rename this function
get_something, e.g. get_label_type or get_classification_type?
What I mean is that if you have a get_classification_type function,
you can easily compare the format of y_pred and y_true without
using the magic *:
>>> ind = np.array([[0, 1], [1, 0]])
>>> seq = [[0], [0, 2]]
>>> get_classification_type(ind)
"multilabel-indicator-matrix"
>>> get_classification_type(seq)
"multilabel-sequence-of-sequence"
>>> get_classification_type(ind) == get_classification_type(seq)
False
With the given name, you return a string output and a boolean output.
That's counter-intuitive.
Could you add tests in

Since you want to add the support for list of list of string, I think that you should extend the invariance test for multilabel data format in

I did support binary and multiclass, just not in the case of multiple. Your comments are fair. I think the returned strings should be:
I thought this might be okay in utils, but can change it.
List of list of string was already supported, effectively (it passed
maximally:

```python
def type_of_target(y):
    if is_sequence_of_sequences(y):
        return 'multilabel-sequences'
    elif is_label_indicator_matrix(y):
        return 'multilabel-indicator'
    y = np.asarray(y)
    if issubclass(y.dtype.type, np.float) and np.any(y != y.astype(int)):
        return 'regression'  # or 'continuous'? 'real'? delete this case?
    if len(np.unique(y)) == 2:
        return 'binary'
    else:
        return 'multiclass'
```
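As a self-contained sketch, that dispatch can be run with toy stand-ins for the two multilabel helpers (the stand-in bodies below are illustrative, not the actual `sklearn.utils.multiclass` implementations, and `np.issubdtype` replaces the float check for portability):

```python
import numpy as np

def is_sequence_of_sequences(y):
    # Toy stand-in: a non-empty list whose rows are themselves lists or tuples.
    return (isinstance(y, list) and len(y) > 0
            and all(isinstance(row, (list, tuple)) for row in y))

def is_label_indicator_matrix(y):
    # Toy stand-in: a 2-d array containing only 0s and 1s.
    y = np.asarray(y)
    return y.ndim == 2 and set(np.unique(y)) <= {0, 1}

def type_of_target(y):
    if is_sequence_of_sequences(y):
        return 'multilabel-sequences'
    elif is_label_indicator_matrix(y):
        return 'multilabel-indicator'
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.floating) and np.any(y != y.astype(int)):
        return 'regression'   # or 'continuous'? 'real'?
    return 'binary' if len(np.unique(y)) == 2 else 'multiclass'

print(type_of_target([[0], [0, 2]]))               # multilabel-sequences
print(type_of_target(np.array([[0, 1], [1, 0]])))  # multilabel-indicator
print(type_of_target([0, 1, 1, 0]))                # binary
print(type_of_target([0.5, 1.2, 0.3]))             # regression
```

Note the ordering matters: a sequence of sequences is checked before the indicator-matrix test, so `[[0], [0, 2]]` is never coerced to an array.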
And maybe something like this would be useful to help metrics pass invariance tests (this is untested):

```python
def _check_clf_targets(y_true, y_pred, encode=False, ravel=True):
    y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
    type_true = type_of_target(y_true)
    type_pred = type_of_target(y_pred)
    labels = unique_labels(y_true, y_pred)
    if type_true.startswith('multilabel'):
        if not type_pred.startswith('multilabel'):
            raise ValueError("Can't handle mix of multilabel "
                             "and multiclass targets")
        if type_true != type_pred:
            enc = LabelBinarizer()
            enc.fit([labels.tolist()])
            y_true = enc.transform(y_true)
            y_pred = enc.transform(y_pred)
            type_true = type_pred = 'multilabel-indicator'
    elif type_pred.startswith('multilabel'):
        raise ValueError("Can't handle mix of multilabel "
                         "and multiclass targets")
    elif 'regression' in (type_true, type_pred):
        raise ValueError("Can't handle continuous targets")
    else:
        if 'multiclass' in (type_true, type_pred):
            # 'binary' can be removed
            type_true = type_pred = 'multiclass'
        y_true, y_pred = _check_1d_array(y_true, y_pred, ravel=ravel)
    if encode and type_true != 'multilabel-indicator':
        enc = LabelEncoder()
        enc.fit([labels.tolist()] if type_true.startswith('multilabel')
                else labels.tolist())
        y_true = enc.transform(y_true)
        y_pred = enc.transform(y_pred)
    return labels, type_true, y_true, y_pred
```
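The key move in that helper, binarizing a sequence-of-sequences target so it can be compared element-wise with an indicator-matrix counterpart, can be illustrated without LabelBinarizer (`to_indicator` below is a hypothetical toy, not the sklearn class):

```python
import numpy as np

def to_indicator(y_seq, classes):
    """Binarize a sequence-of-sequences target into a 0/1 indicator matrix."""
    index = {label: j for j, label in enumerate(classes)}
    out = np.zeros((len(y_seq), len(classes)), dtype=int)
    for i, labels in enumerate(y_seq):
        for label in labels:
            out[i, index[label]] = 1
    return out

classes = [0, 1, 2]
y_true = [[0, 1], [2]]              # multilabel-sequences
y_pred = np.array([[1, 1, 0],       # already multilabel-indicator
                   [0, 0, 1]])
y_true_ind = to_indicator(y_true, classes)
print((y_true_ind == y_pred).all())  # the two formats now compare directly
```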
I like your
Will it simplify anything to support the binary case or regression case?
Well, Regression? (Or "continuous"... or "real-valued".) I can't think of anything, except to produce a ValueError for categorical metrics if the output of
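As a sketch of that ValueError idea (`accuracy` and `is_continuous` below are hypothetical toys, not the metrics module's code):

```python
import numpy as np

def is_continuous(y):
    # Toy check mirroring the 'regression' branch: float dtype, non-integer values.
    y = np.asarray(y)
    return np.issubdtype(y.dtype, np.floating) and np.any(y != y.astype(int))

def accuracy(y_true, y_pred):
    # A categorical metric that refuses continuous targets up front.
    for y in (y_true, y_pred):
        if is_continuous(y):
            raise ValueError("Can't handle continuous targets")
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

print(accuracy([0, 1, 1], [0, 1, 0]))
```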
Okay, have another look.
sklearn/metrics/metrics.py
You can add your name in the authors of
sklearn/metrics/metrics.py
Is it necessary to have this option at the moment?
Is it used in the codebase?
Rebased on master.
In terms of performance, I guess the slow checks in PRF benchmark (10 classes; multiclass
These results make me wonder how long binarizing takes for multiclass data and whether it's worth doing in general, and the answer seems to be "almost certainly". @arjoly might be interested...
Thanks for the benchmark!
I take back "almost certainly": of course,
That's multiclass 1e6 samples. What about sequences of sequences 1e3 samples?
Well, that's a nice reduction. Perhaps we should always binarize sequences of sequences in PRF. |
sklearn/utils/multiclass.py
Apparently this fails for np.int8, etc. Fix and tests coming soon.
I should note that much of that binarizing time is taken up by finding the unique labels in
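For intuition, label discovery can be timed separately from the binarize step itself (`binarize` here is a naive broadcast-based toy, not the sklearn code, and absolute numbers depend on the machine):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
y = rng.randint(10, size=1000000)      # 1e6 multiclass targets, 10 classes

def binarize(y, classes):
    # One 0/1 column per class, via a single broadcast comparison.
    return (y[:, None] == classes[None, :]).astype(np.int8)

classes = np.unique(y)                 # the unique-label discovery step
t_unique = timeit.timeit(lambda: np.unique(y), number=5)
t_binarize = timeit.timeit(lambda: binarize(y, classes), number=5)
print('np.unique: %.3fs   binarize: %.3fs' % (t_unique, t_binarize))
```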
Thanks @jnothman for the rebase and the benchmarks. I think we are good to merge. 👍 on my side.
Well, what've we got to lose? Merged as fd3bd2d.
@arjoly: let the rebasing begin!
Thanks a lot @jnothman !!! :-)
Fixes bugs related to simply checking type(..) equality and refactors.
@arjoly does this seem reasonable?