ENH support multilabel targets in LabelEncoder#1987
ENH support multilabel targets in LabelEncoder#1987jnothman wants to merge 9 commits intoscikit-learn:masterfrom
Conversation
Also, support 1d-array of sequences as a multilabel format
This allows LabelBinarizer to take advantage of LabelEncoder efficiency, and for both to share future features.
|
I've edited |
|
(But apparently passing |
sklearn/utils/testing.py
Outdated
There was a problem hiding this comment.
I think that you mean
assert_array_equal(first_el, second_el,
'In sequence of sequence element {}'
'{}'.format(i, err_msg))
There was a problem hiding this comment.
You're quite right. Fixed.
|
This was PRed prematurely without tests, doc updates, etc. It is now ready for review. I can remove the inheritance of |
|
Why have you chosen to extend |
That's a good reason for me to take this bit out of this PR. One reason is a "label binarizer" IS a label encoder. Another reason is that there would be nothing in |
This reverts commit 8669605. Conflicts: sklearn/preprocessing.py sklearn/tests/test_preprocessing.py
|
Okay. That's been removed from this PR. We'll talk about it later ;) |
There was a problem hiding this comment.
What is the advantage of this new function compare to assert_equal?
There was a problem hiding this comment.
Could we do without and use only arrays with dtype=object? That way we make sure in the tests that we are manipulating arrays, and not lists.
There was a problem hiding this comment.
i have discussed arrays of dtype object below. assert_equal won't work if the outer or inner sequence is an array.
|
General remark about this PR: can we stick to arrays rather than lists? |
Please excuse me for not yet reading the detailed comments here. I think this and related issues are more important. In short: I wish. I had originally implemented this etc. as producing an array of arrays. But it turned out they were explicitly not supported by However, the problem is bigger than this. Firstly on the matter of arrays / not arrays: currently, we are making a distinction between arrays and array-likes in order to support multilabel target types. For example, to not be confused for a sequence of sequences, a label indicator matrix or multilabel-multioutputs must be an array. A sequence of sequences until now could not be constituted of arrays. I dislike this very much, but have been writing code to support it, naively assuming it was agreed on (somehow, some day, somewhere). However, the problem is bigger than this. Supporting the sequence of sequences multilabel type makes a mess. On top of the data type problem above, this is for three reasons:
In conclusion, I propose (@arjoly) that we abandon support for sequence of sequences-style multilabel data (except perhaps for import/export), and when we do choose to support it, to use Footnotes: (*) sparse def sparse_and(y1, y2):
y1 = y1.tocoo()
y2 = y2.tocoo()
y = y1.copy()
y.rows, y.cols = zip(*set(zip(y1.rows, y1.cols)) & set(zip(y2.rows, y2.cols)))
y.data = np.ones(y.cols.shape)
return yor arithmetically in def sparse_and(y1. y2):
y1 = y1.tocsr()
y2 = y2.tocsr()
y = (y1 + y2)
y.data[y.data < 2] = 0
y.eliminate_zeros()
return yAgain this illustrates that |
I should note that one way to keep a sparse multilabel format is to explicitly make them arrays of sets (which is what they should have been all along): >>> numpy.asarray([set([1])])
array([set([1])], dtype=object) |
|
[And if we go for supporting an Very, very roughly: class aos_matrix(numpy.ndarray):
"""Represents a sparse boolean matrix as an array of sets
"""
def __new__(self, sets, n_labels):
# TODO
@property
def shape(self):
return len(self._sets), self._n_labels
ndim = 1
@staticmethod
def _vectorize(func, otypes='O'):
vfunc = np.vectorize(func, otypes=otypes)
def wrapper(*args):
return vfunc(*[aos_matrix(arg) for arg in args])
return wrapper
__and__ = _vectorize(set.intersection)
__or__ = _vectorize(set.union)
__xor__ = _vectorize(set.symmetric_difference)
sizes = _vectorize(len, otypes='i')
def sum(self, axis=None):
if axis is None:
return self.sizes().sum()
elif axis == 1:
return self.sizes()
elif axis == 0:
res = np.zeros(self._n_labels)
for s in self:
res[s] += 1
else:
raise ValueErrorLet polymorphism handle the underlying data type. |
Also, support 1d-array of arrays (with the outer array having
dtype=object) as a multilabel format and theirnp.vectorized processing.This makes it easy to support multilabel data with string (or non-consecutive / negative-inclusive) labels. In particular,
bincountand related operations that can be used on LabelEncoded multiclass targets can also be used with multilabel data.Also added inheritance of
LabelBinarizerfromLabelEncoder.