ENH support multilabel targets in LabelEncoder#1987

Closed
jnothman wants to merge 9 commits into scikit-learn:master from jnothman:multilabel_labelencoder

Conversation

@jnothman
Member

Also, support a 1d array of arrays (with the outer array having dtype=object) as a multilabel format, and process them with np.vectorize.

This makes it easy to support multilabel data with string (or non-consecutive / negative-inclusive) labels. In particular, bincount and related operations that can be used on LabelEncoded multiclass targets can also be used with multilabel data.
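To illustrate the bincount point: LabelEncoder's multiclass encoding is essentially np.unique with return_inverse, so a minimal sketch (plain NumPy, not the PR's code) of counting encoded labels looks like this:

```python
import numpy as np

# Sorted unique labels become the classes; return_inverse gives each
# sample its consecutive integer code, so bincount counts per class.
y = np.array(["spam", "ham", "spam", "eggs"])
classes, encoded = np.unique(y, return_inverse=True)
print(classes)               # ['eggs' 'ham' 'spam']
print(encoded)               # [2 1 2 0]
print(np.bincount(encoded))  # [1 1 2] -- per-class counts
```

The PR's claim is that the same trick should work once multilabel targets are label-encoded too.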

Also added inheritance of LabelBinarizer from LabelEncoder.

Also, support 1d-array of sequences as a multilabel format
This allows LabelBinarizer to take advantage of LabelEncoder efficiency, and for both to share future features.
@jnothman
Member Author

I've edited LabelBinarizer to have LabelEncoder as a parent class. This reduces duplicate implementation of class-to-index stuff, which means adding a classes parameter to LabelBinarizer affects both (c.f. #1643).

@jnothman
Member Author

(But apparently passing test_preprocessing doesn't mean the changes pass all tests...)

Member


I think that you mean

        assert_array_equal(first_el, second_el,
                           'In sequence of sequence element {}: '
                           '{}'.format(i, err_msg))

Member Author


You're quite right. Fixed.

@jnothman
Member Author

This was PRed prematurely without tests, doc updates, etc.

It is now ready for review.

I can remove the inheritance of LabelBinarizer from LabelEncoder if that is deemed wise.

@arjoly
Member

arjoly commented May 28, 2013

Why have you chosen to extend LabelEncoder instead of creating an abstract base class?

@jnothman
Member Author

Why have you chosen to extend LabelEncoder instead of creating an abstract base class?

That's a good reason for me to take this bit out of this PR. One reason is a "label binarizer" IS a label encoder. Another reason is that there would be nothing in LabelEncoder if there were an ABC. But I guess that happens elsewhere in scikit-learn.

This reverts commit 8669605.

Conflicts:

	sklearn/preprocessing.py
	sklearn/tests/test_preprocessing.py
@jnothman
Member Author

Okay. That's been removed from this PR. We'll talk about it later ;)

Member


What is the advantage of this new function compared to assert_equal?

Member


Could we do without and use only arrays with dtype=object? That way we make sure in the tests that we are manipulating arrays, and not lists.

Member Author


I have discussed arrays of dtype object below. assert_equal won't work if the outer or inner sequence is an array.

@GaelVaroquaux
Member

General remark about this PR: can we stick to arrays rather than lists?

@jnothman
Member Author

General remark about this PR: can we stick to arrays rather than lists?

Please excuse me for not yet reading the detailed comments here. I think this and related issues are more important.

In short: I wish. I had originally implemented this as producing an array of arrays, but it turned out such arrays were explicitly not supported by is_multilabel. (In the end, I modify is_multilabel anyway to accept sequences of arrays, and produce a list of arrays, because vectorize was only more obfuscated than map.)

One problem with arrays in this context is that if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects, and dtype=object doesn't help. You have to go: y_out = np.zeros(len(y), dtype=object); y_out[:] = y. Even though we may provide a helper function (jnothman@96dbd77#L1R141), this behaviour potentially throws up pitfalls for users, so I think it's better to work explicitly with lists, lest users think it's safe to make their own arrays from an arbitrary sequence of sequences.
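The asarray pitfall described above can be shown directly; this is a minimal sketch in plain NumPy (not code from this PR), using an element-by-element fill as the workaround:

```python
import numpy as np

# Pitfall: equal-length inner sequences collapse into a 2d array,
# even when dtype=object is requested.
a = np.asarray([[1, 2], [3, 4]], dtype=object)
print(a.shape)  # (2, 2) -- not a 1d array of lists

# Workaround: allocate a 1d object array, then fill it per element.
y = [[1, 2], [3, 4]]
y_out = np.empty(len(y), dtype=object)
for i, row in enumerate(y):
    y_out[i] = row
print(y_out.shape)  # (2,) -- a true 1d array whose elements are lists
```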

However, the problem is bigger than this. First, on the matter of arrays vs. non-arrays: currently, we distinguish between arrays and array-likes in order to support multilabel target types. For example, to avoid being confused for a sequence of sequences, a label indicator matrix or multilabel multioutput must be an array, and until now a sequence of sequences could not consist of arrays. I dislike this very much, but have been writing code to support it, naively assuming it was agreed on (somehow, some day, somewhere).

Second, supporting the sequence of sequences multilabel type makes a mess. On top of the data type problem above, this is for three reasons:

  • multilabel support is in its early days; aren't we better off supporting just one (dense) format until we're happy with that? How many classes (and samples, at what sparsity) do we need before s-o-s is a substantially more efficient choice than a label indicator matrix?
  • everything supporting multilabel targets is written in duplicate and it's a pain to manage.
  • they are essentially just sparse lil matrices with all(data == 1). We are reinventing the wheel, without scipy.sparse's tested (if not always efficient) and data structure-abstracted implementations of sum, transpose, etc. (lil is often the wrong choice for our operations, and lil encapsulates the handling of array(dtype=object) to avoid those pitfalls). One problem with using scipy.sparse now is its poor support for boolean operations (support sparse boolean matrices and logical operations (Trac #639) scipy/scipy#1166). I imagine that for metrics we can get by with sum, __and__ and __xor__; we would need to implement the last two ourselves (*). Nonetheless, like sparse support for X, I think we should only gradually support sparse multilabel y.
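To make the last bullet concrete, here is a hedged sketch (not scikit-learn code) of a sequence of sequences rendered as the equivalent scipy.sparse LIL indicator matrix, with per-label counts coming for free:

```python
import numpy as np
import scipy.sparse as sp

# A sequence of label sequences and its equivalent indicator matrix:
# row i has a 1 in column j iff label j is present in sample i.
y_sos = [[0, 2], [1], [0, 1, 2]]
n_labels = 3

Y = sp.lil_matrix((len(y_sos), n_labels), dtype=int)
for i, labels in enumerate(y_sos):
    Y[i, labels] = 1

print(Y.toarray())
# [[1 0 1]
#  [0 1 0]
#  [1 1 1]]
print(Y.sum(axis=0))  # per-label counts, provided by scipy.sparse
```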

In conclusion, I propose (@arjoly) that we abandon support for sequence-of-sequences-style multilabel data (except perhaps for import/export), and when we do support a sparse multilabel format, use scipy.sparse (**).


Footnotes:

(*) sparse __and__ and __xor__ (assuming all data is 1) can be implemented easily using set operations over coo matrices:

import numpy as np
import scipy.sparse as sp

def sparse_and(y1, y2):
    y1 = y1.tocoo()
    y2 = y2.tocoo()
    # coo matrices store coordinates in .row/.col; intersect them as sets
    common = set(zip(y1.row, y1.col)) & set(zip(y2.row, y2.col))
    rows, cols = zip(*common) if common else ((), ())
    return sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=y1.shape)

or arithmetically in CSR (or CSC) format:

def sparse_and(y1, y2):
    y1 = y1.tocsr()
    y2 = y2.tocsr()
    y = y1 + y2
    # entries present in both matrices sum to 2; drop the rest
    y.data[y.data < 2] = 0
    y.eliminate_zeros()
    return y

Again this illustrates that lil is often not our best choice.
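A matching __xor__ can be sketched in the same arithmetic style: shared entries sum to 2 and are dropped, while entries unique to one matrix sum to 1 and are kept. This is a sketch for the purposes of this thread, not library code:

```python
import numpy as np
import scipy.sparse as sp

def sparse_xor(y1, y2):
    # assuming all stored data is 1: shared entries sum to 2, drop them
    y = y1.tocsr() + y2.tocsr()
    y.data[y.data > 1] = 0
    y.eliminate_zeros()
    return y

a = sp.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
b = sp.csr_matrix(np.array([[1, 1, 0], [0, 1, 0]]))
print(sparse_xor(a, b).toarray())
# [[0 1 1]
#  [0 0 0]]
```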
(**) with the annoyance that data filled with 1 or True will be stored unnecessarily.

@jnothman
Member Author

can we stick to arrays rather than lists?

if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects

I should note that one way to keep a sparse multilabel format is to explicitly make them arrays of sets (which is what they should have been all along):

>>> numpy.asarray([set([1])])
array([set([1])], dtype=object)

@jnothman
Member Author

[And if we go for supporting an array of sets, it should be wrapped by an object that supports the methods we care about from the label indicator matrix, essentially reimplementing a special case of scipy.sparse:

Very, very roughly:

class aos_matrix(numpy.ndarray):
    """Represents a sparse boolean matrix as an array of sets
    """
    def __new__(cls, sets, n_labels):
        # TODO

    @property
    def shape(self):
        return len(self._sets), self._n_labels

    ndim = 1

    @staticmethod
    def _vectorize(func, otypes='O'):
        vfunc = np.vectorize(func, otypes=otypes)
        def wrapper(*args):
            return vfunc(*[aos_matrix(arg) for arg in args])
        return wrapper

    __and__ = _vectorize(set.intersection)
    __or__ = _vectorize(set.union)
    __xor__ = _vectorize(set.symmetric_difference)
    sizes = _vectorize(len, otypes='i')

    def sum(self, axis=None):
        if axis is None:
            return self.sizes().sum()
        elif axis == 1:
            return self.sizes()
        elif axis == 0:
            res = np.zeros(self._n_labels)
            for s in self:
                res[list(s)] += 1  # index by the labels in each set
            return res
        else:
            raise ValueError("axis must be None, 0 or 1")

Let polymorphism handle the underlying data type.
]
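The vectorized set operations that sketch relies on already work on plain 1d object arrays of sets, without any wrapper class; for example (assuming nothing beyond NumPy):

```python
import numpy as np

# 1d object arrays of sets; sets are not sequences, so asarray-style
# collapsing into a 2d array is not a concern here.
a = np.empty(2, dtype=object)
a[:] = [{0, 1}, {2}]
b = np.empty(2, dtype=object)
b[:] = [{1}, {2, 3}]

# Elementwise intersection and per-element sizes via np.vectorize.
v_and = np.vectorize(set.intersection, otypes='O')
print(v_and(a, b))  # [{1} {2}]

sizes = np.vectorize(len, otypes='i')
print(sizes(a))     # [2 1]
```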
