ENH support multilabel targets in LabelEncoder#1987

Closed
jnothman wants to merge 9 commits into scikit-learn:master from jnothman:multilabel_labelencoder

Conversation

@jnothman
Member

Also, support a 1d array of arrays (with the outer array having dtype=object) as a multilabel format, and process them with np.vectorize.

This makes it easy to support multilabel data with string (or non-consecutive / negative-inclusive) labels. In particular, bincount and related operations that can be used on LabelEncoded multiclass targets can also be used with multilabel data.
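To illustrate the bincount point: LabelEncoder's multiclass encoding is essentially np.unique with return_inverse, so a minimal sketch (plain NumPy, not the PR's code) of counting encoded labels looks like this:

```python
import numpy as np

# Sorted unique labels become the classes; return_inverse gives each
# sample its consecutive integer code, so bincount counts per class.
y = np.array(["spam", "ham", "spam", "eggs"])
classes, encoded = np.unique(y, return_inverse=True)
print(classes)               # ['eggs' 'ham' 'spam']
print(encoded)               # [2 1 2 0]
print(np.bincount(encoded))  # [1 1 2] -- per-class counts
```

The PR's claim is that the same trick should work once multilabel targets are label-encoded too.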

Also added inheritance of LabelBinarizer from LabelEncoder.

Also, support 1d-array of sequences as a multilabel format
This allows LabelBinarizer to take advantage of LabelEncoder efficiency, and for both to share future features.
@jnothman
Member Author

I've edited LabelBinarizer to have LabelEncoder as a parent class. This reduces duplicate implementation of class-to-index stuff, which means adding a classes parameter to LabelBinarizer affects both (c.f. #1643).

@jnothman
Member Author

(But apparently passing test_preprocessing doesn't mean the changes pass all tests...)

Member


I think that you mean

        assert_array_equal(first_el, second_el,
                           'In sequence of sequence element {}: '
                           '{}'.format(i, err_msg))

Member Author


You're quite right. Fixed.

@jnothman
Member Author

This was PRed prematurely without tests, doc updates, etc.

It is now ready for review.

I can remove the inheritance of LabelBinarizer from LabelEncoder if that is deemed wise.

@arjoly
Member

arjoly commented May 28, 2013

Why have you chosen to extend LabelEncoder instead of creating an abstract base class?

@jnothman
Member Author

Why have you chosen to extend LabelEncoder instead of creating an abstract base class?

That's a good reason for me to take this bit out of this PR. One reason is a "label binarizer" IS a label encoder. Another reason is that there would be nothing in LabelEncoder if there were an ABC. But I guess that happens elsewhere in scikit-learn.

This reverts commit 8669605.

Conflicts:

	sklearn/preprocessing.py
	sklearn/tests/test_preprocessing.py
@jnothman
Member Author

Okay. That's been removed from this PR. We'll talk about it later ;)

Member


What is the advantage of this new function compared to assert_equal?

Member


Could we do without and use only arrays with dtype=object? That way we make sure in the tests that we are manipulating arrays, and not lists.

Member Author


I have discussed arrays of dtype object below. assert_equal won't work if the outer or inner sequence is an array.

@GaelVaroquaux
Member

General remark about this PR: can we stick to arrays rather than lists?

@jnothman
Member Author

General remark about this PR: can we stick to arrays rather than lists?

Please excuse me for not yet reading the detailed comments here. I think this and related issues are more important.

In short: I wish. I had originally implemented this as producing an array of arrays, but it turned out such arrays were explicitly not supported by is_multilabel. (In the end, I modify is_multilabel anyway to accept sequences of arrays, and produce a list of arrays, because vectorize was only more obfuscated than map.)

One problem with arrays in this context is that if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects, and dtype=object doesn't help. You have to go: y_out = np.zeros(len(y), dtype=object); y_out[:] = y. Even though we may provide a helper function (jnothman@96dbd77#L1R141), this behaviour potentially throws up pitfalls for users, so I think it's better to work explicitly with lists, lest users think it's safe to make their own arrays from an arbitrary sequence of sequences.
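The asarray pitfall described above can be shown directly; this is a minimal sketch in plain NumPy (not code from this PR), using an element-by-element fill as the workaround:

```python
import numpy as np

# Pitfall: equal-length inner sequences collapse into a 2d array,
# even when dtype=object is requested.
a = np.asarray([[1, 2], [3, 4]], dtype=object)
print(a.shape)  # (2, 2) -- not a 1d array of lists

# Workaround: allocate a 1d object array, then fill it per element.
y = [[1, 2], [3, 4]]
y_out = np.empty(len(y), dtype=object)
for i, row in enumerate(y):
    y_out[i] = row
print(y_out.shape)  # (2,) -- a true 1d array whose elements are lists
```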

However, the problem is bigger than this. First, on the matter of arrays vs. non-arrays: currently, we distinguish between arrays and array-likes in order to support multilabel target types. For example, to avoid being confused for a sequence of sequences, a label indicator matrix or multilabel multioutput must be an array, and until now a sequence of sequences could not consist of arrays. I dislike this very much, but have been writing code to support it, naively assuming it was agreed on (somehow, some day, somewhere).

Second, supporting the sequence of sequences multilabel type makes a mess. On top of the data type problem above, this is for three reasons:

  • multilabel support is in its early days; aren't we better off supporting just one (dense) format until we're happy with that? How many classes (and samples, at what sparsity) do we need before s-o-s is a substantially more efficient choice than a label indicator matrix?
  • everything supporting multilabel targets is written in duplicate and it's a pain to manage.
  • they are essentially just sparse lil matrices with all(data == 1). We are reinventing the wheel, without scipy.sparse's tested (if not always efficient) and data structure-abstracted implementations of sum, transpose, etc. (lil is often the wrong choice for our operations, and lil encapsulates the handling of array(dtype=object) to avoid those pitfalls). One problem with using scipy.sparse now is its poor support for boolean operations (support sparse boolean matrices and logical operations (Trac #639) scipy/scipy#1166). I imagine that for metrics we can get by with sum, __and__ and __xor__; we would need to implement the last two ourselves (*). Nonetheless, like sparse support for X, I think we should only gradually support sparse multilabel y.
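To make the last bullet concrete, here is a hedged sketch (not scikit-learn code) of a sequence of sequences rendered as the equivalent scipy.sparse LIL indicator matrix, with per-label counts coming for free:

```python
import numpy as np
import scipy.sparse as sp

# A sequence of label sequences and its equivalent indicator matrix:
# row i has a 1 in column j iff label j is present in sample i.
y_sos = [[0, 2], [1], [0, 1, 2]]
n_labels = 3

Y = sp.lil_matrix((len(y_sos), n_labels), dtype=int)
for i, labels in enumerate(y_sos):
    Y[i, labels] = 1

print(Y.toarray())
# [[1 0 1]
#  [0 1 0]
#  [1 1 1]]
print(Y.sum(axis=0))  # per-label counts, provided by scipy.sparse
```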

In conclusion, I propose (@arjoly) that we abandon support for sequence-of-sequences-style multilabel data (except perhaps for import/export), and when we do support a sparse multilabel format, use scipy.sparse (**).


Footnotes:

(*) sparse __and__ and __xor__ (assuming all data is 1) can be implemented easily using set operations over coo matrices:

import numpy as np
import scipy.sparse as sp

def sparse_and(y1, y2):
    y1 = y1.tocoo()
    y2 = y2.tocoo()
    # coo matrices store coordinates in .row/.col; intersect them as sets
    common = set(zip(y1.row, y1.col)) & set(zip(y2.row, y2.col))
    rows, cols = zip(*common) if common else ((), ())
    return sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=y1.shape)

or arithmetically in CSR (or CSC) format:

def sparse_and(y1, y2):
    y1 = y1.tocsr()
    y2 = y2.tocsr()
    y = y1 + y2
    # entries present in both matrices sum to 2; drop the rest
    y.data[y.data < 2] = 0
    y.eliminate_zeros()
    return y

Again this illustrates that lil is often not our best choice.
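A matching __xor__ can be sketched in the same arithmetic style: shared entries sum to 2 and are dropped, while entries unique to one matrix sum to 1 and are kept. This is a sketch for the purposes of this thread, not library code:

```python
import numpy as np
import scipy.sparse as sp

def sparse_xor(y1, y2):
    # assuming all stored data is 1: shared entries sum to 2, drop them
    y = y1.tocsr() + y2.tocsr()
    y.data[y.data > 1] = 0
    y.eliminate_zeros()
    return y

a = sp.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
b = sp.csr_matrix(np.array([[1, 1, 0], [0, 1, 0]]))
print(sparse_xor(a, b).toarray())
# [[0 1 1]
#  [0 0 0]]
```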
(**) with the annoyance that data filled with 1 or True will be stored unnecessarily.

@jnothman
Member Author

can we stick to arrays rather than lists?

if we happen to have a sequence of sequences that are all the same length, calling np.asarray will transform them into a 2d array, not a 1d array of objects

I should note that one way to keep a sparse multilabel format is to explicitly make them arrays of sets (which is what they should have been all along):

>>> numpy.asarray([set([1])])
array([set([1])], dtype=object)

@jnothman
Member Author

[And if we go for supporting an array of sets, it should be wrapped by an object that supports the methods we care about from the label indicator matrix, essentially reimplementing a special case of scipy.sparse:

Very, very roughly:

class aos_matrix(numpy.ndarray):
    """Represents a sparse boolean matrix as an array of sets
    """
    def __new__(cls, sets, n_labels):
        # TODO

    @property
    def shape(self):
        return len(self._sets), self._n_labels

    ndim = 1

    @staticmethod
    def _vectorize(func, otypes='O'):
        vfunc = np.vectorize(func, otypes=otypes)
        def wrapper(*args):
            return vfunc(*[aos_matrix(arg) for arg in args])
        return wrapper

    __and__ = _vectorize(set.intersection)
    __or__ = _vectorize(set.union)
    __xor__ = _vectorize(set.symmetric_difference)
    sizes = _vectorize(len, otypes='i')

    def sum(self, axis=None):
        if axis is None:
            return self.sizes().sum()
        elif axis == 1:
            return self.sizes()
        elif axis == 0:
            res = np.zeros(self._n_labels)
            for s in self:
                res[list(s)] += 1  # index by the labels in each set
            return res
        else:
            raise ValueError("axis must be None, 0 or 1")

Let polymorphism handle the underlying data type.
]
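The vectorized set operations that sketch relies on already work on plain 1d object arrays of sets, without any wrapper class; for example (assuming nothing beyond NumPy):

```python
import numpy as np

# 1d object arrays of sets; sets are not sequences, so asarray-style
# collapsing into a 2d array is not a concern here.
a = np.empty(2, dtype=object)
a[:] = [{0, 1}, {2}]
b = np.empty(2, dtype=object)
b[:] = [{1}, {2, 3}]

# Elementwise intersection and per-element sizes via np.vectorize.
v_and = np.vectorize(set.intersection, otypes='O')
print(v_and(a, b))  # [{1} {2}]

sizes = np.vectorize(len, otypes='i')
print(sizes(a))     # [2 1]
```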
