[MRG+1] deprecate sequences of sequences multilabel support #2657
jnothman wants to merge 16 commits into scikit-learn:master
Conversation
I think sequences of sequences should also be deprecated from
My reasoning is that we really don't want to have code like that lying around. So we should deprecate sequences of sequences even in
@mblondel, anything that uses is_sequence_of_sequences will produce a warning.

On Thu, Dec 12, 2013 at 6:28 PM, Coveralls notifications@github.com wrote:
But you're right that I've not checked the Parameters sections of the comments. I shall do that.
Also, I see that we will in the future not need a full
Putting that PR on the 0.15 milestone as well. I will try to have a deeper look at it in the coming days.
Ideally, this should be pulled in with the sparse label indicator support PRs. But at a minimum it would be nice to start warning that the end is nigh...
So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed a 2-d y, right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.
Note that this is no longer the case since #1993 was fixed. |
Perhaps. I guess I'd need to review where it's used. Is it possible to consider the changes here and then work out what we want the function of
Then multilabel classification with an indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.
I don't understand why you think so. If I understand you correctly, this is tested at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_multiclass.py#L94 |
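For reference, a minimal sketch of the case under discussion: fitting OneVsRestClassifier directly on a label indicator matrix. This uses the current scikit-learn API, which may differ in detail from the 2013 code being reviewed; the data here is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 4 samples, 2 features, 3 labels given as an indicator matrix.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

# One binary classifier is fit per label column.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)  # predictions come back as an indicator matrix too
print(pred.shape)      # (4, 3)
```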
It's just that if the no-op behavior was removed from LabelBinarizer, I don't see how this line can work in the multilabel case with an indicator matrix:
That's because the no-op behaviour was removed because it wasn't affected

On Tue, Dec 17, 2013 at 6:42 PM, Mathieu Blondel wrote:
I think your implication is right, @mblondel: that
Having said this, I'm not really sure what the best way to sort out the various uses of label binarizer is, and whether it's worth having a utility that will binarize multiclass data, but pass through -- or change the negative value on -- an array that is already binarized (i.e. multilabel). |
@mblondel, I am no longer convinced that this is the right place to deprecate multilabel support in
As a result, I have changed this PR back to [MRG] and welcome reviews. (ping @arjoly, I know that you're working in a related space)
Can you rebase on top of master? I will try to have a look at this in the coming days.
edit: Ok, you used is_sequence to perform the deprecation. |
Looks like a hack. Can you comment on what you are doing here?
It is a hack; one that I've already used in CountVectorizer. I've commented it two lines prior, but maybe putting the following in utils is better:
```python
class AutoIncrementer(dict):
    def __missing__(self, key):
        out = len(self)
        self[key] = out
        return out
```

WDYT?
For what it's worth, this solution is almost twice as slow as the defaultdict hack, but I don't expect this assignment to dominate anywhere we use it:
```
%timeit a = AutoIncrementer(); sum(a[k] for k in range(100000))
10 loops, best of 3: 66.1 ms per loop
```
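For comparison, the "defaultdict hack" referred to in this thread (the pattern CountVectorizer uses) can be written as follows; the variable name `vocabulary` is just illustrative:

```python
from collections import defaultdict

# A dict that assigns the next free integer id to each new key on first access:
# default_factory is the dict's own __len__, so a missing key gets the current size.
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

print(vocabulary['apple'])  # 0
print(vocabulary['pear'])   # 1
print(vocabulary['apple'])  # 0 (already assigned, not incremented again)
```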
I don't have any strong opinion on this. Let's keep this version.
If there's a clear name for it, I'd be happy to refactor some utilities relating to this indexation or vocabulary construct (mapping from objects to contiguous integers from 0 to n_objects - 1). The operations that commonly happen are: create a dict that autoincrements; map that dict to an equivalent array/list; map back the other direction. I'm not sure if these would benefit from being abstracted behind functions.
Yet, for that second function, for example, we have used:

```python
l = sorted(d, key=d.__getitem__)
```

or similar to convert to a list, but the following is faster (linear complexity; faster in benchmarks):

```python
l = np.empty(len(d), dtype=object)
keys, vals = zip(*six.iteritems(d))
l[list(vals)] = keys
```

so putting it behind a named function might make sense.
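As a self-contained sketch of both inversion strategies (hypothetical helper name `invert_index`; Python 3, so `dict.items()` stands in for `six.iteritems`):

```python
import numpy as np

def invert_index(d):
    """Map a {object: position} dict back to an array of objects, in O(n)."""
    out = np.empty(len(d), dtype=object)
    keys, vals = zip(*d.items())
    out[list(vals)] = keys  # scatter each key into its assigned position
    return out

d = {'a': 0, 'c': 2, 'b': 1}
print(list(invert_index(d)))         # ['a', 'b', 'c']
print(sorted(d, key=d.__getitem__))  # same result, but O(n log n)
```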
If those patterns are repeated several times, it would be nice to do some code refactoring. But that would be the topic of another PR.
Yes, and I've been wondering whether I should submit a patch to perform parallelised feature extraction for text which would probably incorporate these sorts of helpers.
While

What do you think of
sklearn/preprocessing/label.py
Outdated
Can you add a small comment explaining what _transform and class_mapping are?
Also, avoid the stride trick.
LGTM! Thanks @jnothman
Maybe it would be useful to add a note giving the version at which the MultiLabelBinarizer was added. This can be done using the versionadded markup in Sphinx: http://sphinx-doc.org/markup/para.html
In this case the version will be 0.15.
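As an illustration, the versionadded markup suggested above would sit in the class docstring like this (a sketch; the actual docstring text is abbreviated):

```python
class MultiLabelBinarizer(object):
    """Transform between iterable of iterables and a multilabel format.

    .. versionadded:: 0.15
    """

# Sphinx renders the directive as a "New in version 0.15." note.
print('.. versionadded:: 0.15' in MultiLabelBinarizer.__doc__)  # True
```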
Hi @jnothman, does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository? I will soon be working on #2458 to fill in the missing pieces and be responsive to comments/reviews of the implementation. I would like to make this PR, which is pretty much complete, a prerequisite, because it will be easier to deprecate sequences of sequences now, before #2458 is implemented. Thanks!
GitHub suggests there shouldn't be any conflicts for a merge/rebase. Ideally we should just merge this PR ASAP, but it needs another review.
A last review would be very nice!
Checked this out, and had to rebase with master. Once that was done, I see one failing doctest. Can you rebase with master? This seems like a Very Good Thing (TM).
I would find it more natural to only use the object dtype for non-integer y. That would fix the doctest failure.
This could be implemented as (untested):

```python
if all(isinstance(c, int) for c in classes):
    dtype = np.int
else:
    dtype = object
self.classes_ = np.asarray(classes, dtype=dtype)
```
As long as we can assume consistent typing from the first element of the iterable of iterables.
Sorry, got confused about context.
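A runnable version of the dtype selection sketched in this thread, as a standalone helper (the name `classes_array` is hypothetical; the builtin `int` is used where the snippet above wrote `np.int`, since that alias has since been removed from NumPy):

```python
import numpy as np

def classes_array(classes):
    # Use an integer dtype only when every class label is a Python int;
    # otherwise fall back to object dtype (e.g. string or mixed labels).
    if all(isinstance(c, int) for c in classes):
        dtype = int
    else:
        dtype = object
    return np.asarray(list(classes), dtype=dtype)

print(classes_array([1, 2, 3]).dtype.kind)  # 'i' (integer)
print(classes_array(['a', 'b']).dtype)      # object
```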
I will merge #3246, which includes the fix for the int dtype, as soon as Travis is green.
Alright, merged! Thank you very much @jnothman for the fix.
Great, this is finally done! Congratulations @jnothman on all your efforts!
Thanks for pulling this through.
And I'm looking forward to the 0.15 beta. Is there sense in trying to include in it some support for a sparse multilabel format (e.g.
Towards #2270
Deprecation warnings are raised in type_of_target. Its helper, is_sequence_of_sequences, triggers the warning. A further warning applies to make_multilabel_classification. Adds sklearn.preprocessing.MultiLabelBinarizer. This will require a bit of updating once the alternative sparse matrix support is incorporated from #2458 and https://github.com/jnothman/scikit-learn/tree/sparse_multi_metrics ... or vice-versa.
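For readers landing here later: the MultiLabelBinarizer added by this PR replaces the deprecated sequence-of-sequences format with an explicit transform to an indicator matrix, e.g.:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each inner tuple is the set of labels for one sample.
Y = mlb.fit_transform([(1, 2), (3,)])

print(mlb.classes_)  # [1 2 3]
print(Y)
# [[1 1 0]
#  [0 0 1]]
```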