MRG added classes parameter to LabelBinarizer#1643
MRG added classes parameter to LabelBinarizer#1643amueller wants to merge 17 commits intoscikit-learn:masterfrom
classes parameter to LabelBinarizer#1643Conversation
|
I think you need a couple of tests :) Please don't forget to test the exceptions, including the actual message value for the one that displays the class difference. |
|
Definitely will do ;) |
|
Done :) |
|
Renamed to MRG |
|
Looks good to me. Could you please rebase on top of master to fix travis. +1 for merging when travis is green. |
|
seems like I broke multilabel stuff.. no idea how. will fix now. |
|
Ok, fixed. cc @mblondel ;) |
|
This is not fully backward compatible now :-/ there was an attribute |
|
You should specify what type of representation it will lead if you specify the What is the use case of the |
Misunderstood the doc. Don't take this comment into account. |
|
I'm not sure I understand the multilabel question. It is for multi-label input with prespecified classes. |
|
I think that you need to add a constructor argument for the |
|
|
|
Should be good now :) |
|
Ok, travis is happy now.... |
|
I don't think you treat the case Instead of booleans, it may be better to use strings? For instance: Otherwise, it looks good ! |
|
There is a test here. Do you think it is sufficient? |
Sorry, but I messed when writing the preceeding sentence. I mean
From my point of view, the string format in the attribute should reduce the number of nested Note that it would be cool to support sparse binary indicator matrix which would bring many more sparse format (csc, csr, ...). Honestly, do as you think it's the best. |
sklearn/preprocessing.py
Outdated
There was a problem hiding this comment.
Would you just infer it on transform? I wanted to match the current behavior when fit is called. Currently it is enforced that the data in transform is of the same kind (multilable or not) then during fit.
There was a problem hiding this comment.
Yes, I think inferring it in transform would be enough. But I don't really see why users would want to fit multiclass data and transform multilabel data.
There was a problem hiding this comment.
Ok, then I would also change the behavior when using fit to have it unified. I'm not sure who wrote this code in the first place. I guess either you or @larsmans? I also don't see much of a reason for the current behavior actually...
There was a problem hiding this comment.
Hum... so what do I do if someone calls inverse_transform before calling either fit or transform?
There was a problem hiding this comment.
Ok, I think I'm giving up.
Don't give up !! this pr is very useful.
There was a problem hiding this comment.
Ok, I think I'm giving up.
I gave my opinion. Now as always in scikit-learn, the majority votes. If
more people agree with you, I will respect the majority's opinion.
As far as I see it, there are 2 different issues. Whether to use
constructor or method parameter, and whether to use one string parameter or
two boolean parameters.
I think at least we agree on the latter.
There was a problem hiding this comment.
My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.
There was a problem hiding this comment.
On Tue, Feb 12, 2013 at 5:56 PM, Andreas Mueller
notifications@github.comwrote:
My issue is more that I don't know what to do with the input validation in
transform if I do a method parameter.
If the string takes the "auto" value by default, doesn't the same problem
exist for the constructor parameter too?
There was a problem hiding this comment.
The way it is implemented currently is that if fit is not called, there is a fall-back to multiclass. I.e. if you don't specify multi-label but pass a multi-label y to transform, an error is raised.
|
I replaced the two boolean arguments by a single string argument. I didn't adjust the tests yet. I did not replace the boolean attributes. Do you think I should do that, too? Replacing the I just also tried replacing the properties with strings and that led to one more level of @arjoly @mblondel which part did you expect to get simpler with string arguments? |
|
hm maybe I can fix it somehow... |
|
Ok, I think it is actually nicer now with strings :) |
|
@arjoly I introduced a new function |
|
Ok so I didn't deprecate the property |
|
This adds another property to the interface... |
sklearn/preprocessing.py
Outdated
|
Can you add a test to check that:
Can you also
|
|
Is there any other class except It is probably best if we do respect the ordering of the labels, as long as that is not incompatible with code in other places. |
|
The current code does a |
There was a problem hiding this comment.
Why is the case len(self.classes_) > 2 used here?
There was a problem hiding this comment.
because for two classes, the output is 1d.
|
Any news on this pr? |
|
sorry I'm swamped. I actually forgot the state of this. I'm trying to catch up currently. |
|
If I understand correctly, out-of- |
|
Also, I think parallel functionality should (as a rule) be available in |
Adds a
classesparameter toLabelBinarizer. I wanted that to addpartial_fitto the naive Bayes models.If
classesis specified, there is no need to callfit.This PR still needs tests.