[MRG + 1] Fix for OvR partial_fit bug#7786
Conversation
|
LGTM, thanks :) |
sklearn/tests/test_multiclass.py
Outdated
| # A new class value which was not in the first call of partial_fit | ||
| # It should raise ValueError | ||
| y1 = [5] + y[7:-1] | ||
| assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1) |
There was a problem hiding this comment.
Can you use assert_raises_regex to check the error message as well?
There was a problem hiding this comment.
but then you can drop this line.
|
Travis seems to fail... Could you fix that? |
|
@raghavrv It says |
|
I've not looked into it, but I've restarted the build for you... Let's see if it is some spurious error. |
|
Ok 👍 |
|
@raghavrv Thank you, Travis ran successfully. |
sklearn/multiclass.py
Outdated
| self.label_binarizer_ = LabelBinarizer(sparse_output=True) | ||
| self.label_binarizer_.fit(self.classes_) | ||
|
|
||
| if not set(self.classes_).issuperset(y): |
There was a problem hiding this comment.
for large y, iteration may be much slower than using np.setdiff1d..?
There was a problem hiding this comment.
Yes np.setdiff1d seems to be faster
There was a problem hiding this comment.
due to the use of np.setdiff1d it is not possible to partial_fit with a sparse y
sklearn/tests/test_multiclass.py
Outdated
| assert_equal(np.mean(pred == y), np.mean(pred1 == y)) | ||
|
|
||
| # Test when mini batches doesn't have all classes | ||
| # with SGDClassifier |
There was a problem hiding this comment.
As a reader of the test, it's not clear why you would want to test with both these base estimators.
Also: can we do it in a loop for clarity that they're the same test?
There was a problem hiding this comment.
Yes, happy for it to just be SGDClassifier
sklearn/tests/test_multiclass.py
Outdated
| # A new class value which was not in the first call of partial_fit | ||
| # It should raise ValueError | ||
| y1 = [5] + y[7:-1] | ||
| assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1) |
There was a problem hiding this comment.
but then you can drop this line.
|
Thanks, @srivatsan-ramesh |
|
Sorry, I forgot to have you write an entry in |
|
@jnothman Created a PR! |
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
|
Due to Both the commit ed08d38 by @srivatsan-ramesh and the reviewed version by @jnothman 5fd925a throw an error with a sparse Or is there an alternative, if I use if if np.setdiff1d(y.indices, self.classes_) |
|
Hmmm the docstring does say that sparse matrices are accepted for I guess it is not really worth using a sparse matrix for |
|
sparse y may be used to represent multilabel problems. It could be accepted On 15 November 2016 at 21:42, Loïc Estève notifications@github.com wrote:
|
I see, it does look like we are missing tests for this then ... it would be good to do something similar to #7590 for multilabel problems, i.e. test that sparse and dense |
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
* mini-batch can now contain less number of classes than actual data * added tests where mini batches doesn't contain all classes
Reference Issue
PR #6239 and issue #6203
What does this implement/fix? Explain your changes.
partial_fit()of OvR was not working properly when the mini-batches did not contain all the classes.The
LabelBinarizer.fit()should be called withclasses_parameter and not theyparameter ofpartial_fit()And added tests to check the
partial_fit()function.Any other comments?
The PR #6239 did not seem to be the correct fix.