[MRG+2] Sparse label_binarizer by hamsal · Pull Request #3203 · scikit-learn/scikit-learn

hamsal · 2014-05-26T14:56:40Z

This is a continuation of a part of PR #2458 based on the code already written by @rsivapr and @arjoly from the branch https://github.com/arjoly/scikit-learn/tree/sparse-label_binarizer.

The first steps are to write tests to motivate correctness and then implement the appropriate changes in label.py.

Initial Checklist:

label validation on LabelBinarizer() - with testing
label validation on label_binarizer() - with testing
Import changes for label.py from arjoly's branch sparse-label_binarizer
Import changes for test_label.py from arjoly's branch sparse-label_binarizer
Handle case where type_of_target gets a sparse matrix input
Debug: One Label Tests in sklearn/tests/common.py
Test is_label_indicator_matrix with sparse input
Test label_binarize with binary one class case
Break down _inverse_label_binarize

Rebase Checklist:

Rebase on top of PR [MRG+1] deprecate sequences of sequences multilabel support #2657 changes in label.py
Rebase on top of PR [MRG+1] deprecate sequences of sequences multilabel support #2657 changes in utils/multiclass.py
Reintegrate sparse_output tests in test_label.py
Reintegrate sparse_output tests in utils/test_multiclass.py
Reintegrate coverage boosting tests in test_label.py, test_multiclass.py
Include changes to Misc. files guided by make errors
Documentation updates
Deprecate parameters in label.py as in previous changes

hamsal · 2014-05-27T21:57:35Z

I am trying to figure out how to handle the case where the input y to transform() or label_binarize() is a sparse matrix. This might call for extending sparse support for utils.multiclass.type_of_target() since it currently fails on sparse matrices.

jnothman · 2014-05-27T22:06:35Z

Supporting sparse in type_of_target is definitely something that's needed,
and is present in Rohit's PR (and was copied into my sparse metrics PR).
However, it's altogether not clear that LabelBinarizer should accept
binarized input in the first place, but there are arguments to do so.

hamsal · 2014-05-28T14:23:53Z

Sparse support for type_of_target has been included, based on the comments in Rohit's PR. As it stands, the support for sparse input only serves to validate that the input sparse matrix is correctly constructed; there is no conversion from any other format to multilabel-indicator. This does seem more correct than failing when the input is a sparse multilabel-indicator matrix.
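For illustration, a minimal sketch of what sparse support in type_of_target looks like from the caller's side. It assumes a recent scikit-learn release, where `sklearn.utils.multiclass.type_of_target` accepts SciPy sparse matrices; the matrix `Y` is made-up example data, not taken from this PR.

```python
import scipy.sparse as sp
from sklearn.utils.multiclass import type_of_target

# Made-up 3-sample, 3-label indicator matrix stored in CSR format.
Y = sp.csr_matrix([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])

# The sparse matrix is only inspected, not converted to another format.
target_type = type_of_target(Y)
print(target_type)
```

The call classifies the input rather than transforming it, which matches the "validate only" behavior described above.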

hamsal · 2014-05-28T20:29:01Z

I am getting some numerical precision issues in the doctests that I think may be unrelated to the changes of this PR. Advice on how to proceed would be appreciated. I am running on a 64 bit machine but the numbers calculated contain fewer significant figures.

arjoly · 2014-05-28T20:39:06Z

If I understand the failure correctly, a piece of code uses the label binarizer with the multilabel argument, but you deprecated this parameter. I would simply correct the function call in the appropriate place.

hamsal · 2014-05-28T21:27:16Z

The multilabel error has been fixed, but the precision errors have come back on Travis CI.

arjoly · 2014-05-29T09:43:24Z

There is probably a change in the dtype which causes the small difference.
If you don't find the cause, you can add the ellipsis directive in the docstring.
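As a rough sketch of the ellipsis directive mentioned here: the standard-library doctest module lets an expected output end in `...` so that trailing digits are not compared. The function `mean_ratio` is a hypothetical stand-in, not code from this PR.

```python
import doctest


def mean_ratio():
    """Return a float whose trailing digits may differ between platforms.

    >>> mean_ratio()  # doctest: +ELLIPSIS
    0.3333...
    """
    return 1.0 / 3.0


# Parse and run only this docstring so the example is self-contained.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    mean_ratio.__doc__, {"mean_ratio": mean_ratio}, "mean_ratio", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

Without the `+ELLIPSIS` directive, the example would only pass when the printed float matched the expected text exactly.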

arjoly · 2014-05-29T09:45:26Z

sklearn/preprocessing/label.py

Can you rename this parameter to sparse_output, to be consistent with the other modules? Can you also add a comma after boolean?

coveralls · 2014-05-29T15:29:06Z

Coverage decreased (-5.08%) when pulling b4db6f2 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

hamsal · 2014-05-29T15:32:08Z

Thanks for the point to the ellipsis directive! The sparse parameter has been renamed sparse_output. The unnecessary helper function was removed, and dependencies on self.multilabel_ were removed.
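For readers following along, a short sketch of how the renamed parameter is used. It assumes a recent scikit-learn release in which `label_binarize` takes `classes` and `sparse_output` as keyword arguments; the labels are made-up example data.

```python
from sklearn.preprocessing import label_binarize

# Made-up multiclass labels and the full set of known classes.
y = [1, 6, 4, 1]
classes = [1, 2, 4, 6]

# Default: dense NumPy array with one column per class.
dense = label_binarize(y, classes=classes)

# sparse_output=True requests a SciPy sparse matrix with the same content.
sparse = label_binarize(y, classes=classes, sparse_output=True)

print(sparse.shape)
print((sparse.toarray() == dense).all())
```

The sparse result stores only the positive entries, which matters when the number of classes is large.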

The test coverage needs to be improved.

arjoly · 2014-05-29T15:56:22Z

sklearn/preprocessing/label.py

I would avoid doing any kind of magic with the label inference. What is the current behavior?

Ok this was revised in a recent commit, it was based on some testing error.

arjoly · 2014-05-29T16:02:49Z

Hm apparently you deleted and added some files.

arjoly · 2014-05-29T16:17:24Z

You forgot the ellipsis ... :-)

arjoly · 2014-05-29T16:22:29Z

sklearn/utils/multiclass.py

You need tests for this.

coveralls · 2014-05-29T17:07:31Z

Coverage decreased (-5.08%) when pulling b4db6f2 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

coveralls · 2014-05-29T18:09:07Z

Coverage decreased (-2.66%) when pulling 69cd58e on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

arjoly · 2014-05-29T20:28:53Z

To undo properly the file deletion, you can use:

$ git filter-branch --force --index-filter 'git checkout master path/to/file' --prune-empty master..this-branch

But use it with care: it rewrites the history of the branch.

arjoly · 2014-05-29T20:29:56Z

I don't understand how you get -2.66% of decreased coverage.

hamsal · 2014-05-29T20:58:21Z

I did not change code in my last commit and yet coverage changed significantly which is curious.

hamsal · 2014-05-29T21:23:24Z

About the files: they were not deletions, they were changes of the access mode in the OS. I accidentally ran git add on those four files, so the access mode change appeared in the diff. I undid the changes to the files and I believe the situation is OK now.

coveralls · 2014-05-29T21:34:09Z

Coverage decreased (-2.66%) when pulling c99f1b9 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

arjoly · 2014-06-25T19:51:40Z

Apparently, you have merge conflicts.

Looks good to me, switch the title to MRG+1

jnothman · 2014-06-25T20:06:01Z

Yes, I think I am also +1 for merge. Thank you @hamsal.

One small comment: there are a few tests, in this PR and elsewhere, that iterate through a number of sparse formats. In a new PR after this one is merged, could we please make this list a constant in sklearn.utils.testing? Alternatively, it may be a better idea to just use one format (e.g. CSR) and then exploit the .asformat method of sparse matrices, as (almost) in sklearn/decomposition/tests/test_truncated_svd.py. In any case, the set of supported matrix formats should be centralised.
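A minimal sketch of the second suggestion, building one CSR matrix and converting it with `.asformat`. The constant `SPARSE_FORMATS` and the helper `iter_sparse_formats` are hypothetical names for illustration; they are not existing members of sklearn.utils.testing.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical centralised list of the sparse formats that tests should cover.
SPARSE_FORMATS = ("csr", "csc", "coo", "lil", "dok", "bsr", "dia")

dense = np.array([[0, 1, 0],
                  [1, 0, 1]])
base = sp.csr_matrix(dense)


def iter_sparse_formats(matrix, formats=SPARSE_FORMATS):
    """Yield (format name, converted matrix) pairs from a single base matrix."""
    for fmt in formats:
        yield fmt, matrix.asformat(fmt)


checked = []
for fmt, converted in iter_sparse_formats(base):
    # Each conversion must keep the content and report the requested format.
    assert converted.format == fmt
    assert (converted.toarray() == dense).all()
    checked.append(fmt)
```

Keeping the format list in one place means a test only needs to construct its data once, in whichever format is most convenient.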

arjoly · 2014-06-25T20:08:31Z

> One small comment: there are a few tests, in this PR and elsewhere, that iterate through a number of sparse formats. In a new PR after this one is merged, could we please make this list a constant in sklearn.utils.testing? Alternatively, it may be a better idea to just use one format (e.g. CSR) and then exploit the .asformat method of sparse matrices, as (almost) in sklearn/decomposition/tests/test_truncated_svd.py. In any case, the set of supported matrix formats should be centralised.

+1 for your proposition @jnothman. Can you open an issue or make a pull request?
Edit: I mean for the list of sparse formats.

jnothman · 2014-06-25T20:16:05Z

merging

jnothman · 2014-06-25T20:16:36Z

merged as 2bfca14

hamsal · 2014-06-25T20:19:59Z

In my git rebases and history edits it looks like there are 3 commits in this PR that just came into the PR after the most recent rebase. So I don't think this was ready for a merge! Please look at the commit history and see that there are three commits shared by @jnothman and @arjoly.

Edit: These commits apparently also introduced the two new files in the files changed that have to do with the tree classifier.

hamsal · 2014-06-25T20:23:27Z

I should have made it clearer that I was also attempting to fix errors. I do not think the most recent Travis build passed.

jnothman · 2014-06-25T20:28:33Z

Oh! Whoops... What do you mean by there being commits shared by jnothman and arjoly?

jnothman · 2014-06-25T20:29:34Z

You are right, a test fails. Could you please submit a patch?

hamsal · 2014-06-25T20:29:51Z

The commit history has three commits, two from you and one from Arnaud.

Arnaud Joly arjoly MAINT reduce amount of boiler code using standard C operations bd7db7b

jnothman jnothman DOC give example of binarizing binary targets … eec979c

jnothman jnothman DOC Note shape of binary binarized output 34872e9

Yes, I can submit a patch for the Travis error. I am not sure what to do about these stray commits or whether they are a problem.

jnothman · 2014-06-25T20:30:55Z

don't worry about that. it's squashed to one commit.

hamsal · 2014-06-25T20:31:12Z

Ok good to know

jnothman · 2014-06-25T20:32:20Z

But the test error makes it seem like the label binarizer is outputting non-integer labels. That should probably be locally tested rather than triggering an error in test_multiclass.py
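One way such a local check could look, as a sketch only. It assumes a recent scikit-learn release and made-up integer labels; `roundtrip_labels` is a hypothetical helper, not a test from this PR or from test_label.py.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer


def roundtrip_labels(y, sparse_output):
    """Binarize y, then map the indicator matrix back to labels."""
    lb = LabelBinarizer(sparse_output=sparse_output)
    Y = lb.fit_transform(y)
    return Y, lb.inverse_transform(Y)


y = np.array([3, 1, 2, 3])
for sparse_output in (False, True):
    Y, y_back = roundtrip_labels(y, sparse_output)
    # Both the indicator matrix and the recovered labels should stay integer.
    assert np.issubdtype(Y.dtype, np.integer), Y.dtype
    assert np.issubdtype(y_back.dtype, np.integer), y_back.dtype
    assert np.array_equal(y_back, y)
```

A check like this fails next to the binarizer itself instead of surfacing indirectly in another module's tests.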

jnothman · 2014-06-25T20:33:25Z

I'm very sorry to have not properly tested before pushing, so hopefully we can fix this quickly.

hamsal · 2014-06-25T20:35:22Z

This error has been confusing me because I cannot recreate it locally. I have rebased and I have verified that my numpy is newer than 1.6.2, so for now the only place I have seen it is on Travis.

Edit: I have recreated it locally

hamsal · 2014-06-25T20:46:19Z

I am working on the patch here #3314

I have fixed the error locally, Travis is building it now.

hamsal · 2014-06-25T22:12:54Z

The patch has been merged, I apologize for the headache and thank you for the reviews!

arjoly · 2014-06-26T05:45:02Z

Thanks for the good work! 🍻

hamsal closed this May 28, 2014

hamsal reopened this May 28, 2014

hamsal closed this May 29, 2014

hamsal reopened this May 29, 2014

Hamzeh Alsalhi added 9 commits June 25, 2014 15:27

Testing revisions for coverage in labels _inverse_binarize_multiclass()

f6384fe

Pep8 fixes

819443d

Sparse argmax implementation in label.py revised

0c26966

Replaced scipy sparse .max with backported function from utils fixes

d40e6d9

Formatting revisions to _inverse_binarize_multiclass

1fd993d

Simplified loop expression in _inverse_binarize_multiclass

594a84f

Travis restart

e4708f9

Reused row_nnz in the loop expression in _inverse_binarize_multiclass and made formatting and wording revisions

6769d49

Noted the purpose of the code in _inverse_binarize_multiclass for sparse argmax, and (row_max.ravel() <= 0) -> (row_max.ravel() == 0)

abed94d

arjoly changed the title ~~[MRG] Sparse label_binarizer~~ [MRG+1] Sparse label_binarizer Jun 25, 2014

jnothman changed the title ~~[MRG+1] Sparse label_binarizer~~ [MRG+2] Sparse label_binarizer Jun 25, 2014

jnothman closed this Jun 25, 2014
