Skip to content

[MRG+2] Sparse label_binarizer#3203

Closed
hamsal wants to merge 41 commits intoscikit-learn:masterfrom
hamsal:sprs-lbl-bin
Closed

[MRG+2] Sparse label_binarizer#3203
hamsal wants to merge 41 commits intoscikit-learn:masterfrom
hamsal:sprs-lbl-bin

Conversation

@hamsal
Copy link
Copy Markdown
Contributor

@hamsal hamsal commented May 26, 2014

This is a continuation of a part of PR #2458 based on the code already written by @rsivapr and @arjoly from the branch https://github.com/arjoly/scikit-learn/tree/sparse-label_binarizer.

The first steps are to write tests to motivate correctness and then implement the appropriate changes in label.py.

Initial Checklist:

  • label validation on LabelBinarizer() - with testing
  • label validation on label_binarizer() - with testing
  • Import changes for label.py from arjolys branch sparse-label_binarizer
  • Import changes for test_label.py from arjolys branch sparse-label_binarizer
  • Handle case where type_of_target gets a sparse matrix input
  • Debug: One Label Tests in sklearn/tests/common.py
  • Test is_label_indicator_matrix with sparse input
  • Test label_binarize with binary one class case
  • Break down _inverse_label_binarize

Rebase Checklist:

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 27, 2014

I am trying to figure out how to handle the case where the input y to transform() or label_binarize() is a sparse matrix. This might call for extending sparse support for utils.multiclass.type_of_target() since it currently fails on sparse matrices.

@jnothman
Copy link
Copy Markdown
Member

Supporting sparse in type_of_target is definitely something that's needed,
and is present in Rohit's PR (and was copied into my sparse metrics PR).
However, it's altogether not clear that LabelBinarizer should accept
binarized input in the first place, but there are arguments to do so.

On 28 May 2014 07:57, hamsal notifications@github.com wrote:

I am trying to figure out how to handle the case where the input y to
transform() or label_binarize() is a sparse matrix. This might call for
extending sparse support for utils.multiclass.type_of_target() since it
currently fails on sparse matrices.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3203#issuecomment-44340966
.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 28, 2014

Sparse support for type_of_target has been included based on the comments in Rohit's PR. The support for sparse input as it is right now will really only serve to validate the input sparse matrix as being correctly constructed there is no conversion from any format that is not multilabel-indicator to multilabel-indicator. This does seem more correct then failing on the input when it is a sparse multilabel-indicator matrix.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 28, 2014

I am getting some numerical precision issues in the doctests that I think may be unrelated to the changes of this PR. Advice on how to proceed would be appreciated. I am running on a 64 bit machine but the numbers calculated contain fewer significant figures.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 28, 2014

If i understand correctly the failure, a piece od code uses the label binarizer with the multilabel argument. But you seprecatr this parameter. I would simply correct the function call in the appropriate place.

@hamsal hamsal closed this May 28, 2014
@hamsal hamsal reopened this May 28, 2014
@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 28, 2014

The multilabel error has been fixed, the precision errors have come back on the Travis CI.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 29, 2014

There is probably a change in the dtype which cause the small difference.
If you don't find the cause, you can add the ellipsis directive in the docstring.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this parameter sparse_output to be consistent with other module. Can you add also a comma after boolean?

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-5.08%) when pulling b4db6f2 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 29, 2014

Thanks for the point to the ellipsis directive! The sparse parameter has been renamed sparse_output. The unnecessary helper function was removed, and dependencies on self.multilabel_ were removed.

The test coverage needs to be improved.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid doing any kind of magic with the label inference. What is the current behavior?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this was revised in a recent commit, it was based on some testing error.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 29, 2014

Hm apparently you deleted and added some files.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 29, 2014

You forgot the ellipsis ... :-)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need tests for this.

@hamsal hamsal closed this May 29, 2014
@hamsal hamsal reopened this May 29, 2014
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-5.08%) when pulling b4db6f2 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-2.66%) when pulling 69cd58e on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 29, 2014

To undo properly the file deletion, you can use:

$ git filter-branch --force --index-filter 'git checkout master path/to/file' --prune-empty master..this-branch

But use with care, it rewrite the history of the branch.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented May 29, 2014

I don't understand how you get -2.66% of decreased coverage.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 29, 2014

I did not change code in my last commit and yet coverage changed significantly which is curious.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented May 29, 2014

About the files, they were not deletions they were changes of the access mode in the OS. I accidentally did git add on those four files so the access mode change appeared in the differential, I undid the changes to the files and I believe the situation is OK now.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-2.66%) when pulling c99f1b9 on hamsal:sprs-lbl-bin into 1e64b7b on scikit-learn:master.

@arjoly
Copy link
Copy Markdown
Member

arjoly commented Jun 25, 2014

Apparently, you have merge conflicts.

Looks good to me, switch the title to MRG+1

@arjoly arjoly changed the title [MRG] Sparse label_binarizer [MRG+1] Sparse label_binarizer Jun 25, 2014
@jnothman
Copy link
Copy Markdown
Member

Yes, we I think I am also +1 for merge. Thank you @hamsal.

One small comment: there are a few tests, in this PR and elsewhere, that iterate through a number of sparse formats. In a new PR after this one is merged, could we please make this list a constant in sklearn.utils.testing? Alternatively, it may be a better idea to just use one format (e.g. CSR) and then exploit the .asformat method of sparse matrices, as (almost) in sklearn/decomposition/tests/test_truncated_svd.py. In any case, the set of supported matrix formats should be centralised.

@jnothman jnothman changed the title [MRG+1] Sparse label_binarizer [MRG+2] Sparse label_binarizer Jun 25, 2014
@arjoly
Copy link
Copy Markdown
Member

arjoly commented Jun 25, 2014

One small comment: there are a few tests, in this PR and elsewhere, that iterate through a number of sparse formats. In a new PR after this one is merged, could we please make this list a constant in sklearn.utils.testing? Alternatively, it may be a better idea to just use one format (e.g. CSR) and then exploit the .asformat method of sparse matrices, as (almost) in sklearn/decomposition/tests/test_truncated_svd.py. In any case, the set of supported matrix formats should be centralised.

+1 for your proposition @jnothman. Can you open an issue or make a pull request?
edit (I mean for the list of sparse format)

@jnothman
Copy link
Copy Markdown
Member

merging

@jnothman jnothman closed this Jun 25, 2014
@jnothman
Copy link
Copy Markdown
Member

merged as 2bfca14

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

In my git rebases and history edits it looks like there are 3 commits in this PR that just came into the PR after the most recent rebase. So I don't think this was ready for a merge! Please look at the commit history and see that there are three commits shared by @jnothman and @arjoly.

Edit: These commits apparently also introduced the two new files in the files changed that have to do with the tree classifier.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

I should have made it clearer that I was also attempting to fix errors, I do not think the most recent Travis passed.

@jnothman
Copy link
Copy Markdown
Member

Oh! Whoops... What do you mean by there being commits shared by jnothman and arjoly?

@jnothman
Copy link
Copy Markdown
Member

You are right a test fails. Could you please submit a patch?

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

The commit history has three commits, two from you and one from Arnaud.

Arnaud Joly arjoly MAINT reduce amount of boiler code using standard C operations bd7db7b

jnothman jnothman DOC give example of binarizing binary targets … eec979c

jnothman jnothman DOC Note shape of binary binarized output 34872e9

Yes I can submit a patch for the Travis error, I am not sure what to do about these stary commits or if they are a problem.

@jnothman
Copy link
Copy Markdown
Member

don't worry about that. it's squashed to one commit.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

Ok good to know

@jnothman
Copy link
Copy Markdown
Member

But the test error makes it seem like the label binarizer is outputting non-integer labels. That should probably be locally tested rather than triggering an error in test_multiclass.py

@jnothman
Copy link
Copy Markdown
Member

I'm very sorry to have not properly tested before pushing, so hopefully we can fix this quickly.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

This error has been confusing me because I can not recreate it locally, I have rebased and I have verified that my numpy is over 1.6.2 So for now the only place I have seen it is on Travis.

Edit: I have recreated it locally

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

I am working on the patch here #3314

I have fixed the error locally, Travis is building it now.

@hamsal
Copy link
Copy Markdown
Contributor Author

hamsal commented Jun 25, 2014

The patch has been merged, I apologize for the headache and thank you for the reviews!

@arjoly
Copy link
Copy Markdown
Member

arjoly commented Jun 26, 2014

Thanks for the good work! 🍻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants