[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support#2828

Closed
jnothman wants to merge 2 commits into scikit-learn:master from jnothman:make_multi_fast2

Conversation

@jnothman
Member

@jnothman jnothman commented Feb 5, 2014

This is a more focussed version of #2773.

make_multilabel_classification is currently very slow for large n_features, and only outputs dense matrices although it effectively generates a sparse structure.

Although this function could certainly be optimised further, the per-sample Python iteration will remain a bottleneck, so I don't think there's much value in further optimisation.
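To illustrate the sparse structure the generator effectively produces, here is a minimal, hypothetical sketch of per-sample generation that stores only nonzero feature counts; it is not scikit-learn's actual implementation, and the label-to-vocabulary mapping is invented purely for illustration:

```python
import random

def sample_document(n_features, n_classes, n_labels, length, rng):
    """Hypothetical sketch: draw a set of active labels, then draw
    `length` word occurrences from those labels' vocabularies,
    keeping only nonzero counts (a naturally sparse row)."""
    labels = rng.sample(range(n_classes), n_labels)
    counts = {}  # feature index -> count; only nonzero entries stored
    words_per_class = n_features // n_classes
    for _ in range(length):
        label = rng.choice(labels)
        # each label "owns" a contiguous vocabulary slice here,
        # purely for illustration
        word = label * words_per_class + rng.randrange(words_per_class)
        counts[word] = counts.get(word, 0) + 1
    return labels, counts

rng = random.Random(0)
labels, counts = sample_document(n_features=1000, n_classes=5,
                                 n_labels=2, length=50, rng=rng)
# at most `length` of the 1000 features are nonzero
```

Storing rows as dicts like this is exactly the situation where a dense n_samples x n_features array wastes both time and memory once n_features grows large.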

@coveralls

Coverage Status

Coverage remained the same when pulling f66a70e on jnothman:make_multi_fast2 into 74d3fcf on scikit-learn:master.

@jnothman
Member Author

jnothman commented Feb 5, 2014

@arjoly, you've basically already reviewed this. I just needed a clean sweep. I hope you don't mind giving it another scan.

Member

Could you provide a better name for n?

@jnothman
Member Author

Okay. I'll make the style changes you ask for.

Member

You have a name clash with the original n_labels; maybe n_nonzero_labels would do the job.

@jnothman
Member Author

What did you want wrt the ECML citation?

@arjoly
Member

arjoly commented Feb 12, 2014

What did you want wrt the ECML citation?

Nothing, that was a reference to show that the benchmark I made is sensible.

@jnothman
Member Author

Gotcha.

@jnothman
Member Author

So does this get your +1, @arjoly?


@arjoly
Member

arjoly commented Feb 12, 2014

+1 :-)
thanks for your work !!!

@jnothman
Member Author

Thanks.


Member

Maybe the output type should be detailed a bit here. Reading this docstring, I might get things wrong. Or maybe an example in the docstring? I am trying to see how we can keep users from getting confused and misusing this function.

Member

By the way, would a correct description of this function be that it transforms a multi-label 'y' into a multi-output 'y' by performing a one-hot encoding? It's a description full of jargon, so it shouldn't be the only description, but if it is indeed correct, it might help some users understand the function.

Member Author

This PR is about whether the returned X is sparse or not. There's no one-hot encoding. You're right, however, that the Y description is a bit lacking (but the return_indicator parameter isn't terrible).
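For context, the indicator format being discussed for Y can be sketched as follows; this is an illustrative stand-alone helper, not the generator's own code:

```python
def to_indicator(label_sets, n_classes):
    """Convert a list of label sets, e.g. [{0, 2}, {1}], into a binary
    indicator matrix with one row per sample and one column per class."""
    return [[1 if c in labels else 0 for c in range(n_classes)]
            for labels in label_sets]

Y = to_indicator([{0, 2}, {1}, set()], n_classes=3)
# Y == [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
```

The empty set in the example corresponds to an unlabeled sample, which the generator can produce when unlabeled samples are allowed.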

Member

but the return_indicator parameter isn't terrible

I would happily drop support for the tuple-of-lists-of-labels output.

Member Author

There are open PRs which deprecate that support across the board, including here.

@GaelVaroquaux
Member

General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can very well be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions.

Maybe also an additional example in the set of examples? For instance in the applications directory.

@arjoly
Member

arjoly commented Mar 24, 2014

The narrative document about sample generators could indeed be improved.

@jnothman
Member Author

Seeing as I've only really played with a sparse make_multilabel_classification for benchmarking, I'm not sure what to propose as an example. The issue was more that it had real scalability problems with large n_features.

I think you're right, though, that the sample generators are complex and magical, and could do with more in the narrative documentation.

@GaelVaroquaux
Member

Seeing as I've only really played with a sparse make_multilabel_classification
for benchmarking, I'm not sure what to propose as an example.

Topic predictions in documents? Do we have a multi-label text dataset?

The reason that I think that an example is very important is that it will
help us identify whether our API makes sense from an application perspective
or not. I find 'example-driven-development' to be very important.

@jnothman
Member Author

Do we have a multi-label text dataset?

We do, and there is no shortage of them available.

But make_multilabel_classification only very roughly approximates
something like this, because:

  • words should be distributed according to a power law
  • the artificial classes being generated don't share a "base vocabulary"
  • words shouldn't be sampled from each label uniformly given a document and
    its set of labels.
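To make the first point concrete, a power-law (Zipf-like) word distribution can be sampled with the standard library alone. This is an illustrative sketch of what realistic text generation would need, not something the generator currently does; the exponent s=1.1 is an arbitrary choice for the example:

```python
import random

def zipf_weights(vocab_size, s=1.1):
    """Unnormalised Zipf weights: the word of rank r gets weight 1 / r**s,
    so a few high-rank words account for most occurrences."""
    return [1.0 / (r ** s) for r in range(1, vocab_size + 1)]

rng = random.Random(0)
weights = zipf_weights(vocab_size=10000)
# draw 50 word indices; low indices (frequent words) dominate
words = rng.choices(range(10000), weights=weights, k=50)
```

Uniform sampling within each label's vocabulary, by contrast, gives every word the same expected frequency, which is one reason the generated data only roughly resembles real documents.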

So I think part of the answer is: make_multilabel_classification doesn't
reflect a real-world problem well. I guess that means it doesn't make for a
very useful learning benchmark. Should we be trying to generate something
that looks more like real text data? How would we know that it does?

This PR doesn't aim to change that functionality, but at least makes it
operate in reasonable time given reasonable parameters.


@arjoly
Member

arjoly commented Apr 1, 2014

General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can very well be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions.

Would it be enough to add one more example in the docstring?

Maybe also an additional example in the set of examples? For instance in the applications directory.

There is already one example using this dataset. Maybe this could be done in #3001.

This PR doesn't aim to change that functionality, but at least makes it
operate in reasonable time given reasonable parameters.

+1 Let's stay focused.

@arjoly arjoly mentioned this pull request May 19, 2014
@arjoly
Member

arjoly commented May 19, 2014

any last review?

@arjoly
Member

arjoly commented Jun 30, 2014

@hamsal Here you have a fast version of make_multilabel_classification for your benchmarks.

@hamsal hamsal mentioned this pull request Jun 30, 2014
@ogrisel
Member

ogrisel commented Jul 2, 2014

I did some quick histogram plots of features generated by this PR vs master and they look the same. +1 for merge!

@jnothman
Member Author

jnothman commented Jul 2, 2014

Are you saying they generate the same features? That would surprise me. Or are you saying they generate the same character of dataset? Yes, they should. Whether it's a meaningful dataset is another matter. Thanks for the +1.

@ogrisel
Member

ogrisel commented Jul 2, 2014

Not exactly the same distribution, but similar-looking histograms.

@arjoly
Member

arjoly commented Jul 8, 2014

Thus we are at +2 for merge.

@arjoly arjoly changed the title from [MRG+1] make_multilabel_classification for large n_features: faster and sparse output support to [MRG+2] make_multilabel_classification for large n_features: faster and sparse output support on Jul 8, 2014
@arjoly
Member

arjoly commented Jul 8, 2014

Merged by rebase.
Thanks @jnothman !

@arjoly arjoly closed this Jul 8, 2014
@arjoly
Member

arjoly commented Jul 8, 2014

I should have checked the build before pushing. :-/
I will send a fix.

@arjoly
Member

arjoly commented Jul 8, 2014

I have pushed a patch.
