[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support#2828
[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support#2828jnothman wants to merge 2 commits intoscikit-learn:masterfrom
Conversation
…parse output support
|
@arjoly, you've basically already reviewed this. I just needed a clean sweep. I hope you don't mind giving it another scan. |
There was a problem hiding this comment.
Could you provide a better name for n ?
|
Okay. I'll make the style changes you ask for. |
There was a problem hiding this comment.
You have a name clash with the original n_labels, maybe n_nonzero_labels would dot the job.
|
What did you want wrt the ECML citation? |
Nothing, that was a reference to show that the benchmark I made is sensible. |
|
Gotcha. |
|
So does this get your +1, @arjoly? On 13 February 2014 00:22, Arnaud Joly notifications@github.com wrote:
|
|
+1 :-) |
|
Thanks. On 13 February 2014 08:54, Arnaud Joly notifications@github.com wrote:
|
There was a problem hiding this comment.
Maybe here the output type should be detailed a bit. Reading this docstring, I might get things wrong. Or maybe an example in the docstring? I am trying to see how we can avoid users getting confused and getting this function wrong.
There was a problem hiding this comment.
By the way, would a correct description of this function be that it transforms a multi-label 'y' into a multi-output 'y' by performing a one-hot encoding? It's a description full of jargon, so it shouldn't be the only description, but if it is indeed correct, it might help some users understanding the function.
There was a problem hiding this comment.
Thos PR is about whether the returned X is sparse or not. There's no one-hot encoding. You're right, however, that the Y description is a bit lacking (but the return_indicator parameter isn't terrible).
There was a problem hiding this comment.
but the return_indicator parameter isn't terrible
I would happily drop support for the support of tuple of list of label.
There was a problem hiding this comment.
There are open PRs which deprecate that support across the board, including here.
|
General comment: it think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can very well be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these usecases to never find these functions. Maybe also an additional example in the set of examples? For instance in the applications directory. |
|
The narrative document about sample generators could indeed be improved. |
|
Seeing as I've only really played with a sparse I think you're right, though, that the sample generators are complex and magical, and could do with more in the narrative documentation. |
Topic predictions in documents? Do we have a multi-label text dataset? The reason that I think that an example is very important is that it will |
We do, and there are no shortage of them available. But
So I think part of the answer is: make_multilabel_classification doesn't This PR doesn't aim to change that functionality, but at least makes it On 24 March 2014 21:35, Gael Varoquaux notifications@github.com wrote:
|
Would it be enough to add one more example in the docstring?
There is already one example using this dataset. Maybe this could be done in #3001.
+1 Let's stay focus. |
|
any last review? |
|
@hamsal Here you have a fast version of |
|
I did some quick histogram plots of features generated by this PR vs master and they look the same. +1 for merge! |
|
Are you saying they generate the same features? That's surprising to me? Or are you saying they generate the same character of dataset? Yes, they should. Whether it's a meaningful dataset is another matter. Thanks for the +1. |
|
Not exactly the same distribution but similarly looking histograms. |
|
Thus we are at +2 for merged. |
|
Merged by rebase. |
|
I should have check the build before pushing. :-/ |
|
I have pushed a patch. |
This is a more focussed #2773.
make_multilabel_classificationis currently very slow for largen_features, and only outputs dense matrices although it effectively generates a sparse structure.Although this function can certainly be more optimised, the per-sample Python iteration will slow it down I don't think there's much value in further optimisation.