[MRG] ENH fast make_multilabel_classification with sparse output support #2773
jnothman wants to merge 8 commits into scikit-learn:master from
Conversation
Changed to MRG.

At the same time, you could add a

I had done so already, @arjoly, only I called it

It wasn't clear. I mean to have a sparse label indicator
I realise that's in store, but I consider it out of scope for this PR.
It's not clear that this is only related to the input.

What is the speed gain?
And this can certainly be vectorized. Sorry for not looking into this.
Oh, now I remember why. It's not so easily vectorised because each sample's words are generated from a different set of classes. I'm not sure if there's a nice way to do this.
Thanks for the comments; I've pushed some changes. I could vectorize sampling the number of classes and words, but that's relatively fast compared to the sampling of each word, so I doubt the reduced code clarity would be worth it.
Until we've sorted out exactly how vectorised we want this, expect tests to fail :) The word generation can be vectorised, given that we're drawing each word uniformly from the sample's classes (not that this is mentioned in the docstring): we can draw from the summed probabilities.
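A minimal sketch of that idea (names and shapes here are illustrative, not the PR's actual code): since each word is drawn uniformly from one of the sample's classes and then from that class's word distribution, the marginal distribution is the mean of the selected columns of `p_w_c`, so all of a sample's words can be drawn in a single call.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_features = 5, 20

# Hypothetical per-class word distributions: each column sums to 1.
p_w_c = rng.dirichlet(np.ones(n_features), size=n_classes).T

y = np.array([1, 3])   # classes assigned to one sample
n_words = 10           # number of words to draw for this sample

# Mixture of the sample's class distributions with equal weights:
# summing the selected columns and renormalising gives the marginal
# word distribution, so all n_words draws happen in one call.
p_w_sample = p_w_c[:, y].sum(axis=1)
p_w_sample /= p_w_sample.sum()
words = rng.choice(n_features, size=n_words, p=p_w_sample)
```

(`RandomState.choice` needs NumPy >= 1.7, which is why a backport comes up in this thread.)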
So the latest version is more vectorized, but is not a substantial improvement in terms of speed, IMO. I think the per-feature expense of multinomial was the big issue; now we're just doing unnecessary tweaking. Which version do you think is better, @arjoly?
I am not a fan of this, unless there is a real gain.
If the speed gain is small, I would simply write the second alternative inline in the loop.
Another option is to backport `choice`, as in https://github.com/scikit-learn/scikit-learn/pull/2638/files
I think it's fine to keep this here until we bump our minimum NumPy version. And given that we're iterating through each sample in Python anyway, and this doesn't need to be a super-fast function, we might as well not backport it.
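For reference, the essence of such a backport is inverse-CDF sampling. A rough sketch (the function name and signature are made up here, not the ones in PR #2638):

```python
import numpy as np

def sample_with_probabilities(n, size, p, random_state):
    """Draw `size` integers from range(n) with probabilities `p`.

    A stand-in for RandomState.choice (NumPy >= 1.7), implemented
    as inverse-CDF sampling with searchsorted.
    """
    cdf = np.cumsum(p)
    cdf /= cdf[-1]  # guard against floating-point drift in the sum
    return np.searchsorted(cdf, random_state.uniform(size=size))

rng = np.random.RandomState(42)
draws = sample_with_probabilities(3, 10000, np.array([0.1, 0.2, 0.7]), rng)
```

Over many draws, the empirical frequencies approach `p`.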
I think the last version is cleaner. Thanks! Can you run a benchmark with large n_samples, n_features and n_classes, e.g. n_samples=10 ** 4, n_features=10 ** 6, n_classes=10 ** 3?
A small benchmark script:

```python
import argparse

import numpy as np

from sklearn.datasets import make_multilabel_classification

parser = argparse.ArgumentParser(description='Benchmark')
parser.add_argument('-s', '--seed', type=int, default=None, nargs="?",
                    help="Random number generator seed")
parser.add_argument('-n', '--n_samples', type=int, default=10 ** 2, nargs="?")
parser.add_argument('-p', '--n_features', type=int, default=10 ** 3, nargs="?")
parser.add_argument('-d', '--n_classes', type=int, default=2 * 10 ** 2, nargs="?")
args = vars(parser.parse_args())
print(args)

make_multilabel_classification(n_samples=args["n_samples"],
                               n_features=args["n_features"],
                               n_classes=args["n_classes"],
                               random_state=args["seed"])
```

With the current branch, that's pretty good :-) Now let's check what remains to optimize: the bottleneck is
I have compared the two sample_classes implementations, and the one without choice seems to be faster.
with inline choice: with inline old sampling for y
It's possible to gain some more CPU cycles with
Replacing this line by `p_w_sample = p_w_c.take(y, axis=1).sum(axis=1)` leads to a significant speed-up.
And no doubt a little more if mode='clip'
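To illustrate the equivalence (shapes here are made up for the example; `p_w_c` is (n_features, n_classes) and `y` holds one sample's class indices): fancy indexing, plain `take`, and `take` with `mode='clip'` all compute the same column sum, but `take` avoids some of fancy indexing's overhead, and `mode='clip'` additionally skips the negative-index wrapping checks.

```python
import numpy as np

rng = np.random.RandomState(0)
p_w_c = rng.rand(1000, 50)      # (n_features, n_classes), values illustrative
y = np.array([3, 17, 42])       # one sample's class indices

a = p_w_c[:, y].sum(axis=1)                          # fancy indexing
b = p_w_c.take(y, axis=1).sum(axis=1)                # take: same result
c = p_w_c.take(y, axis=1, mode='clip').sum(axis=1)   # skips index-wrap checks

assert np.allclose(a, b) and np.allclose(b, c)
```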
The last line-by-line benchmark gave: there isn't much left to optimize.
With master: With the last bit of optimization, I got: Thus, we have a (231.37 + 1.62) / (7.38 + 0.63) ≈ 29× speed-up! Great, thanks :-)
But you're making far too much effort to optimise something that doesn't need to be super-fast. I'd prefer readable code here over any optimisation -- as long as the time is reasonable, which the version at master is not for high n_features. Please don't profile this function; it's absolutely the wrong place to put effort into small gains.
Ok, then can you benchmark with

I agree. I prefer to focus on bottlenecks by line profiling.
If you prefer ecc5876, then revert to this state; we'll continue the discussion from that point. I would keep the bits of optimization that aren't intrusive (preserve code readability) and maybe inline the nested function.
I wanted sparse output from `make_multilabel_classification`, but in setting `n_features` high, I discovered quite how inefficient `generator.multinomial(1, ...).argmax()` is.
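The inefficiency is easy to see: each word costs a full length-`n_features` multinomial draw plus an argmax, where a single inverse-CDF draw suffices. A rough sketch of the two approaches (equivalent in distribution; the cumsum can even be hoisted out of any per-word loop; the uniform word distribution is only illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features = 10 ** 4
p = np.full(n_features, 1.0 / n_features)  # illustrative word distribution

# master's approach: allocates and fills an n_features-long
# count vector for every single word drawn
word_slow = rng.multinomial(1, p).argmax()

# single draw via inverse CDF: one O(n_features) cumsum up front,
# then O(log n_features) per word
cdf = np.cumsum(p)
word_fast = np.searchsorted(cdf, rng.uniform())
```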