[MRG] ENH fast make_multilabel classification with sparse output support#2773

Closed
jnothman wants to merge 8 commits into scikit-learn:master from jnothman:make_multi_fast

Conversation

@jnothman
Member

I wanted sparse output from make_multilabel_classification, but when setting n_features high I discovered just how inefficient generator.multinomial(1, ...).argmax() is.
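To illustrate the inefficiency, here is a minimal sketch (not the PR's code; sizes are arbitrary): drawing one category via `multinomial(1, p).argmax()` materialises and scans a length-n_classes array per draw, while inverting the CDF with `searchsorted` performs the same draw with a single binary search.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes = 1000
p = rng.rand(n_classes)
p /= p.sum()

# Slow: each multinomial(1, p) draw allocates a length-n_classes array
# just to recover a single category index via argmax.
slow_draw = rng.multinomial(1, p).argmax()

# Fast equivalent: invert the CDF -- draw a uniform variate and locate
# it in the cumulative probabilities with binary search.
cumulative = np.cumsum(p)
fast_draw = np.searchsorted(cumulative, rng.rand())
```

Both draws are distributed according to p; the searchsorted form also vectorises naturally (pass `rng.rand(k)` to draw k categories at once).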

@coveralls

Coverage Status

Coverage remained the same when pulling e388305 on jnothman:make_multi_fast into a36c72a on scikit-learn:master.

@jnothman
Member Author

Changed to MRG.

@coveralls

Coverage Status

Coverage remained the same when pulling 49e9069 on jnothman:make_multi_fast into a36c72a on scikit-learn:master.

@arjoly
Member

arjoly commented Jan 21, 2014

At the same time, you could add a sparse_output option.

@jnothman
Member Author

I had done so already, @arjoly, only I called it sparse (cf. DictVectorizer). There's no precedent I can find for sparse_output, though dense_output has been used (cf. SparseRandomProjection, safe_sparse_dot).

@arjoly
Member

arjoly commented Jan 21, 2014

At the same time, you could add a sparse_output option.

It wasn't clear. I mean to have a sparse label indicator y.

@jnothman
Member Author

It wasn't clear. I mean to have a sparse label indicator y.

I realise that's in store, but I consider it out of scope for this PR. And
perhaps return_indicator='sparse' is how it will be named.
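As a sketch of what a sparse label indicator y could look like, one can assemble a CSR matrix directly from per-sample label lists (the label sets and shape below are illustrative, not the PR's code):

```python
import numpy as np
import scipy.sparse as sp

# hypothetical label sets, one list of class indices per sample
Y = [[0, 2], [1], [], [2, 3]]
n_classes = 4

# build a CSR indicator matrix: one row per sample, one column per class
indices = np.concatenate([np.asarray(y, dtype=np.intp) for y in Y])
indptr = np.cumsum([0] + [len(y) for y in Y])
data = np.ones(len(indices))
Y_indicator = sp.csr_matrix((data, indices, indptr),
                            shape=(len(Y), n_classes))
```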


Member

It's not clear that this is only related to the input.

@arjoly
Member

arjoly commented Jan 24, 2014

What is the speed gain?

Member Author

And this can certainly be vectorized. Sorry for not looking into this.

Member Author

Oh, now I remember why. It's not so easily vectorised because each is generated by a different class. I'm not sure if there's a nice way to do this.

@jnothman
Member Author

Thanks for the comments; I've pushed some changes.

I could vectorize sampling the number of classes and words, but that's relatively fast compared to the sampling of each word, so I doubt the reduced code clarity would be worth it.

@jnothman
Member Author

Until we've sorted out exactly how vectorised we want this, expect tests to fail :)

The word generation can be vectorised, given that we're drawing each word uniformly from the sample's classes (not that this is mentioned in the docstring): we can draw from the summed probabilities.
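That vectorised word draw can be sketched as follows (p_w_c, the label set y and the document length k are illustrative stand-ins for the function's internals): the selected classes' word distributions are summed into one mixture, and all k words are drawn in a single searchsorted call.

```python
import numpy as np

rng = np.random.RandomState(42)
n_features, n_classes = 50, 5

# per-class word distributions; each column sums to 1
p_w_c = rng.rand(n_features, n_classes)
p_w_c /= p_w_c.sum(axis=0)

y = [1, 3]   # classes this sample belongs to (illustrative)
k = 20       # document length (illustrative)

# mix the selected classes' word distributions, then draw all k words
# at once by inverting the mixture CDF
p_w_sample = p_w_c[:, y].sum(axis=1)
p_w_sample /= p_w_sample.sum()
words = np.searchsorted(np.cumsum(p_w_sample), rng.rand(k))
```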

@jnothman
Member Author

So the latest version is more vectorized, but in my opinion it is not a substantial speed improvement. The per-feature expense of multinomial was the real issue; now we're just doing unnecessary tweaking.

Which version do you think is better, @arjoly?

Member

I am not a fan of this, unless there is a real gain.

Member

If the speed gain is small, I would plainly write the second alternative in the loop.

Member

Another option is to backport choice as in https://github.com/scikit-learn/scikit-learn/pull/2638/files

Member Author

I think it's fine to keep here until we raise our minimum supported numpy version. And given that we're iterating through each sample in Python, and this doesn't need to be a super-fast function, we might as well not backport it.
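For reference, the no-choice fallback amounts to rejection sampling against the cumulative class probabilities. A self-contained sketch (names follow the branch's code; sizes are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes = 100
p_c = rng.rand(n_classes)
p_c /= p_c.sum()
cumulative_p_c = np.cumsum(p_c)

def sample_classes(n):
    # Draw n distinct classes with probabilities p_c by rejection
    # sampling, avoiding rng.choice (only available in numpy >= 1.7).
    y = set()
    while len(y) != n:
        # draw however many classes are still missing, keep the distinct ones
        y.update(np.searchsorted(cumulative_p_c, rng.rand(n - len(y))))
    return sorted(y)

labels = sample_classes(3)
```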

@arjoly
Member

arjoly commented Jan 25, 2014

Which version do you think is better, @arjoly?

I think the last version is cleaner. Thanks!

Can you run some benchmarks with large n_samples, n_features and n_classes, e.g. n_samples=10 ** 4, n_features=10 ** 6, n_classes=10 ** 3?

@arjoly
Member

arjoly commented Jan 26, 2014

A small benchmark script

import argparse
import numpy as np

from sklearn.datasets import make_multilabel_classification

parser = argparse.ArgumentParser(description='Benchmark')
parser.add_argument('-s', '--seed', type=int, default=None, nargs="?",
                    help="Random number generator seed")
parser.add_argument('-n', '--n_samples', type=int, default=10 ** 2, nargs="?")
parser.add_argument('-p', '--n_features', type=int, default=10 ** 3, nargs="?")
parser.add_argument('-d', '--n_classes', type=int, default=2 * 10 ** 2, nargs="?")
args = vars(parser.parse_args())
print(args)

make_multilabel_classification(n_samples=args["n_samples"],
    n_features=args["n_features"], n_classes=args["n_classes"],
    random_state=args["seed"])
With master,

(sklearn) ± time python benchmarks/bench_make_multilabel_classification.py -n 1000 -s 0
{'n_features': 1000, 'seed': 0, 'n_classes': 200, 'n_samples': 1000}
python benchmarks/bench_make_multilabel_classification.py -n 1000 -s 0  2.08s user 0.08s system 99% cpu 2.166 total

With the current branch,

(sklearn) ± time python benchmarks/bench_make_multilabel_classification.py -n 1000 -s 0
{'n_features': 1000, 'seed': 0, 'n_classes': 200, 'n_samples': 1000}
python benchmarks/bench_make_multilabel_classification.py -n 1000 -s 0  0.51s user 0.08s system 97% cpu 0.604 total

Thus that's pretty good :-) Now let's check what remains to optimize

(sklearn) ± kernprof.py -vl benchmarks/bench_make_multilabel_classification.py -n 1000 -p 1000 -d 1000 -s 0
{'n_features': 1000, 'seed': 0, 'n_classes': 1000, 'n_samples': 1000}
Wrote profile results to bench_make_multilabel_classification.py.lprof
Timer unit: 1e-06 s

File: /Users/ajoly/git/scikit-learn/sklearn/datasets/samples_generator.py
Function: make_multilabel_classification at line 243
Total time: 0.36192 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   243                                           @profile
   244                                           def make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,
   245                                                                              n_labels=2, length=50, allow_unlabeled=True,
   246                                                                              sparse=False, return_indicator=False,
   247                                                                              random_state=None):
   248                                               """Generate a random multilabel classification problem.
   249                                           
   250                                               For each sample, the generative process is:
   251                                                   - pick the number of labels: n ~ Poisson(n_labels)
   252                                                   - n times, choose a class c: c ~ Multinomial(theta)
   253                                                   - pick the document length: k ~ Poisson(length)
   254                                                   - k times, choose a word: w ~ Multinomial(theta_c)
   255                                           
   256                                               In the above process, rejection sampling is used to make sure that
   257                                               n is never zero or more than `n_classes`, and that the document length
   258                                               is never zero. Likewise, we reject classes which have already been chosen.
   259                                           
   260                                               Parameters
   261                                               ----------
   262                                               n_samples : int, optional (default=100)
   263                                                   The number of samples.
   264                                           
   265                                               n_features : int, optional (default=20)
   266                                                   The total number of features.
   267                                           
   268                                               n_classes : int, optional (default=5)
   269                                                   The number of classes of the classification problem.
   270                                           
   271                                               n_labels : int, optional (default=2)
   272                                                   The average number of labels per instance. Number of labels follows
   273                                                   a Poisson distribution that never takes the value 0.
   274                                           
   275                                               length : int, optional (default=50)
   276                                                   Sum of the features (number of words if documents).
   277                                           
   278                                               allow_unlabeled : bool, optional (default=True)
   279                                                   If ``True``, some instances might not belong to any class.
   280                                           
   281                                               sparse : bool, optional (default=False)
   282                                                   If ``True``, return a sparse feature matrix
   283                                           
   284                                               return_indicator : bool, optional (default=False),
   285                                                   If ``True``, return ``Y`` in the binary indicator format, else
   286                                                   return a tuple of lists of labels.
   287                                           
   288                                               random_state : int, RandomState instance or None, optional (default=None)
   289                                                   If int, random_state is the seed used by the random number generator;
   290                                                   If RandomState instance, random_state is the random number generator;
   291                                                   If None, the random number generator is the RandomState instance used
   292                                                   by `np.random`.
   293                                           
   294                                               Returns
   295                                               -------
   296                                               X : array or sparse CSR matrix of shape [n_samples, n_features]
   297                                                   The generated samples.
   298                                           
   299                                               Y : tuple of lists or array of shape [n_samples, n_classes]
   300                                                   The label sets.
   301                                           
   302                                               """
   303         1            7      7.0      0.0      generator = check_random_state(random_state)
   304         1           31     31.0      0.0      p_c = generator.rand(n_classes)
   305         1           56     56.0      0.0      p_c /= p_c.sum()
   306         1        13482  13482.0      3.7      p_w_c = generator.rand(n_features, n_classes)
   307         1         5349   5349.0      1.5      p_w_c /= np.sum(p_w_c, axis=0)
   308                                           
   309         1            8      8.0      0.0      if hasattr(generator, 'choice'):
   310                                                   # available in numpy >=1.7
   311         1            5      5.0      0.0          def sample_classes(n):
   312                                                       return generator.choice(n_classes, n, replace=False, p=p_c)
   313                                               else:
   314                                                   cumulative_p_c = np.cumsum(p_c)
   315                                           
   316                                                   def sample_classes(n):
   317                                                       y = set()
   318                                                       while len(y) != n:
   319                                                           # pick a class with probability P(c)
   320                                                           c = np.searchsorted(cumulative_p_c, generator.rand())
   321                                                           y.add(c)
   322                                           
   323                                               # pick a (nonzero) number of labels per document by rejection sampling
   324         1          138    138.0      0.0      sample_n_labels = generator.poisson(n_labels, size=n_samples)
   325         1            2      2.0      0.0      while ((not allow_unlabeled and 0 in sample_n_labels) or
   326         1           56     56.0      0.0             np.max(sample_n_labels) > n_classes):
   327                                                   mask = sample_n_labels > n_classes
   328                                                   if not allow_unlabeled:
   329                                                       mask = np.logical_or(mask, sample_n_labels == 0, out=mask)
   330                                                   sample_n_labels[mask] = generator.poisson(n_labels, size=mask.sum())
   331                                           
   332                                               # pick a non-zero length per document by rejection sampling
   333         1          137    137.0      0.0      sample_length = generator.poisson(length, size=n_samples)
   334         1           31     31.0      0.0      while 0 in sample_length:
   335                                                   mask = sample_length == 0
   336                                                   sample_length[mask] = generator.poisson(length, size=mask.sum())
   337                                           
   338                                               # generate the samples
   339         1            6      6.0      0.0      X_indices = array.array('i')
   340         1           11     11.0      0.0      X_indptr = array.array('i', [0])
   341         1            3      3.0      0.0      Y = []
   342      1001         1749      1.7      0.5      for i in range(n_samples):
   343      1000       168834    168.8     46.6          y = sample_classes(sample_n_labels[i])
   344                                           
   345      1000         2268      2.3      0.6          if len(y) == 0:
   346                                                       # if sample does not belong to any class, generate noise words
   347       134          772      5.8      0.2              words = generator.randint(n_features, size=sample_length[i])
   348                                                   else:
   349                                                       # sample words without replacement from selected classes
   350       866        95028    109.7     26.3              p_w_sample = p_w_c[:, y].sum(axis=1)
   351       866        15596     18.0      4.3              p_w_sample /= p_w_sample.sum()
   352       866         7598      8.8      2.1              words = np.searchsorted(np.cumsum(p_w_sample),
   353       866        11485     13.3      3.2                                      generator.rand(sample_length[i]))
   354                                           
   355      1000        19908     19.9      5.5          X_indices.extend(words)
   356      1000         2381      2.4      0.7          X_indptr.append(len(X_indices))
   357      1000         4717      4.7      1.3          Y.append(list(y))
   358         1          266    266.0      0.1      X_data = np.ones(len(X_indices), dtype=np.float64)
   359         1            3      3.0      0.0      X = sp.csr_matrix((X_data, X_indices, X_indptr),
   360         1         5918   5918.0      1.6                        shape=(n_samples, n_features))
   361         1         1634   1634.0      0.5      X.sum_duplicates()
   362         1            2      2.0      0.0      if not sparse:
   363         1         4421   4421.0      1.2          X = X.toarray()
   364                                           
   365         1            2      2.0      0.0      if return_indicator:
   366                                                   lb = LabelBinarizer()
   367                                                   Y = lb.fit([range(n_classes)]).transform(Y)
   368                                               else:
   369         1           14     14.0      0.0          Y = tuple(Y)
   370                                           
   371         1            2      2.0      0.0      return X, Y

Thus the bottleneck is y = sample_classes(sample_n_labels[i]).

@arjoly
Member

arjoly commented Jan 26, 2014

I have compared the two sample_classes implementations, and the one without choice seems to be faster.

@arjoly
Member

arjoly commented Jan 26, 2014

with inline choice:

{'n_features': 1000, 'seed': 0, 'n_classes': 1000, 'n_samples': 1000}
python benchmarks/bench_make_multilabel_classification.py -n 1000 -p 1000 -d   0.58s user 0.08s system 99% cpu 0.658 total

with inline old sampling for y

(sklearn) ±  time  python benchmarks/bench_make_multilabel_classification.py -n 1000 -p 1000 -d 1000 -s 0
{'n_features': 1000, 'seed': 0, 'n_classes': 1000, 'n_samples': 1000}
python benchmarks/bench_make_multilabel_classification.py -n 1000 -p 1000 -d   0.45s user 0.08s system 99% cpu 0.532 total

@arjoly
Member

arjoly commented Jan 26, 2014

It's possible to gain some more CPU cycles with

        y = set()
        while len(y) != sample_n_labels[i]:
            # pick a class with probability P(c)
            c = np.searchsorted(cumulative_p_c, generator.rand(sample_n_labels[i] - len(y)))
            y.update(c)

Member

Replacing this line by

            p_w_sample = p_w_c.take(y, axis=1).sum(axis=1)

leads to a significant speed up.
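A quick sketch showing that the take-based selection computes the same result as fancy indexing (shapes here are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
p_w_c = rng.rand(1000, 200)
y = [3, 17, 42]

# take() selects the same columns as p_w_c[:, y] but skips some of the
# overhead of general fancy indexing, which matters in a tight loop
via_take = p_w_c.take(y, axis=1).sum(axis=1)
via_fancy = p_w_c[:, y].sum(axis=1)
```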

Member Author

And no doubt a little more if mode='clip' is used.

@arjoly
Member

arjoly commented Jan 26, 2014

The last line by line benchmark gave

(sklearn) ± kernprof.py -vl benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -d 10000 -s 0
{'n_features': 10000, 'seed': 0, 'n_classes': 10000, 'n_samples': 10000}
Wrote profile results to bench_make_multilabel_classification.py.lprof
Timer unit: 1e-06 s

File: /Users/ajoly/git/scikit-learn/sklearn/datasets/samples_generator.py
Function: make_multilabel_classification at line 243
Total time: 7.85556 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   243                                           @profile
   244                                           def make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,
   245                                                                              n_labels=2, length=50, allow_unlabeled=True,
   246                                                                              sparse=False, return_indicator=False,
   247                                                                              random_state=None):
   248                                               """Generate a random multilabel classification problem.
   249                                           
   250                                               For each sample, the generative process is:
   251                                                   - pick the number of labels: n ~ Poisson(n_labels)
   252                                                   - n times, choose a class c: c ~ Multinomial(theta)
   253                                                   - pick the document length: k ~ Poisson(length)
   254                                                   - k times, choose a word: w ~ Multinomial(theta_c)
   255                                           
   256                                               In the above process, rejection sampling is used to make sure that
   257                                               n is never zero or more than `n_classes`, and that the document length
   258                                               is never zero. Likewise, we reject classes which have already been chosen.
   259                                           
   260                                               Parameters
   261                                               ----------
   262                                               n_samples : int, optional (default=100)
   263                                                   The number of samples.
   264                                           
   265                                               n_features : int, optional (default=20)
   266                                                   The total number of features.
   267                                           
   268                                               n_classes : int, optional (default=5)
   269                                                   The number of classes of the classification problem.
   270                                           
   271                                               n_labels : int, optional (default=2)
   272                                                   The average number of labels per instance. Number of labels follows
   273                                                   a Poisson distribution that never takes the value 0.
   274                                           
   275                                               length : int, optional (default=50)
   276                                                   Sum of the features (number of words if documents).
   277                                           
   278                                               allow_unlabeled : bool, optional (default=True)
   279                                                   If ``True``, some instances might not belong to any class.
   280                                           
   281                                               sparse : bool, optional (default=False)
   282                                                   If ``True``, return a sparse feature matrix
   283                                           
   284                                               return_indicator : bool, optional (default=False),
   285                                                   If ``True``, return ``Y`` in the binary indicator format, else
   286                                                   return a tuple of lists of labels.
   287                                           
   288                                               random_state : int, RandomState instance or None, optional (default=None)
   289                                                   If int, random_state is the seed used by the random number generator;
   290                                                   If RandomState instance, random_state is the random number generator;
   291                                                   If None, the random number generator is the RandomState instance used
   292                                                   by `np.random`.
   293                                           
   294                                               Returns
   295                                               -------
   296                                               X : array or sparse CSR matrix of shape [n_samples, n_features]
   297                                                   The generated samples.
   298                                           
   299                                               Y : tuple of lists or array of shape [n_samples, n_classes]
   300                                                   The label sets.
   301                                           
   302                                               """
   303         1            7      7.0      0.0      generator = check_random_state(random_state)
   304         1          131    131.0      0.0      p_c = generator.rand(n_classes)
   305         1          103    103.0      0.0      p_c /= p_c.sum()
   306         1      1376641 1376641.0     17.5      p_w_c = generator.rand(n_features, n_classes)
   307         1       470986 470986.0      6.0      p_w_c /= np.sum(p_w_c, axis=0)
   308         1          134    134.0      0.0      cumulative_p_c = np.cumsum(p_c)
   309                                           
   310                                               # pick a (nonzero) number of labels per document by rejection sampling
   311         1          651    651.0      0.0      sample_n_labels = generator.poisson(n_labels, size=n_samples)
   312         1            2      2.0      0.0      while ((not allow_unlabeled and 0 in sample_n_labels) or
   313         1           49     49.0      0.0             np.max(sample_n_labels) > n_classes):
   314                                                   mask = sample_n_labels > n_classes
   315                                                   if not allow_unlabeled:
   316                                                       mask = np.logical_or(mask, sample_n_labels == 0, out=mask)
   317                                                   sample_n_labels[mask] = generator.poisson(n_labels, size=mask.sum())
   318                                           
   319                                               # pick a non-zero length per document by rejection sampling
   320         1          894    894.0      0.0      sample_length = generator.poisson(length, size=n_samples)
   321         1           55     55.0      0.0      while 0 in sample_length:
   322                                                   mask = sample_length == 0
   323                                                   sample_length[mask] = generator.poisson(length, size=mask.sum())
   324                                           
   325                                               # generate the samples
   326         1            5      5.0      0.0      X_indices = array.array('i')
   327         1            7      7.0      0.0      X_indptr = array.array('i', [0])
   328         1            1      1.0      0.0      Y = []
   329     10001        13042      1.3      0.2      for i in range(n_samples):
   330                                                   # y = generator.choice(n_classes, sample_n_labels[i], replace=False, p=p_c)
   331                                           
   332     10000        17350      1.7      0.2          y = set()
   333     18656        65408      3.5      0.8          while len(y) != sample_n_labels[i]:
   334                                                       # pick a class with probability P(c)
   335                                           
   336      8656        81903      9.5      1.0              c = np.searchsorted(cumulative_p_c, generator.rand(sample_n_labels[i] - len(y)))
   337      8656        34286      4.0      0.4              y.update(c)
   338                                           
   339     10000        22913      2.3      0.3          y = list(y)
   340                                           
   341     10000        13276      1.3      0.2          if len(y) == 0:
   342                                                       # if sample does not belong to any class, generate noise words
   343      1346         8843      6.6      0.1              words = generator.randint(n_features, size=sample_length[i])
   344                                                   else:
   345                                                       # sample words without replacement from selected classes
   346      8654      4099816    473.7     52.2              p_w_sample = p_w_c.take(y, axis=1).sum(axis=1)
   347      8654       550816     63.6      7.0              p_w_sample /= p_w_sample.sum()
   348      8654       282057     32.6      3.6              words = np.searchsorted(np.cumsum(p_w_sample),
   349      8654       137625     15.9      1.8                                      generator.rand(sample_length[i]))
   350                                           
   351     10000       176316     17.6      2.2          X_indices.extend(words)
   352     10000        19855      2.0      0.3          X_indptr.append(len(X_indices))
   353     10000        22029      2.2      0.3          Y.append(list(y))
   354         1         1849   1849.0      0.0      X_data = np.ones(len(X_indices), dtype=np.float64)
   355         1            2      2.0      0.0      X = sp.csr_matrix((X_data, X_indices, X_indptr),
   356         1        56252  56252.0      0.7                        shape=(n_samples, n_features))
   357         1        15413  15413.0      0.2      X.sum_duplicates()
   358         1            2      2.0      0.0      if not sparse:
   359         1       386668 386668.0      4.9          X = X.toarray()
   360                                           
   361         1            3      3.0      0.0      if return_indicator:
   362                                                   lb = LabelBinarizer()
   363                                                   Y = lb.fit([range(n_classes)]).transform(Y)
   364                                               else:
   365         1          165    165.0      0.0          Y = tuple(Y)
   366                                           
   367         1            3      3.0      0.0      return X, Y

There isn't much to optimize.

@arjoly
Member

arjoly commented Jan 26, 2014

With master

(sklearn) ± time python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -d 10000 -s 0
{'n_features': 10000, 'seed': 0, 'n_classes': 10000, 'n_samples': 10000}
python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -  231.37s user 1.62s system 99% cpu 3:53.46 total

With the last bit of optimization, I got

(sklearn) ± time python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -d 10000 -s 0
{'n_features': 10000, 'seed': 0, 'n_classes': 10000, 'n_samples': 10000}
python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -  7.38s user 0.63s system 99% cpu 8.013 total

Thus we have a (231.37 + 1.62) / (7.38 + 0.63) ≈ 29× speedup!

Great thanks :-)

@jnothman
Member Author

But you're making far too much effort to optimise something that doesn't need to be super-fast. I'd prefer readable code here over any optimisation, as long as the time is reasonable, which the version at master is not for high n_features.

Please, don't profile this function. It's absolutely the wrong place to put effort into small gains.

@jnothman
Member Author

So I vote for returning to the structure of ecc5876, without choice. A little slower, but easier to tell what the function is doing. Is that acceptable by you, @arjoly?

@arjoly
Member

arjoly commented Jan 26, 2014

Please, don't profile this function. It's absolutely the wrong place to put effort into small gains.

Ok, then can you benchmark with time python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -d 10000 -s 0 and show that there is only a small gain in performing such optimisation?

I agree; I prefer to focus on bottlenecks found by line profiling.

@jnothman
Member Author

Ok, then can you benchmark with time python benchmarks/bench_make_multilabel_classification.py -n 10000 -p 10000 -d 10000 -s 0 and show that there is only a small gain in performing such optimisation?

No, I need nothing so robust to show that removing multinomial(1) is a good idea. It's many orders of magnitude slower than searchsorted/rand for many features.


@arjoly
Member

arjoly commented Jan 27, 2014

So I vote for returning to the structure of ecc5876, without choice. A little slower, but easier to tell what the function is doing. Is that acceptable by you, @arjoly?

If you prefer ecc5876, then revert to that state and we'll continue the discussion from there. I would keep the bits of optimization that aren't intrusive (that preserve code readability), and maybe inline the nested function.
