[MRG] remove warnings in univariate feature selection by larsmans · Pull Request #2369 · scikit-learn/scikit-learn

larsmans · 2013-08-19T09:29:00Z

These warnings are practically always triggered when doing text classification or any task with lots of boolean features. I suggest we just remove them, since in those cases the warning is so confusing that it does more harm than good.

ogrisel · 2013-08-19T09:53:57Z

+1

GaelVaroquaux · 2013-08-19T11:48:11Z

👍 for removal, as long as we use a stable, non-random sort. The reason is that I want to have 100% reproducibility. The default sort used by argsort is quicksort, which is not stable. Should we switch to a heapsort, which is stable, but has the drawback of requiring p/2 work space in memory? I think that the work space requirement is not too bad, as it is in O(p) and not O(n p).

larsmans · 2013-08-19T11:57:15Z

What is p in this formula? According to Wikipedia, heapsort should require O(1) auxiliary space (apart from the n indices allocated by argsort, of course).

GaelVaroquaux · 2013-08-19T11:58:59Z

What is p in this formula?

Number of features in the learning problem.

According to Wikipedia, heapsort should require O(1) auxiliary space

Correct, I made a mistake: I meant mergesort, not heapsort. Mergesort
is the only stable sort implemented in numpy.
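To make the stability property concrete, here is a minimal sketch (the `scores` values are made up for illustration and are not taken from the PR). With a stable sort, tied entries keep their original index order, so the result is fully determined by the input:

```python
import numpy as np

# Illustrative scores with ties: indices 0 and 2 share 1.0,
# indices 1 and 3 share 3.0.
scores = np.array([1.0, 3.0, 1.0, 3.0, 2.0])

# kind='mergesort' requests a stable sort, so tied entries appear
# in their original (index) order in the result.
order = np.argsort(scores, kind="mergesort")
print(order.tolist())  # [0, 2, 4, 1, 3]
```

With an unstable sort kind, the relative order of the tied indices is not guaranteed.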

larsmans · 2013-08-19T12:03:14Z

Actually there's a heapsort in NumPy master and it seems to have been there since the days of numarray. But since argsort takes linear space for its output anyway, I suggest we just take the fastest option. I'll profile a bit.

larsmans · 2013-08-19T12:30:17Z

Timings:

>>> scores = np.random.randn(10000000)
>>> %timeit np.argsort(scores, kind='quicksort')
1 loops, best of 3: 3.35 s per loop
>>> %timeit np.argsort(scores, kind='heapsort')
1 loops, best of 3: 18.1 s per loop
>>> %timeit np.argsort(scores, kind='mergesort')
1 loops, best of 3: 3.08 s per loop

Again, with fresh random numbers:

>>> scores = np.random.randn(10000000)
>>> %timeit np.argsort(scores, kind='quicksort')
1 loops, best of 3: 3.32 s per loop
>>> %timeit np.argsort(scores, kind='heapsort')
1 loops, best of 3: 17.9 s per loop
>>> %timeit np.argsort(scores, kind='mergesort')
1 loops, best of 3: 3.08 s per loop

Memory usage:

$ cat testsort.py 
import numpy as np
import sys

rng = np.random.RandomState(0xCAFE)

scores = rng.randn(10000000)
np.argsort(scores, kind=sys.argv[1])

$ /usr/bin/time python testsort.py quicksort
3.91user 0.10system 0:04.02elapsed 99%CPU (0avgtext+0avgdata 169604maxresident)k
0inputs+0outputs (0major+42856minor)pagefaults 0swaps

$ /usr/bin/time python testsort.py heapsort
18.70user 0.11system 0:18.84elapsed 99%CPU (0avgtext+0avgdata 169608maxresident)k
0inputs+0outputs (0major+42858minor)pagefaults 0swaps

$ /usr/bin/time python testsort.py mergesort
3.65user 0.14system 0:03.80elapsed 99%CPU (0avgtext+0avgdata 208676maxresident)k
0inputs+0outputs (0major+52624minor)pagefaults 0swaps

Without the argsort, the test script takes just over a second to generate these random numbers. So indeed, mergesort takes more memory (about 40MB extra for these 10 million scores, i.e. roughly 4MB per megafeature), but it can actually be faster than quicksort. Heapsort is dead slow.
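As a rough check of the memory figures, the arithmetic can be sketched from the `maxresident` values reported above. This assumes `/usr/bin/time` reports kilobytes of 1024 bytes and that index entries are 8 bytes wide. The p/2 workspace figure is the one mentioned earlier in this thread, not something measured here:

```python
# maxresident values (in KB) copied from the /usr/bin/time output above.
quicksort_kb = 169604
mergesort_kb = 208676
n_scores = 10_000_000

# Extra resident memory used by the mergesort run.
extra_kb = mergesort_kb - quicksort_kb
extra_bytes_per_score = extra_kb * 1024 / n_scores

# Assumed model: a workspace of p/2 index entries of 8 bytes each.
workspace_bytes = (n_scores // 2) * 8

print(extra_kb)                         # 39072
print(round(extra_bytes_per_score, 2))  # 4.0
print(workspace_bytes)                  # 40000000
```

Under these assumptions the measured overhead (about 40 MB for 10 million scores, or about 4 bytes per score) is consistent with a p/2 workspace.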

GaelVaroquaux · 2013-08-19T12:33:49Z

So indeed, mergesort takes more memory (about 40MB extra for these 10 million scores, i.e. roughly 4MB per megafeature), but it can actually be faster than quicksort. Heapsort is really slow.

So let's use mergesort. I don't find the memory-usage numbers
unreasonable, and the behavior (stable sort) will be less surprising to
users.
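To illustrate why a stable sort is less surprising when scores are tied, here is a hedged sketch of top-k selection. `top_k_mask` is a hypothetical helper written for this example. It is not the code in scikit-learn's univariate feature selection.

```python
import numpy as np


def top_k_mask(scores, k):
    """Hypothetical helper: boolean mask of the k highest-scoring features."""
    scores = np.asarray(scores)
    mask = np.zeros(scores.shape, dtype=bool)
    if k > 0:
        # Stable ascending sort: among tied scores, lower indices come
        # first, so which tied features land in the last k positions is
        # fully determined by the input order.
        mask[np.argsort(scores, kind="mergesort")[-k:]] = True
    return mask


# Three features tie for the top score; only two can be kept.
scores = np.array([0.5, 2.0, 2.0, 2.0, 0.1])
print(top_k_mask(scores, 2).tolist())  # [False, False, True, True, False]
```

The same input always yields the same mask, which is the reproducibility asked for earlier in the thread.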

agramfort · 2013-08-19T12:46:43Z

while you're at it I'd also like to have a stable sort in StratifiedKFold :)

GaelVaroquaux · 2013-08-19T12:47:34Z

while you're at it I'd also like to have a stable sort in StratifiedKFold :)

PR welcomed :P

agramfort · 2013-08-19T12:50:16Z

PR welcomed :P

I've heard this before ;)
Seriously I'll do it when I'm back at work unless Lars beats me to it.

larsmans · 2013-08-19T12:57:46Z

Force-pushed a new version. Time to go back to the actual experiment I was performing, @agramfort stratified k-fold is yours :p

GaelVaroquaux · 2013-08-19T13:40:48Z

👍 for merge. Thanks!

ogrisel · 2013-08-19T14:57:30Z

I pushed the green button as travis was happy.

FIX remove warnings from univariate FS

b46ea71

These warnings are issued practically always when using frequency-valued or boolean data. Switched to a stable sort to get reproducible results.

ogrisel added a commit that referenced this pull request Aug 19, 2013

Merge pull request #2369 from larsmans/no-warnings-in-fs

9033baf

[MRG] remove warnings in univariate feature selection

ogrisel merged commit 9033baf into scikit-learn:master Aug 19, 2013

larsmans deleted the no-warnings-in-fs branch October 2, 2015 14:21
