Skip to content

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

Merged
charris merged 7 commits intonumpy:mainfrom
r-devulap:argsort-avx
May 18, 2023
Merged

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707
charris merged 7 commits intonumpy:mainfrom
r-devulap:argsort-avx

Conversation

@r-devulap
Copy link
Copy Markdown
Member

Leverages latest optimizations to x86-simd-sort which provides AVX-512 routines for argsort. Algorithm details:

  • Broadly the same algorithm that was used to vectorize quicksort. Instead of using load instructions to load array into registers, this uses gather instructions to load the array elements via the index array. This allows sorting only the indices without modifying the original array or having to make a copy of it.
  • O(1) space
  • Up to 5x speed up on random arrays.
  • See PR: Add AVX-512 argsort for 32 and 64-bit data type x86-simd-sort#34 for all the details of the algorithm.

@r-devulap
Copy link
Copy Markdown
Member Author

PR numpy/x86-simd-sort#38 should resolve some of the test failures.

@charris
Copy link
Copy Markdown
Member

charris commented May 15, 2023

Needs rebase.

@charris
Copy link
Copy Markdown
Member

charris commented May 16, 2023

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

@charris
Copy link
Copy Markdown
Member

charris commented May 16, 2023

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

@r-devulap
Copy link
Copy Markdown
Member Author

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

I should stick to using npy_intp. No reason to use intptr_t.

@r-devulap
Copy link
Copy Markdown
Member Author

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

Yeah, I should have expected that. My AVX512 argsort implementations use i64gather instructions which means I can only use it when sizeof(npy_intp) is 8 bytes. 1d29c11 disables avx512_argsort in 32-bit mode.

@r-devulap
Copy link
Copy Markdown
Member Author

Benchmark numbers:

      before           after         ratio
     [921a97e6]       [c03369a1]
     <main>           <argsort-avx>
+      63.9±0.1μs       98.6±0.4μs     1.54  bench_function_base.Sort.time_argsort('quick', 'uint32', ('ordered',))
+     73.5±0.02μs        110±0.2μs     1.50  bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))
+     75.2±0.08μs       98.6±0.4μs     1.31  bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))
+      93.1±0.1μs        106±0.2μs     1.14  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
+      92.5±0.7μs        104±0.3μs     1.12  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
-       109±0.7μs        100±0.6μs     0.92  bench_function_base.Sort.time_argsort('quick', 'int32', ('reversed',))
-       149±0.5μs        106±0.2μs     0.71  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
-       155±0.1μs        108±0.5μs     0.70  bench_function_base.Sort.time_argsort('quick', 'float32', ('reversed',))
-       341±0.4μs        105±0.4μs     0.31  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 1000))
-       389±0.3μs        119±0.5μs     0.31  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-       414±0.2μs        114±0.2μs     0.28  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 1000))
-       395±0.4μs        105±0.3μs     0.27  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 1000))
-       439±0.3μs        113±0.2μs     0.26  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 1000))
-       446±0.3μs        111±0.2μs     0.25  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 100))
-       511±0.3μs        127±0.4μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))
-       500±0.6μs        124±0.5μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))
-       457±0.4μs        109±0.3μs     0.24  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 10))
-       536±0.2μs        119±0.1μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 100))
-         504±1μs        112±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 100))
-       506±0.3μs        109±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 10))
-         561±1μs        121±0.3μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 10))
-       600±0.5μs        125±0.2μs     0.21  bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))
-       569±0.8μs        118±0.8μs     0.21  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 100))
-       545±0.2μs        110±0.3μs     0.20  bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))
-       586±0.5μs        117±0.2μs     0.20  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 10))
-       598±0.9μs        111±0.6μs     0.19  bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))
-         658±2μs        119±0.2μs     0.18  bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))
-       698±0.3μs        118±0.2μs     0.17  bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))
-     94.2±0.07μs      10.3±0.07μs     0.11  bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))
-      99.1±0.2μs      10.4±0.06μs     0.10  bench_function_base.Sort.time_argsort('quick', 'int32', ('uniform',))
-      111±0.05μs      10.5±0.04μs     0.09  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
-       177±0.2μs      11.4±0.02μs     0.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('uniform',))
-       220±0.2μs      11.2±0.03μs     0.05  bench_function_base.Sort.time_argsort('quick', 'float32', ('uniform',))

@charris charris merged commit f2db090 into numpy:main May 18, 2023
@charris
Copy link
Copy Markdown
Member

charris commented May 18, 2023

Let's give this a shot. Thanks @r-devulap .

@r-devulap
Copy link
Copy Markdown
Member Author

Do you think this needs a release note? I would like to combine release notes for #22315 and this, if that is okay.

@charris
Copy link
Copy Markdown
Member

charris commented May 18, 2023

I was thinking a release note would be nice, thanks for offering :) The release notes are linked to the PR, so it is probably simpler to just make two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants