ENH: Implement faster keyword argument parsing capable of METH_FASTCALL #15269

Merged
mattip merged 3 commits into numpy:main from seberg:splitup-faster-argparsing-infrastructure
Mar 18, 2021

Conversation

Member

@seberg seberg commented Jan 6, 2020

The first commit adds the infrastructure and macros necessary for the argument parsing. The second commit uses it for methods.c and for functions like ones, empty, and empty_like that are defined in C, as fairly simple examples. Some of the function macros are not used here.

This replaces gh-15099


I am not sure what would be a better approach, other than Argument Clinic. I may have misjudged it, but using it seems like a larger change, and I am not sure how simple it is to support different Python versions with it.

I am not quite sure that the addition in common/npy_argparse.? is the right approach for something like this?

There will be two follow-up PRs:

  1. Using this to remove the special cases/speed up np.array.
  2. A larger cleanup/use of this for ufuncs.

I will mark both as draft for now, and both will include these commits.

Comment on lines 103 to 104
Member

Not sure I understand this - do you mean must be passed positionally?

Or are you trying to describe "not keyword-only"?

Member Author

The latter, not keyword-only, which seems simpler than counting the number of keyword-only arguments. I will try to rephrase this if the PR seems to go anywhere.

Member

I'd recommend

Suggested change
- * @return Returns 1 on success and 0 on failure.
+ * @return Returns 0 on success and -1 on failure.

which is more typical for error reporting in Python; converter functions seem to be the exception, not the rule.

Member Author

I generally prefer -1. I had it like this only because it is closer to the code using PyArg_Parse*, but I am happy to change it.

seberg force-pushed the splitup-faster-argparsing-infrastructure branch 3 times, most recently from 10e700b to 7f9cf12, January 7, 2020 01:54
seberg added the "triage review" and "triaged" labels and removed the "triage review" label, Jan 15, 2020
Member Author

seberg commented Nov 11, 2020

It seems PyPy 3.7 now supports fastcall, so we could use that as an excuse to reactivate this PR (removing all kwarg dictionary paths). The somewhat ugly argument/parameter macros could then be removed!

Basically, I do think that this was a very straightforward (and not actually very large) PR; the difficult part of this project was mostly the ufunc maintenance (which is a later PR). The only real alternative I am aware of is using Argument Clinic, such as here: https://github.com/pitrou/pickle5-backport/tree/master/pickle5/clinic and https://github.com/pitrou/pickle5-backport/blob/master/BACKPORT-NOTES.txt which would require some tooling or vendoring of all clinic versions, which I am not too fond of.

About the ufuncs: that maintenance is absolutely necessary in large parts. We could push that through, but I need to touch the same code massively for ufuncs, and it will be hell to try to split out the "maintenance" from the rest, so those things can't be worked on in parallel. Either we do the argument parsing soon, or we give up on ufunc argument parsing until dtypes are finished. The first way may even be easier for me, but that can only work if we check it off at a reasonable speed. (I do remember that Marten even had a look at that old PR.)

To be clear: I mainly comment here to note that we could do this slightly nicer now. If anyone is happy to review this, I very much think this is worthwhile and I can pick this up at almost any time if we have the review bandwidth to push it through.

seberg removed the "triaged" label, Nov 11, 2020
Member

mattip commented Nov 12, 2020

There are merge conflicts. Could you run benchmarks to show this is justified?

Member Author

seberg commented Nov 12, 2020

I don't want to snipe you away from dtypes right now :). I do think it's worthwhile, even as a first step, to have the infrastructure, since it deletes some ugly fast paths, speeds up some things, and might also make optimizing __array_function__ a bit easier in the long run. But, of course, if we think this might turn out to be technical debt...

Some examples for changes (it requires a dirty rebase right now, unfortunately it is not trivial also due to the -like= argument):

import numpy as np

arr = np.arange(100)
np.asarray(arr)  # about 4x faster
np.mean(arr)  # about 15% faster
# Or cherry-pick
np.add.reduce(arr, axis=0, dtype=arr.dtype, where=True)  # ~40% faster

arr = np.arange(100.)  # double precision, is printed differently
repr(arr)  # 25% faster; arguably you could move more to C instead of optimizing the Python-C interface

So is that a lot? I don't know, most real-world code has many other larger overheads probably.

Basically, the reason why I did not have nice benchmarks, is that:

  • __array_function__ requires fixing as well to make it worthwhile. I wasn't sure that was very viable, so it only worked for the subset not covered by it. This is different now: having vectorcall should make it much easier to follow up with such speedups.

    Details
    arr = np.arange(100)
    %timeit np.concatenate((arr, arr), axis=0)
    
    master without __array_function__: 1.07 µs
    vectorcall without __array_function__:  826 ns 
    master with __array_function__: 1.29 µs
    vectorcall with __array_function__: 1.36 µs
    

    so there is no difference with __array_function__, but a good speedup without. I had some old notes that vectorcall support helps tease out a 20% improvement for __array_function__ in general.

  • I would love to measure "real world" impact rather than np.asarray and some reductions on their own, but I don't really know a good example. The 4x speedup for np.asarray is clear and asv has a run (although it probably underestimates the impact).
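For anyone who wants to reproduce per-call numbers like the ones above outside of IPython's %timeit, here is a minimal sketch using only the standard-library timeit module. The helper name best_ns_per_call and the repeat/number values are illustrative assumptions, not part of this PR; the NumPy statements are the ones quoted in this comment and only run if NumPy is importable.

```python
import timeit


def best_ns_per_call(stmt, namespace, repeat=5, number=5000):
    """Best observed time per execution of `stmt`, in nanoseconds.

    `best_ns_per_call` is a hypothetical helper for this sketch. Passing the
    call as a statement string avoids the extra Python-level wrapper call
    that a lambda would add to a measurement of call overhead.
    """
    times = timeit.repeat(stmt, globals=namespace, repeat=repeat, number=number)
    return min(times) / number * 1e9


try:
    import numpy as np
except ImportError:  # the helper itself does not need NumPy
    np = None

if np is not None:
    ns = {"np": np, "arr": np.arange(100)}
    for stmt in ("np.asarray(arr)",
                 "np.mean(arr)",
                 "np.concatenate((arr, arr), axis=0)"):
        print(f"{stmt:40s} {best_ns_per_call(stmt, ns):10.0f} ns")
```

Running this once on main and once on the branch gives a rough before/after comparison; taking the minimum over repeats reduces, but does not remove, the noise discussed here.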

Member

charris commented Nov 12, 2020

I would prefer to push this off to 1.21, there is already a ton of stuff in 1.20 that might bite us.

Member Author

seberg commented Nov 12, 2020

@charris sorry, I was never suggesting to rush this in before that. I just realized that we can clean up some of the uglier parts now, and I thought that was probably the main reason why it stalled.

Member Author

seberg commented Dec 6, 2020

Let's see what PyPy thinks. I actually updated this, and the speedups are the same (significant, e.g. for np.mean or np.median, 30-40% IIRC for small arrays; "small" goes up to 1000 elements or more for these, with caching).

EDIT: to be clear, this PR significantly speeds up printing arrays (due to the dragon4 argparsing), but the main speedups are probably np.asarray (which is 4 times faster) and reductions, and those are in the later PRs (which apparently got some commits mangled up; tests are failing right now).

I changed it to now look like this (example from a later PR):

    NPY_PREPARE_ARGPARSER;  // Translates to a static cache.

    if (npy_parse_arguments("empty", args, len_args, kwnames,
            "shape", &PyArray_IntpConverter, &shape,
            "|dtype", &PyArray_DescrConverter, &typecode,
            "|order", &PyArray_OrderConverter, &order,
            "$like", NULL, &like,
            NULL, NULL, NULL) < 0) {
        goto fail;
    }

using -1 as the return value for errors, and | and $ to indicate optional and keyword-only. I would like to remind everyone that we already have this type of (non-generic) code for ufunc.__call__, so rather than adding maintenance it reduces the amount of complex code we have by a lot. Of course that means we have to go through with the ufunc cleanup, but that would probably help me a lot with dtypes also, so I am biased towards just swallowing that pill even if nobody can review every line of code. This is interface code, so it is unlikely to be very subtly bugged.
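To make the name-prefix convention concrete, here is a rough pure-Python model of the binding rules. It is a sketch only, not the C implementation: parse_fastcall is a hypothetical name, converters and the static cache are left out, errors are raised as TypeError where the C function would return -1 with an exception set, and keyword-only entries are assumed to come last in the spec. The call layout (positional values followed by keyword values, plus a tuple of keyword names) mirrors what METH_FASTCALL | METH_KEYWORDS hands to a C function.

```python
def parse_fastcall(funcname, spec, args, kwnames=()):
    """Bind a fastcall-style call to the parameter names in `spec`.

    `args` holds the positional values followed by the keyword values;
    `kwnames` names those trailing keyword values. Hypothetical model only.
    """
    names, required, kwonly = [], set(), set()
    for entry in spec:
        name = entry.lstrip("|$")
        names.append(name)
        if entry.startswith("$"):
            kwonly.add(name)        # keyword-only (and therefore optional)
        elif not entry.startswith("|"):
            required.add(name)      # no prefix: required

    npos = len(args) - len(kwnames)
    max_pos = len(names) - len(kwonly)  # assumes "$" entries come last
    if npos > max_pos:
        raise TypeError(f"{funcname}() takes at most {max_pos} positional "
                        f"arguments ({npos} given)")

    bound = dict(zip(names, args[:npos]))
    for name, value in zip(kwnames, args[npos:]):
        if name not in names:
            raise TypeError(f"{funcname}() got an unexpected keyword "
                            f"argument '{name}'")
        if name in bound:
            raise TypeError(f"{funcname}() got multiple values for "
                            f"argument '{name}'")
        bound[name] = value

    for name in names:
        if name in required and name not in bound:
            raise TypeError(f"{funcname}() missing required argument '{name}'")
    return bound


spec = ("shape", "|dtype", "|order", "$like")
# Models the call empty((2, 3), "f8", like=None)
print(parse_fastcall("empty", spec, ((2, 3), "f8", None), kwnames=("like",)))
# → {'shape': (2, 3), 'dtype': 'f8', 'like': None}
```

No dictionary is built for the keyword arguments at any point, which is the property that lets the kwarg dictionary paths be dropped.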

Happy to:

  1. Add some micro benchmarks
  2. Add a few words in the empty "internals" doc section

once the merging starts.

Base automatically changed from master to main March 4, 2021 02:04
Member Author

seberg commented Mar 11, 2021

I ran the benchmarks with the full diff (there seem to be some small merge conflicts or just refcount bugs still to straighten out there, though); that is, the timings are with more commits. There are a lot of random fluctuations, e.g. those in time_cont_assign (I was thinking that https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html might point to the solution: using profile-guided optimization! But I am not going to cross that bridge now :) ). So it's hard to say; the time_count_nonzero results for very small arrays are probably real speedups.

(going to update the branch today, but there is probably some fiddling to do with the commits, and then hope that PyPy likes it now :) )
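As a reading aid for the asv comparison that follows: the ratio column is after divided by before, so rows marked + are slower on the branch and rows marked - are faster. The sketch below reproduces that marking for two rows of the table. The 1.05 factor is an assumption inferred from the smallest ratios listed, and real asv output also accounts for measurement uncertainty, which this simplification ignores.

```python
def compare(before, after, factor=1.05):
    """Return (ratio, mark) for one benchmark; both timings in the same unit.

    Simplified model of the table's columns; `factor` is an assumed threshold.
    """
    ratio = after / before
    if ratio > factor:
        mark = "+"   # slower on the branch
    elif ratio < 1 / factor:
        mark = "-"   # faster on the branch
    else:
        mark = " "   # within the threshold, normally not listed
    return ratio, mark


# time_count_nonzero(2, 1000000, int32): 222 μs before, 369 μs after
print("%.2f %s" % compare(222.0, 369.0))   # → 1.66 +
# time_nancumsum(200, 0): 8.65 μs before, 7.84 μs after
print("%.2f %s" % compare(8.65, 7.84))     # → 0.91 -
```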

Details
       before           after         ratio
     [2ea7ebdc]       [03f3710a]
     <main>           <splitup-faster-argparsing>
+        2.91±0μs      5.14±0.01μs     1.76  bench_io.Copy.time_cont_assign('float32')
+      107±0.05μs        181±0.1μs     1.70  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
+         222±1μs        369±0.8μs     1.66  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
+      91.1±0.2μs        149±0.1μs     1.63  bench_core.UnpackBits.time_unpackbits_axis1
+        356±20μs          568±5μs     1.60  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
+      31.4±0.2μs      49.2±0.05μs     1.57  bench_io.CopyTo.time_copyto_sparse
+     5.44±0.02μs      8.33±0.03μs     1.53  bench_core.UnpackBits.time_unpackbits
+        4.19±0μs      6.40±0.01μs     1.53  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
+     6.91±0.07μs      10.4±0.06μs     1.50  bench_ufunc_strides.Unary.time_ufunc('isfinite', 1, 2, 'f')
+     3.14±0.01μs      4.63±0.04μs     1.48  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
+     11.4±0.03μs      16.3±0.02μs     1.43  bench_function_base.Sort.time_argsort('merge', 'int16', ('uniform',))
+     14.1±0.05μs      20.2±0.02μs     1.43  bench_function_base.Sort.time_argsort('merge', 'int64', ('reversed',))
+     11.5±0.04μs      16.3±0.02μs     1.43  bench_function_base.Sort.time_argsort('merge', 'int16', ('ordered',))
+      7.45±0.1μs      10.5±0.09μs     1.41  bench_ufunc_strides.Unary.time_ufunc('isfinite', 1, 4, 'f')
+       344±0.4μs          485±2μs     1.41  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
+         157±4μs          220±2μs     1.40  bench_reduce.AddReduceSeparate.time_reduce(0, 'float64')
+        927±30μs      1.29±0.02ms     1.39  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)
+        2.10±0μs         2.82±0μs     1.34  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
+         133±2μs          174±1μs     1.31  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
+     1.28±0.01μs      1.64±0.01μs     1.29  bench_itemselection.PutMask.time_sparse(True, 'float32')
+     1.29±0.01μs      1.64±0.01μs     1.28  bench_itemselection.PutMask.time_sparse(True, 'int32')
+     1.29±0.02μs      1.63±0.01μs     1.26  bench_itemselection.PutMask.time_sparse(True, 'complex256')
+     13.2±0.09μs      16.7±0.07μs     1.26  bench_ufunc_strides.Unary.time_ufunc('isfinite', 2, 2, 'f')
+     13.3±0.02μs      16.7±0.03μs     1.25  bench_ufunc_strides.Unary.time_ufunc('isfinite', 4, 2, 'f')
+      13.5±0.1μs      16.7±0.03μs     1.23  bench_ufunc_strides.Unary.time_ufunc('isfinite', 2, 4, 'f')
+     1.63±0.02μs         1.99±0μs     1.23  bench_itemselection.PutMask.time_sparse(False, 'longfloat')
+      32.8±0.1μs       40.1±0.1μs     1.22  bench_function_base.Sort.time_argsort('heap', 'int16', ('uniform',))
+     13.8±0.08μs      16.8±0.04μs     1.22  bench_ufunc_strides.Unary.time_ufunc('isfinite', 4, 4, 'f')
+     1.64±0.01μs       1.99±0.1μs     1.21  bench_itemselection.PutMask.time_sparse(False, 'complex128')
+     16.7±0.01μs       20.1±0.1μs     1.20  bench_function_base.Sort.time_argsort('merge', 'float64', ('uniform',))
+     16.7±0.05μs      20.0±0.04μs     1.20  bench_function_base.Sort.time_argsort('merge', 'float64', ('ordered',))
+     13.8±0.05μs      16.5±0.02μs     1.20  bench_ufunc.Custom.time_nonzero
+     16.9±0.05μs      20.2±0.03μs     1.20  bench_function_base.Sort.time_argsort('merge', 'float64', ('reversed',))
+     43.8±0.05μs         51.7±1μs     1.18  bench_core.Core.time_array_float64_l1000
+     4.75±0.02ms      5.57±0.03ms     1.17  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
+      2.88±0.1ms         3.36±0ms     1.17  bench_lib.Pad.time_pad((256, 128, 1), 8, 'linear_ramp')
+         966±1μs      1.13±0.01ms     1.16  bench_core.PackBits.time_packbits_axis0(<class 'numpy.uint64'>)
+      129±0.08μs        150±0.1μs     1.16  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
+      68.1±0.9μs         78.5±3μs     1.15  bench_io.CopyTo.time_copyto_8_sparse
+      69.1±0.2μs       78.2±0.3μs     1.13  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
+         307±2μs          346±1μs     1.13  bench_function_base.Select.time_select_larger
+      6.31±0.2μs       7.09±0.3μs     1.12  bench_indexing.ScalarIndexing.time_assign(0)
+      56.0±0.3μs         62.7±1μs     1.12  bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 100))
+     1.75±0.01μs       1.94±0.2μs     1.11  bench_itemselection.PutMask.time_dense(False, 'complex128')
+      16.6±0.2μs      18.3±0.08μs     1.11  bench_function_base.Where.time_1
+      56.5±0.2μs       62.5±0.3μs     1.11  bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 10))
+      9.48±0.1μs      10.4±0.09μs     1.10  bench_indexing.ScalarIndexing.time_assign_cast(0)
+     33.1±0.03μs      36.2±0.01μs     1.10  bench_function_base.Sort.time_argsort('heap', 'int64', ('uniform',))
+     9.42±0.03ms      10.2±0.01ms     1.09  bench_ufunc.UFunc.time_ufunc_types('sin')
+      62.4±0.5μs       67.5±0.9μs     1.08  bench_core.Core.time_array_float_l1000
+     82.7±0.06μs       89.2±0.2μs     1.08  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
+     36.9±0.03μs      39.8±0.03μs     1.08  bench_function_base.Sort.time_sort('heap', 'float64', ('uniform',))
+         10.2±0s          10.9±0s     1.07  bench_ufunc_strides.Mandelbrot.time_mandel
+        86.5±1μs       92.0±0.3μs     1.06  bench_function_base.Sort.time_argsort('merge', 'int16', ('random',))
+       831±0.3μs          884±4μs     1.06  bench_core.PackBits.time_packbits_axis1(<class 'numpy.uint64'>)
+     42.5±0.03μs      45.2±0.09μs     1.06  bench_core.PackBits.time_packbits(<class 'numpy.uint64'>)
+     42.8±0.03μs       45.4±0.2μs     1.06  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
+      11.2±0.2μs      11.9±0.01μs     1.06  bench_indexing.ScalarIndexing.time_assign_cast(1)
+      56.6±0.3μs      59.9±0.08μs     1.06  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 1000))
+         152±1μs          161±2μs     1.06  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 10))
+     12.3±0.02μs      13.0±0.02μs     1.06  bench_indexing.ScalarIndexing.time_assign_cast(2)
+      98.5±0.2μs       104±0.06μs     1.05  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 100))
+       518±0.3μs        544±0.7μs     1.05  bench_function_base.Sort.time_sort('heap', 'int16', ('reversed',))
+       605±0.4μs        636±0.7μs     1.05  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 100))
+     13.1±0.04μs      13.8±0.06μs     1.05  bench_function_base.Sort.time_argsort('merge', 'int64', ('uniform',))
+        2.18±0μs      2.29±0.01μs     1.05  bench_scalar.ScalarMath.time_power_of_two('float64')
+        1.47±0μs      1.55±0.01μs     1.05  bench_scalar.ScalarMath.time_power_of_two('int64')
+       293±0.3μs          308±3μs     1.05  bench_ufunc.UFunc.time_ufunc_types('greater')
-         473±4μs          450±4μs     0.95  bench_lib.Nan.time_nanmean(200000, 2.0)
-      19.9±0.4μs       18.9±0.1μs     0.95  bench_ma.UFunc.time_scalar_1d(True, True, 10)
-        1.01±0μs          966±8ns     0.95  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True, where=True))
-     15.5±0.04μs      14.7±0.02μs     0.95  bench_function_base.Sort.time_sort('merge', 'float64', ('ordered',))
-      49.4±0.2μs       47.0±0.6μs     0.95  bench_lib.Nan.time_nanvar(200, 0.1)
-         289±1μs        275±0.9μs     0.95  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('float64'), 3000)
-      41.2±0.2μs      39.2±0.04μs     0.95  bench_random.Choice.time_choice(100000000.0)
-     1.75±0.03μs      1.67±0.01μs     0.95  bench_ma.Indexing.time_scalar(False, 2, 100)
-      36.0±0.1μs       34.2±0.4μs     0.95  bench_function_base.Median.time_odd
-      73.9±0.5μs       70.2±0.3μs     0.95  bench_function_base.Percentile.time_quartile
-      28.6±0.1μs       27.1±0.3μs     0.95  bench_ma.UFunc.time_1d(True, False, 1000)
-      55.9±0.2μs       53.1±0.2μs     0.95  bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))
-         916±6ns          870±2ns     0.95  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), out=array(3.)))
-     1.67±0.02μs      1.59±0.03μs     0.95  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
-     29.2±0.08μs       27.7±0.2μs     0.95  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int32'>)
-      52.6±0.3μs      49.9±0.05μs     0.95  bench_lib.Pad.time_pad((4, 4, 4, 4), 1, 'reflect')
-         392±5μs          372±5μs     0.95  bench_lib.Nan.time_nanargmax(200000, 0)
-     1.57±0.06ms         1.49±0ms     0.95  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'mean')
-         413±5μs          391±6μs     0.95  bench_lib.Nan.time_nanargmin(200000, 2.0)
-     14.2±0.03μs       13.4±0.3μs     0.95  bench_ma.MA.time_masked_array_l100
-        81.6±4μs      77.3±0.01μs     0.95  bench_linalg.Linalg.time_op('norm', 'float16')
-      49.7±0.8μs       47.0±0.9μs     0.95  bench_lib.Nan.time_nanvar(200, 0)
-         396±4μs          375±5μs     0.95  bench_lib.Nan.time_nanargmax(200000, 0.1)
-       108±0.2ms        103±0.2ms     0.95  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), (0, 32), 'edge')
-      50.3±0.2μs       47.6±0.3μs     0.95  bench_lib.Nan.time_nanvar(200, 50.0)
-     5.12±0.03μs      4.85±0.06μs     0.95  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float32')
-     1.88±0.02μs      1.78±0.01μs     0.95  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs(range(0, 3))
-     31.5±0.07μs      29.8±0.03μs     0.95  bench_lib.Pad.time_pad((4, 4, 4, 4), 1, 'constant')
-      61.9±0.3μs       58.5±0.8μs     0.95  bench_indexing.Indexing.time_op('indexes_', 'I', '')
-     15.6±0.06μs      14.7±0.07μs     0.95  bench_ma.UFunc.time_scalar_1d(False, False, 10)
-     4.51±0.02μs      4.26±0.04μs     0.95  bench_core.Core.time_identity_100
-         465±1ns          439±7ns     0.95  bench_array_coercion.ArrayCoercionSmall.time_array(array([5]))
-      34.4±0.2μs      32.5±0.03μs     0.94  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 1, 'constant')
-      24.7±0.4μs       23.2±0.2μs     0.94  bench_ma.UFunc.time_scalar_1d(True, False, 10)
-         385±5μs          363±5μs     0.94  bench_lib.Nan.time_nanprod(200000, 0)
-      19.8±0.2μs      18.7±0.02μs     0.94  bench_ma.MA.time_masked_array_l100_t100
-      26.6±0.9μs       25.0±0.2μs     0.94  bench_ma.UFunc.time_1d(False, True, 1000)
-     1.58±0.01μs      1.49±0.03μs     0.94  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None, array(0.)))
-     2.06±0.02μs      1.93±0.01μs     0.94  bench_reduce.AnyAll.time_any_fast
-         389±5μs          366±5μs     0.94  bench_lib.Nan.time_nanprod(200000, 0.1)
-        1.18±0ms      1.11±0.01ms     0.94  bench_lib.Pad.time_pad((256, 128, 1), 8, 'reflect')
-         622±3μs          585±4μs     0.94  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'wrap')
-         398±3μs          375±3μs     0.94  bench_lib.Pad.time_pad((4, 4, 4, 4), 1, 'linear_ramp')
-      43.1±0.2μs       40.6±0.2μs     0.94  bench_ma.Concatenate.time_it('unmasked+masked', 2)
-      23.6±0.2μs       22.2±0.5μs     0.94  bench_ma.UFunc.time_1d(True, False, 10)
-     2.08±0.07μs      1.95±0.03μs     0.94  bench_core.Core.time_array_l_view
-      28.3±0.3μs       26.5±0.1μs     0.94  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>)
-     1.66±0.02μs      1.56±0.03μs     0.94  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-     6.74±0.07μs      6.33±0.02μs     0.94  bench_lib.Nan.time_nanmax(200, 50.0)
-      28.8±0.1ms      27.0±0.07ms     0.94  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('float64'), 300000)
-       318±0.5μs        299±0.5μs     0.94  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-        23.3±1μs       21.9±0.4μs     0.94  bench_ma.UFunc.time_1d(False, True, 100)
-        38.0±1μs       35.6±0.8μs     0.94  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 100))
-         230±2μs          216±2μs     0.94  bench_lib.Pad.time_pad((256, 128, 1), 1, 'mean')
-     2.52±0.01ms      2.36±0.02ms     0.94  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'edge')
-        39.3±1μs       36.8±0.4μs     0.94  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int64'>)
-         466±5μs          436±3μs     0.94  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('complex128'), 3000)
-        1.17±0ms      1.09±0.01ms     0.94  bench_lib.Pad.time_pad((256, 128, 1), 8, 'edge')
-     2.08±0.01μs      1.94±0.01μs     0.94  bench_reduce.AnyAll.time_all_fast
-      8.60±0.2μs      8.05±0.01μs     0.94  bench_core.Core.time_array_l100
-         374±4μs          350±4μs     0.94  bench_lib.Nan.time_nanmean(200000, 0)
-     4.76±0.04μs      4.45±0.04μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex64')
-         473±2μs          442±4μs     0.94  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 1, 'linear_ramp')
-         377±4μs          352±4μs     0.94  bench_lib.Nan.time_nanmean(200000, 0.1)
-     1.80±0.01μs      1.68±0.02μs     0.94  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0))
-      58.2±0.2μs      54.4±0.03μs     0.93  bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
-      34.5±0.3μs       32.2±0.6μs     0.93  bench_random.Choice.time_choice(1000000.0)
-         320±5μs          299±6μs     0.93  bench_lib.Nan.time_nanargmin(200000, 0)
-     2.90±0.02ms         2.71±0ms     0.93  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('float64'), 30000)
-      25.5±0.2μs       23.8±0.1μs     0.93  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int8'>)
-     5.14±0.01μs      4.79±0.02μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int32')
-         627±7ns          585±2ns     0.93  bench_ufunc.Scalar.time_add_scalar
-         570±2μs          531±1μs     0.93  bench_lib.Nan.time_nanstd(200000, 0.1)
-         324±5μs          302±5μs     0.93  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         563±3μs          525±2μs     0.93  bench_lib.Nan.time_nanstd(200000, 0)
-     6.65±0.01μs      6.20±0.02μs     0.93  bench_lib.Nan.time_nanmin(200, 2.0)
-     1.89±0.03μs      1.76±0.03μs     0.93  bench_array_coercion.ArrayCoercionSmall.time_array(range(0, 3))
-     3.20±0.04μs         2.98±0μs     0.93  bench_reduce.SmallReduction.time_small
-      46.7±0.3μs       43.5±0.1μs     0.93  bench_core.VarComplex.time_var(10000)
-      6.94±0.1μs      6.46±0.02μs     0.93  bench_lib.Nan.time_nanmax(200, 90.0)
-     19.8±0.01μs       18.5±0.1μs     0.93  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int16'>)
-     18.9±0.08μs       17.6±0.2μs     0.93  bench_core.Core.time_diagflat_l50_l50
-      25.5±0.1μs       23.7±0.5μs     0.93  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int8'>)
-     4.76±0.02μs      4.43±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float64')
-      39.4±0.2μs      36.6±0.05μs     0.93  bench_ma.Concatenate.time_it('ndarray+masked', 2)
-     4.79±0.03μs      4.45±0.02μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int64')
-        557±10μs        517±0.4μs     0.93  bench_lib.Pad.time_pad((256, 128, 1), 8, 'constant')
-      29.6±0.3μs      27.5±0.08μs     0.93  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int32'>)
-     5.10±0.02μs      4.73±0.04μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float16')
-     28.4±0.04μs      26.3±0.04μs     0.93  bench_random.Choice.time_choice(1000.0)
-      26.6±0.1μs       24.6±0.2μs     0.93  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int16'>)
-      10.4±0.5ms         9.69±0ms     0.93  bench_ufunc.UFunc.time_ufunc_types('cos')
-      28.4±0.7μs       26.3±0.3μs     0.93  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int64'>)
-        992±10ns          920±5ns     0.93  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), out=array(3.), subok=True, where=True))
-      26.7±0.1μs       24.7±0.1μs     0.93  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int16'>)
-         567±3μs          525±2μs     0.93  bench_lib.Nan.time_nanvar(200000, 0.1)
-         887±4μs          821±3μs     0.93  bench_lib.Pad.time_pad((256, 128, 1), 8, 'mean')
-     1.71±0.02μs      1.58±0.03μs     0.92  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
-     6.58±0.03μs      6.09±0.03μs     0.92  bench_lib.Nan.time_nanmax(200, 0.1)
-      23.8±0.2μs      22.0±0.07μs     0.92  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'bool'>)
-     6.57±0.05μs      6.07±0.04μs     0.92  bench_lib.Nan.time_nanmin(200, 0)
-       113±0.4μs        104±0.7μs     0.92  bench_lib.Pad.time_pad((4, 4, 4, 4), 1, 'mean')
-     6.99±0.06μs      6.46±0.04μs     0.92  bench_lib.Nan.time_nanmin(200, 90.0)
-         560±2μs          517±2μs     0.92  bench_lib.Nan.time_nanvar(200000, 0)
-     6.73±0.04μs      6.21±0.02μs     0.92  bench_lib.Nan.time_nanmax(200, 2.0)
-         117±1μs        108±0.5μs     0.92  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 1, 'mean')
-     9.68±0.02μs      8.93±0.06μs     0.92  bench_lib.Unique.time_unique(200, 0.1)
-      14.2±0.1μs       13.1±0.2μs     0.92  bench_lib.Unique.time_unique(200, 50.0)
-     5.13±0.01μs      4.73±0.03μs     0.92  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int16')
-     4.04±0.02μs      3.72±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex256')
-      4.06±0.1μs      3.73±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'longfloat')
-      6.63±0.2μs         6.10±0μs     0.92  bench_lib.Nan.time_nanmax(200, 0)
-       319±0.9μs        294±0.8μs     0.92  bench_function_base.Sort.time_argsort('quick', 'int16', ('sorted_block', 1000))
-      6.60±0.1μs      6.08±0.01μs     0.92  bench_lib.Nan.time_nanmin(200, 0.1)
-      39.4±0.8μs       36.2±0.1μs     0.92  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 1000))
-      20.3±0.2μs       18.7±0.4μs     0.92  bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float64'>)
-      23.7±0.2μs      21.8±0.09μs     0.92  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'bool'>)
-      22.1±0.1μs      20.3±0.04μs     0.92  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int32'>)
-        344±30μs        316±0.1μs     0.92  bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint32'>, 1535])
-     4.04±0.02μs      3.71±0.01μs     0.92  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'longfloat')
-      9.68±0.2μs       8.89±0.1μs     0.92  bench_lib.Unique.time_unique(200, 0)
-      14.6±0.2μs       13.4±0.2μs     0.92  bench_lib.Unique.time_unique(200, 2.0)
-     17.8±0.03μs      16.3±0.04μs     0.92  bench_ufunc_strides.Unary.time_ufunc('isfinite', 4, 4, 'd')
-      47.0±0.4ms       43.2±0.1ms     0.92  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('complex128'), 300000)
-     1.16±0.02ms         1.06±0ms     0.92  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'constant')
-     6.84±0.02μs      6.27±0.07μs     0.92  bench_lib.Nan.time_nanmin(200, 50.0)
-     24.1±0.06μs      22.1±0.09μs     0.92  bench_lib.Nan.time_nanmean(200, 90.0)
-     4.70±0.04ms      4.31±0.01ms     0.92  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('complex128'), 30000)
-     1.74±0.04μs      1.60±0.02μs     0.92  bench_core.Core.time_array_l
-     12.2±0.01μs      11.1±0.09μs     0.92  bench_function_base.Sort.time_sort('merge', 'int64', ('reversed',))
-      15.4±0.1μs      14.1±0.03μs     0.92  bench_lib.Nan.time_nanargmin(200, 0.1)
-     15.7±0.04μs      14.3±0.02μs     0.91  bench_lib.Nan.time_nanargmax(200, 2.0)
-     3.99±0.02μs      3.65±0.03μs     0.91  bench_core.Core.time_eye_100
-     22.0±0.08μs      20.1±0.06μs     0.91  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int32'>)
-     3.67±0.02ms      3.36±0.02ms     0.91  bench_indexing.IndexingSeparate.time_mmap_fancy_indexing
-     1.73±0.01μs      1.58±0.03μs     0.91  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), out=array(0.)))
-      7.31±0.4μs      6.68±0.03μs     0.91  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex256')
-     20.3±0.03μs      18.5±0.02μs     0.91  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int16'>)
-     23.9±0.06μs       21.8±0.3μs     0.91  bench_lib.Nan.time_nanmean(200, 2.0)
-     19.5±0.06μs      17.8±0.05μs     0.91  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int8'>)
-      38.0±0.8μs       34.7±0.2μs     0.91  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float32'>)
-         231±5μs          210±5μs     0.91  bench_lib.Nan.time_nansum(200000, 0.1)
-     19.4±0.05μs       17.7±0.1μs     0.91  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int8'>)
-     4.09±0.06μs      3.72±0.02μs     0.91  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex128')
-      27.7±0.3μs       25.2±0.5μs     0.91  bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float64'>)
-     15.8±0.07μs      14.4±0.02μs     0.91  bench_lib.Nan.time_nanargmax(200, 90.0)
-      24.1±0.3μs      22.0±0.04μs     0.91  bench_function_base.Median.time_odd_small
-      80.3±0.3ms       73.0±0.8ms     0.91  bench_app.LaplaceInplace.time_it('inplace')
-      15.5±0.1μs      14.1±0.02μs     0.91  bench_lib.Nan.time_nanargmin(200, 2.0)
-     23.9±0.09μs       21.8±0.3μs     0.91  bench_lib.Nan.time_nanmean(200, 0)
-      15.5±0.1μs      14.1±0.08μs     0.91  bench_lib.Nan.time_nanargmax(200, 0.1)
-     38.7±0.08μs       35.1±0.1μs     0.91  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>)
-      15.5±0.1μs       14.1±0.1μs     0.91  bench_lib.Nan.time_nanargmin(200, 0)
-      13.7±0.2μs       12.5±0.2μs     0.91  bench_lib.Unique.time_unique(200, 90.0)
-      36.9±0.2μs         33.5±1μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
-      9.01±0.2μs      8.18±0.05μs     0.91  bench_lib.Nan.time_nancumprod(200, 90.0)
-         227±5μs          206±6μs     0.91  bench_lib.Nan.time_nansum(200000, 0)
-     24.1±0.03μs      21.9±0.05μs     0.91  bench_function_base.Median.time_even_small
-      112±0.08μs        102±0.2μs     0.91  bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',))
-     16.0±0.06μs      14.5±0.03μs     0.91  bench_lib.Nan.time_nanargmax(200, 50.0)
-      41.3±0.2μs       37.5±0.1μs     0.91  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float64'>)
-     8.65±0.01μs         7.84±0μs     0.91  bench_lib.Nan.time_nancumsum(200, 0)
-      15.8±0.1μs      14.3±0.01μs     0.91  bench_lib.Nan.time_nanargmin(200, 50.0)
-     1.87±0.01μs      1.70±0.01μs     0.91  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0, dtype=None))
-      8.68±0.2μs      7.85±0.02μs     0.90  bench_lib.Nan.time_nancumsum(200, 0.1)
-     15.7±0.05μs      14.2±0.03μs     0.90  bench_lib.Nan.time_nanargmin(200, 90.0)
-      32.5±0.5μs       29.4±0.6μs     0.90  bench_lib.Nan.time_nanmedian(200, 50.0)
-       110±0.4μs       99.1±0.4μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int16', ('reversed',))
-     9.13±0.07μs      8.25±0.02μs     0.90  bench_lib.Nan.time_nancumsum(200, 50.0)
-      8.97±0.2μs      8.11±0.01μs     0.90  bench_lib.Nan.time_nancumsum(200, 90.0)
-         578±4ns          522±4ns     0.90  bench_core.Core.time_empty_100
-     24.5±0.09μs      22.1±0.04μs     0.90  bench_lib.Nan.time_nanmean(200, 50.0)
-      9.15±0.2μs      8.26±0.05μs     0.90  bench_lib.Nan.time_nancumprod(200, 50.0)
-     9.77±0.07μs      8.81±0.01μs     0.90  bench_lib.Nan.time_nansum(200, 90.0)
-      17.6±0.2μs      15.9±0.05μs     0.90  bench_ufunc_strides.Unary.time_ufunc('isfinite', 2, 4, 'd')
-     17.9±0.04μs      16.1±0.01μs     0.90  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'bool'>)
-      10.1±0.1μs      9.06±0.03μs     0.90  bench_ma.UFunc.time_scalar(False, False, 100)
-     32.2±0.07μs         29.0±1μs     0.90  bench_lib.Nan.time_nanmedian(200, 90.0)
-     1.82±0.03μs      1.64±0.02μs     0.90  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0, dtype=None, out=array(0.)))
-      8.74±0.1μs      7.86±0.01μs     0.90  bench_lib.Nan.time_nancumsum(200, 2.0)
-      8.68±0.2μs      7.81±0.01μs     0.90  bench_lib.Nan.time_nancumprod(200, 2.0)
-         480±2ns          432±6ns     0.90  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs(1)
-     9.55±0.05μs      8.59±0.01μs     0.90  bench_lib.Nan.time_nanprod(200, 0)
-      8.70±0.2μs      7.82±0.02μs     0.90  bench_lib.Nan.time_nancumprod(200, 0)
-      38.1±0.1μs       34.3±0.2μs     0.90  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
-      57.3±0.8μs       51.5±0.3μs     0.90  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
-     1.85±0.05ms      1.66±0.05ms     0.90  bench_lib.Pad.time_pad((4, 4, 4, 4), (0, 32), 'mean')
-      41.6±0.2μs       37.3±0.6μs     0.90  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float64'>)
-      31.5±0.3ms       28.3±0.2ms     0.90  bench_linalg.Eindot.time_einsum_ijk_jil_kl
-      17.3±0.1μs      15.5±0.03μs     0.90  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int64'>)
-      36.6±0.5μs       32.8±0.5μs     0.90  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
-     10.0±0.03μs      9.00±0.03μs     0.90  bench_lib.Nan.time_nanprod(200, 50.0)
-      68.2±0.6μs         61.1±1μs     0.90  bench_lib.Nan.time_nanpercentile(200, 0.1)
-      24.2±0.2μs       21.7±0.1μs     0.90  bench_lib.Nan.time_nanmean(200, 0.1)
-      74.8±0.3μs       67.0±0.5μs     0.90  bench_lib.Nan.time_nanquantile(200, 90.0)
-      8.75±0.2μs      7.83±0.02μs     0.90  bench_lib.Nan.time_nancumprod(200, 0.1)
-         878±4ns          786±9ns     0.89  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))
-      30.8±0.3μs       27.5±0.8μs     0.89  bench_ma.UFunc.time_scalar_1d(False, True, 1000)
-     9.57±0.03μs      8.56±0.03μs     0.89  bench_lib.Nan.time_nansum(200, 2.0)
-      74.2±0.2μs       66.3±0.2μs     0.89  bench_lib.Nan.time_nanpercentile(200, 90.0)
-     9.98±0.09μs      8.92±0.01μs     0.89  bench_lib.Nan.time_nansum(200, 50.0)
-      75.3±0.2μs       67.3±0.7μs     0.89  bench_lib.Nan.time_nanquantile(200, 2.0)
-      74.0±0.4μs       66.1±0.6μs     0.89  bench_lib.Nan.time_nanpercentile(200, 50.0)
-     18.0±0.04μs       16.1±0.1μs     0.89  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'bool'>)
-      17.8±0.6μs      15.9±0.03μs     0.89  bench_core.Core.time_diagflat_l100
-        869±20ns          776±3ns     0.89  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs([1])
-      75.6±0.4μs       67.6±0.6μs     0.89  bench_lib.Nan.time_nanquantile(200, 50.0)
-      67.4±0.5μs       60.1±0.6μs     0.89  bench_lib.Nan.time_nanpercentile(200, 0)
-      26.0±0.2μs      23.2±0.07μs     0.89  bench_ma.UFunc.time_scalar_1d(False, True, 10)
-      10.1±0.1μs      9.02±0.07μs     0.89  bench_ma.UFunc.time_scalar(False, False, 10)
-        995±10ns          886±2ns     0.89  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), array(3.), subok=True, where=True))
-     13.3±0.04μs       11.8±0.1μs     0.89  bench_ma.UFunc.time_scalar(False, True, 1000)
-         490±5ns          436±6ns     0.89  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs(5)
-      10.1±0.1μs      8.98±0.05μs     0.89  bench_ma.UFunc.time_scalar(False, False, 1000)
-      15.7±0.2μs      13.9±0.04μs     0.89  bench_lib.Nan.time_nanargmax(200, 0)
-     25.4±0.09μs       22.6±0.4μs     0.89  bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float32'>)
-     5.93±0.02μs      5.27±0.03μs     0.89  bench_ma.MA.time_masked_array
-     9.93±0.02μs      8.82±0.04μs     0.89  bench_lib.Nan.time_nanprod(200, 90.0)
-     9.51±0.02μs      8.45±0.09μs     0.89  bench_lib.Nan.time_nanprod(200, 0.1)
-      18.1±0.3μs       16.1±0.1μs     0.89  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int64'>)
-     13.3±0.05μs      11.8±0.09μs     0.89  bench_ma.UFunc.time_scalar(True, False, 100)
-      26.2±0.2μs      23.3±0.09μs     0.89  bench_ma.UFunc.time_scalar_1d(False, True, 100)
-         602±3ns          535±5ns     0.89  bench_core.Core.time_zeros_100
-      32.9±0.3μs       29.2±0.6μs     0.89  bench_lib.Nan.time_nanmedian(200, 2.0)
-     13.3±0.04μs       11.8±0.1μs     0.89  bench_ma.UFunc.time_scalar(True, False, 1000)
-     13.3±0.03μs      11.8±0.09μs     0.89  bench_ma.UFunc.time_scalar(True, False, 10)
-      4.22±0.2μs      3.74±0.02μs     0.89  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex128')
-     9.70±0.03μs      8.59±0.02μs     0.89  bench_lib.Nan.time_nanprod(200, 2.0)
-      70.2±0.3μs       62.1±0.7μs     0.88  bench_lib.Nan.time_nanquantile(200, 0)
-      38.1±0.2μs       33.7±0.1μs     0.88  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
-     13.3±0.08μs       11.8±0.1μs     0.88  bench_ma.UFunc.time_scalar(False, True, 100)
-      4.23±0.2μs      3.74±0.02μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex256')
-        57.9±1μs       51.1±0.3μs     0.88  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
-     5.37±0.04μs      4.73±0.03μs     0.88  bench_core.Core.time_dstack_l
-        25.3±1μs         22.3±1μs     0.88  bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float64'>)
-      69.9±0.5μs       61.5±0.6μs     0.88  bench_lib.Nan.time_nanquantile(200, 0.1)
-     3.05±0.02μs      2.67±0.04μs     0.88  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'longfloat')
-      28.1±0.5μs       24.6±0.8μs     0.88  bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float32'>)
-      13.6±0.2μs      11.9±0.05μs     0.88  bench_core.Core.time_diag_l100
-         433±1ns        380±0.5ns     0.88  bench_array_coercion.ArrayCoercionSmall.time_array(5)
-     13.6±0.06μs       12.0±0.2μs     0.88  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int16'>)
-         628±3ns          551±3ns     0.88  bench_core.Core.time_array_empty
-      13.1±0.3ms      11.5±0.07ms     0.88  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
-     9.63±0.02μs      8.43±0.04μs     0.88  bench_lib.Nan.time_nansum(200, 0)
-     3.05±0.01μs      2.67±0.03μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex128')
-      74.0±0.1μs       64.7±0.2μs     0.87  bench_lib.Nan.time_nanpercentile(200, 2.0)
-     9.70±0.08μs      8.47±0.01μs     0.87  bench_lib.Nan.time_nansum(200, 0.1)
-     14.2±0.04μs      12.4±0.09μs     0.87  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int32'>)
-     4.78±0.02μs      4.17±0.04μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int64')
-      14.8±0.2μs       12.9±0.4μs     0.87  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int32'>)
-     13.4±0.02μs      11.7±0.03μs     0.87  bench_ma.UFunc.time_scalar(False, True, 10)
-      13.3±0.1μs      11.6±0.03μs     0.87  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int16'>)
-      30.6±0.3μs       26.6±0.6μs     0.87  bench_lib.Nan.time_nanmedian(200, 0)
-        577±10ns         502±20ns     0.87  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs(array([5]))
-     1.69±0.02μs      1.47±0.02μs     0.87  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int16')
-     4.84±0.04μs      4.20±0.03μs     0.87  bench_core.Core.time_vstack_l
-     4.76±0.03μs      4.13±0.04μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex64')
-     5.10±0.01μs      4.43±0.05μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int32')
-     30.9±0.06μs       26.8±0.3μs     0.87  bench_lib.Nan.time_nanmedian(200, 0.1)
-     2.67±0.01μs      2.31±0.02μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex64')
-     5.11±0.02μs      4.43±0.03μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float32')
-      23.5±0.1μs       20.4±0.2μs     0.87  bench_core.VarComplex.time_var(1000)
-     5.11±0.02μs      4.43±0.01μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int32')
-       211±0.3ms          183±2ms     0.87  bench_function_base.Histogram2D.time_fine_binning
-     5.11±0.02μs      4.43±0.01μs     0.87  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float32')
-     2.75±0.01μs      2.37±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int64')
-     2.80±0.01μs      2.42±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float16')
-     1.70±0.01μs      1.46±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float16')
-     2.69±0.02μs      2.32±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int64')
-     2.68±0.01μs      2.31±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float64')
-     2.75±0.02μs      2.37±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex64')
-     13.5±0.05μs      11.6±0.05μs     0.86  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int8'>)
-     2.81±0.01μs      2.42±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int16')
-     2.75±0.02μs      2.37±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float64')
-     1.91±0.01ms      1.64±0.01ms     0.86  bench_lib.Pad.time_pad((4, 4, 4, 4), (0, 32), 'edge')
-     2.64±0.02μs      2.28±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float32')
-     2.69±0.01μs      2.31±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int16')
-     13.1±0.03μs       11.2±0.1μs     0.86  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int8'>)
-      20.5±0.2μs       17.6±0.3μs     0.86  bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float32'>)
-     4.76±0.01μs      4.09±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float64')
-     4.76±0.03μs      4.08±0.01μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'complex64')
-      1.26±0.1ms      1.08±0.06ms     0.86  bench_lib.Pad.time_pad((256, 128, 1), (0, 32), 'edge')
-     5.10±0.01μs      4.37±0.03μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int16')
-     2.69±0.01μs      2.31±0.03μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float16')
-     2.71±0.02μs      2.32±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float32')
-     5.11±0.01μs      4.38±0.03μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int16')
-     5.11±0.01μs      4.37±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float16')
-     2.91±0.07μs      2.49±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex128')
-     4.77±0.01μs      4.08±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float64')
-     2.72±0.02μs      2.32±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int32')
-        664±10ns         567±10ns     0.85  bench_core.Core.time_arange_100
-         466±1ns          399±5ns     0.85  bench_array_coercion.ArrayCoercionSmall.time_array(1)
-     6.99±0.03μs      5.97±0.05μs     0.85  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex256')
-     2.14±0.03μs      1.82±0.01μs     0.85  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(range(0, 3))
-     4.06±0.03μs      3.47±0.03μs     0.85  bench_core.Core.time_hstack_l
-      9.74±0.1μs      8.32±0.08μs     0.85  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'str'>)
-     2.66±0.03μs      2.27±0.03μs     0.85  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int32')
-      1.27±0.1ms      1.08±0.06ms     0.85  bench_lib.Pad.time_pad((256, 128, 1), (0, 32), 'reflect')
-     5.12±0.02μs      4.36±0.04μs     0.85  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float16')
-     4.78±0.01μs      4.07±0.02μs     0.85  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int64')
-     2.91±0.01μs      2.47±0.02μs     0.85  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'longfloat')
-      20.5±0.2μs      17.4±0.02μs     0.85  bench_core.VarComplex.time_var(100)
-     11.6±0.06μs      9.80±0.07μs     0.85  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'object'>)
-     12.0±0.03μs      10.1±0.06μs     0.85  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'bool'>)
-      9.65±0.1μs      8.15±0.05μs     0.84  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'str'>)
-     11.7±0.04μs      9.91±0.07μs     0.84  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'object'>)
-     2.40±0.02μs      2.02±0.02μs     0.84  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex64')
-     2.17±0.01μs      1.83±0.01μs     0.84  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(range(0, 3))
-     1.99±0.01μs      1.68±0.01μs     0.84  bench_itemselection.PutMask.time_dense(False, 'float16')
-     2.00±0.02μs      1.68±0.01μs     0.84  bench_itemselection.PutMask.time_dense(False, 'int16')
-     2.40±0.03μs      2.02±0.03μs     0.84  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int64')
-       535±0.6μs        450±0.3μs     0.84  bench_linalg.Eindot.time_einsum_i_ij_j
-        773±10ns          649±3ns     0.84  bench_core.Core.time_array_l1
-     2.40±0.02μs      2.01±0.02μs     0.84  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float64')
-        842±10ns          706±3ns     0.84  bench_array_coercion.ArrayCoercionSmall.time_array([1])
-     2.18±0.04μs      1.83±0.02μs     0.84  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(range(0, 3))
-     20.2±0.07μs      16.9±0.07μs     0.84  bench_core.VarComplex.time_var(10)
-      127±0.07μs        107±0.5μs     0.84  bench_function_base.Bincount.time_bincount
-     8.92±0.07μs      7.47±0.06μs     0.84  bench_core.Core.time_triu_l10x10
-     4.06±0.01μs      3.39±0.02μs     0.84  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex128')
-      12.4±0.1μs      10.4±0.09μs     0.84  bench_ufunc_strides.Unary.time_ufunc('isfinite', 1, 4, 'd')
-     4.06±0.01μs      3.38±0.02μs     0.83  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'longfloat')
-     4.07±0.01μs      3.39±0.02μs     0.83  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex256')
-     78.0±0.03μs       64.8±0.2μs     0.83  bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))
-     10.1±0.02μs      8.37±0.03μs     0.83  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'object'>)
-      25.0±0.3μs      20.7±0.05μs     0.83  bench_linalg.Linalg.time_op('norm', 'int64')
-     8.90±0.05μs      7.38±0.03μs     0.83  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'str'>)
-     9.05±0.04μs       7.49±0.1μs     0.83  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'str'>)
-      8.98±0.1μs      7.43±0.01μs     0.83  bench_core.Core.time_tril_l10x10
-     2.10±0.01μs      1.73±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex64')
-      77.3±0.2μs      63.6±0.05μs     0.82  bench_function_base.Sort.time_argsort('quick', 'int16', ('ordered',))
-     9.97±0.04μs      8.20±0.03μs     0.82  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'object'>)
-     1.98±0.01μs      1.63±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex64')
-      7.93±0.2μs       6.51±0.1μs     0.82  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'str'>)
-     17.6±0.02μs       14.4±0.1μs     0.82  bench_ufunc_strides.Unary.time_ufunc('isfinite', 4, 2, 'd')
-         870±2ns          714±3ns     0.82  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), array(3.)))
-     2.10±0.01μs      1.72±0.03μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float64')
-     2.11±0.02μs      1.72±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'longfloat')
-     2.11±0.01μs      1.72±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex128')
-        1.99±0μs      1.63±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float64')
-     1.99±0.01μs      1.63±0.03μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int64')
-     2.01±0.02μs      1.64±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex128')
-      12.8±0.1μs      10.4±0.02μs     0.82  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'bool'>)
-     17.0±0.04μs      13.8±0.06μs     0.81  bench_ufunc_strides.Unary.time_ufunc('isinf', 4, 4, 'f')
-     1.88±0.01μs      1.53±0.03μs     0.81  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int32')
-     1.88±0.01μs      1.53±0.02μs     0.81  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float32')
-      16.9±0.1μs      13.7±0.06μs     0.81  bench_ufunc_strides.Unary.time_ufunc('isinf', 2, 4, 'f')
-     1.94±0.02μs      1.58±0.03μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int16')
-        1.89±0μs      1.53±0.02μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int64')
-     2.12±0.01μs      1.72±0.03μs     0.81  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int64')
-     1.89±0.01μs      1.53±0.02μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex64')
-     1.89±0.02μs      1.53±0.02μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float64')
-     2.27±0.01μs      1.83±0.01μs     0.81  bench_array_coercion.ArrayCoercionSmall.time_array_subok(range(0, 3))
-     2.02±0.02μs      1.63±0.03μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'longfloat')
-     11.3±0.06μs      9.10±0.09μs     0.81  bench_shape_base.Block.time_no_lists(10)
-     16.9±0.06μs       13.6±0.3μs     0.81  bench_ufunc_strides.Unary.time_ufunc('isnan', 2, 4, 'f')
-     1.95±0.02μs      1.57±0.03μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float16')
-     17.2±0.09μs       13.9±0.2μs     0.81  bench_ufunc_strides.Unary.time_ufunc('isnan', 4, 4, 'f')
-     1.93±0.01μs      1.55±0.03μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float64')
-     1.93±0.01μs      1.55±0.03μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float32')
-     1.93±0.01μs      1.55±0.03μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex64')
-     1.91±0.02μs      1.53±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int32')
-     1.93±0.01μs      1.55±0.03μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int64')
-     1.89±0.01μs      1.51±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int16')
-     1.89±0.02μs      1.51±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int32')
-      7.81±0.1μs       6.25±0.1μs     0.80  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'str'>)
-     1.88±0.02μs      1.51±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int16')
-     1.93±0.01μs      1.54±0.03μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int32')
-     1.91±0.01μs      1.53±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float16')
-     1.89±0.01μs      1.51±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float16')
-     1.89±0.01μs      1.51±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float32')
-     1.89±0.01μs      1.50±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float16')
-     1.92±0.01μs      1.53±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float32')
-     1.91±0.01μs      1.52±0.02μs     0.80  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int16')
-     1.93±0.01μs      1.53±0.01μs     0.80  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
-     15.9±0.08μs      12.6±0.03μs     0.80  bench_ufunc_strides.Unary.time_ufunc('signbit', 4, 4, 'f')
-      16.6±0.1μs       13.2±0.6μs     0.80  bench_ufunc_strides.Unary.time_ufunc('isnan', 2, 2, 'f')
-     16.8±0.05μs       13.3±0.3μs     0.79  bench_ufunc_strides.Unary.time_ufunc('isnan', 4, 2, 'f')
-     16.8±0.08μs      13.3±0.04μs     0.79  bench_ufunc_strides.Unary.time_ufunc('isinf', 4, 2, 'f')
-     1.63±0.01μs      1.29±0.02μs     0.79  bench_itemselection.PutMask.time_sparse(True, 'float16')
-     16.6±0.02μs      13.1±0.03μs     0.79  bench_ufunc_strides.Unary.time_ufunc('isinf', 2, 2, 'f')
-     17.3±0.09μs      13.7±0.08μs     0.79  bench_ufunc_strides.Unary.time_ufunc('isfinite', 2, 2, 'd')
-     15.6±0.08μs      12.3±0.02μs     0.79  bench_ufunc_strides.Unary.time_ufunc('signbit', 2, 4, 'f')
-     8.37±0.02μs      6.61±0.02μs     0.79  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>)
-     1.63±0.01μs      1.29±0.01μs     0.79  bench_itemselection.PutMask.time_sparse(True, 'int16')
-     2.49±0.03μs      1.96±0.01μs     0.79  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(range(0, 3))
-     1.75±0.02μs      1.37±0.03μs     0.78  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float32')
-     1.76±0.02μs      1.37±0.03μs     0.78  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int64')
-     1.76±0.02μs      1.37±0.04μs     0.78  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex64')
-      19.3±0.2μs       15.1±0.3μs     0.78  bench_function_base.Sort.time_sort('merge', 'float64', ('reversed',))
-     15.6±0.06μs      12.1±0.07μs     0.78  bench_ufunc_strides.Unary.time_ufunc('signbit', 4, 2, 'f')
-     8.13±0.07μs      6.34±0.03μs     0.78  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'object'>)
-     1.75±0.02μs      1.36±0.03μs     0.78  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int32')
-     1.02±0.03ms         791±30μs     0.78  bench_lib.Pad.time_pad((4, 4, 4, 4), (0, 32), 'constant')
-     15.4±0.07μs      11.9±0.05μs     0.77  bench_ufunc_strides.Unary.time_ufunc('signbit', 2, 2, 'f')
-     1.77±0.04μs      1.37±0.03μs     0.77  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float64')
-      10.4±0.2μs       7.99±0.1μs     0.77  bench_shape_base.Block.time_no_lists(1)
-         358±6ns          273±7ns     0.76  bench_core.Core.time_array_1
-     2.99±0.06ms      2.28±0.03ms     0.76  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float64'>)
-      91.0±0.6μs       69.0±0.8μs     0.76  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 1000))
-     2.33±0.02μs      1.76±0.02μs     0.76  bench_array_coercion.ArrayCoercionSmall.time_asarray(range(0, 3))
-     6.78±0.02μs       5.12±0.1μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int16'>)
-     6.86±0.03μs       5.17±0.1μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int16'>)
-     7.07±0.01μs      5.33±0.05μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int64'>)
-        1.02±0ms          771±5μs     0.75  bench_lib.Pad.time_pad((256, 128, 1), (0, 32), 'mean')
-     6.94±0.05μs      5.22±0.04μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int64'>)
-     13.0±0.03μs      9.80±0.04μs     0.75  bench_ufunc_strides.Unary.time_ufunc('signbit', 1, 4, 'f')
-     6.97±0.05μs       5.23±0.1μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int32'>)
-      6.77±0.1μs         5.07±0μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int64'>)
-     6.71±0.01μs      5.03±0.06μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int8'>)
-     6.62±0.02μs      4.95±0.08μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int64'>)
-     6.79±0.03μs      5.08±0.09μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int32'>)
-     6.69±0.07μs       5.00±0.1μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int16'>)
-     2.35±0.04μs      1.76±0.03μs     0.75  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(range(0, 3))
-     3.35±0.02μs      2.50±0.01μs     0.75  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'longfloat')
-     6.33±0.02μs      4.73±0.07μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int16'>)
-     6.98±0.04μs       5.21±0.1μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int16'>)
-     6.86±0.01μs       5.11±0.1μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int32'>)
-     6.86±0.08μs      5.10±0.05μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int8'>)
-     3.36±0.02μs      2.49±0.02μs     0.74  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex128')
-     6.99±0.02μs      5.18±0.05μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int64'>)
-     6.95±0.03μs      5.15±0.07μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int8'>)
-     6.38±0.01μs      4.73±0.04μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int64'>)
-     2.61±0.01μs      1.93±0.03μs     0.74  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int32')
-     6.62±0.07μs      4.89±0.06μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int32'>)
-     6.74±0.01μs      4.98±0.07μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int8'>)
-        6.37±0μs      4.70±0.06μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int32'>)
-        2.83±0μs      2.08±0.01μs     0.74  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
-     2.61±0.01μs      1.92±0.03μs     0.74  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float32')
-         319±5ns         235±10ns     0.74  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(array([5]))
-     6.78±0.09μs      4.99±0.02μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int32'>)
-     6.37±0.08μs      4.68±0.05μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int8'>)
-     12.8±0.02μs      9.42±0.05μs     0.73  bench_ufunc_strides.Unary.time_ufunc('signbit', 1, 2, 'f')
-     8.73±0.01μs      6.40±0.02μs     0.73  bench_core.UnpackBits.time_unpackbits_little
-     6.74±0.04μs       4.94±0.1μs     0.73  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int16'>)
-     6.20±0.04μs      4.54±0.03μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'bool'>)
-     2.24±0.01μs      1.64±0.02μs     0.73  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int64')
-     1.08±0.02μs          790±1ns     0.73  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy([1])
-     2.24±0.01μs      1.63±0.03μs     0.73  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float64')
-     2.24±0.01μs      1.63±0.03μs     0.73  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex64')
-     2.26±0.01μs      1.64±0.03μs     0.73  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'longfloat')
-     6.41±0.02μs      4.64±0.05μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'bool'>)
-     6.75±0.08μs      4.89±0.04μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int8'>)
-         324±7μs          234±3μs     0.72  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float64'>)
-     6.39±0.01μs      4.62±0.05μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'bool'>)
-     2.26±0.03μs      1.63±0.03μs     0.72  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex128')
-      6.12±0.1μs      4.41±0.01μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'bool'>)
-     6.22±0.01μs      4.46±0.05μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'bool'>)
-     5.80±0.08μs      4.16±0.03μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'bool'>)
-     1.86±0.01μs      1.34±0.02μs     0.72  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float32')
-     1.86±0.01μs      1.33±0.03μs     0.72  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int16')
-     1.87±0.01μs      1.33±0.02μs     0.71  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int32')
-     1.86±0.01μs      1.33±0.02μs     0.71  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float16')
-         756±9μs          531±9μs     0.70  bench_lib.Pad.time_pad((256, 128, 1), (0, 32), 'constant')
-      10.6±0.1μs      7.46±0.07μs     0.70  bench_ufunc_strides.Unary.time_ufunc('isinf', 1, 4, 'f')
-     2.48±0.03μs      1.74±0.03μs     0.70  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray(range(0, 3))
-      10.5±0.1μs      7.31±0.05μs     0.70  bench_ufunc_strides.Unary.time_ufunc('isnan', 1, 4, 'f')
-      149±0.04μs        104±0.1μs     0.70  bench_core.UnpackBits.time_unpackbits_axis1_little
-     3.73±0.01μs      2.60±0.01μs     0.70  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
-      12.0±0.1μs      8.29±0.02μs     0.69  bench_ufunc_strides.Unary.time_ufunc('isfinite', 1, 2, 'd')
-     1.19±0.01μs          821±4ns     0.69  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype([1])
-     1.19±0.01μs         822±10ns     0.69  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype([1])
-     10.4±0.01μs      7.05±0.04μs     0.68  bench_ufunc_strides.Unary.time_ufunc('isinf', 1, 2, 'f')
-     10.2±0.02μs      6.81±0.04μs     0.67  bench_ufunc_strides.Unary.time_ufunc('isnan', 1, 2, 'f')
-     1.20±0.02μs          799±1ns     0.66  bench_array_coercion.ArrayCoercionSmall.time_array_subok([1])
-     1.48±0.02μs          976±9ns     0.66  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs([1])
-      11.9±0.2μs      7.73±0.07μs     0.65  bench_function_base.Sort.time_sort('merge', 'int64', ('uniform',))
-     11.8±0.03μs      7.54±0.07μs     0.64  bench_function_base.Sort.time_sort('merge', 'int64', ('ordered',))
-       306±0.7μs        195±0.7μs     0.64  bench_core.Indices.time_indices
-       660±0.8ns          409±4ns     0.62  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(5)
-        714±10ns          436±1ns     0.61  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(1)
-         278±1μs        169±0.9μs     0.61  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
-     91.7±0.09μs      54.5±0.05μs     0.59  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
-       183±0.1μs        109±0.5μs     0.59  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
-         812±1ns         482±10ns     0.59  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(5)
-         973±1ns          574±9ns     0.59  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(1)
-         978±2ns          571±8ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(range(0, 3))
-     10.2±0.02μs      5.91±0.01μs     0.58  bench_function_base.Sort.time_sort('merge', 'int16', ('ordered',))
-         821±2ns         477±10ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(5)
-     10.2±0.01μs      5.91±0.01μs     0.58  bench_function_base.Sort.time_sort('merge', 'int16', ('uniform',))
-         986±1ns          572±9ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(5)
-         981±1ns          569±5ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg([1])
-         919±5ns         532±20ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(array([5]))
-         920±9ns         532±20ns     0.58  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(array([5]))
-         799±5ns          458±6ns     0.57  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(1)
-         798±2ns          456±4ns     0.57  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(1)
-     1.26±0.02μs          716±1ns     0.57  bench_array_coercion.ArrayCoercionSmall.time_asarray([1])
-        10.2±0μs      5.76±0.01μs     0.56  bench_io.CopyTo.time_copyto_dense
-     1.29±0.03μs          716±6ns     0.56  bench_array_coercion.ArrayCoercionSmall.time_asanyarray([1])
-     1.15±0.01μs          633±6ns     0.55  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(5)
-     1.04±0.01μs          572±7ns     0.55  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(array([5]))
-         896±4ns         483±10ns     0.54  bench_array_coercion.ArrayCoercionSmall.time_array_subok(array([5]))
-     9.72±0.04μs      5.22±0.02μs     0.54  bench_io.Copy.time_cont_assign('complex64')
-     1.13±0.02μs          601±8ns     0.53  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(1)
-         788±1ns          420±1ns     0.53  bench_array_coercion.ArrayCoercionSmall.time_array_subok(5)
-     9.71±0.07μs      5.17±0.04μs     0.53  bench_io.Copy.time_cont_assign('float64')
-       835±0.7ns          440±1ns     0.53  bench_array_coercion.ArrayCoercionSmall.time_array_subok(1)
-     1.42±0.01μs        705±0.5ns     0.50  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray([1])
-         942±5ns          463±2ns     0.49  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(array([5]))
-       833±0.4ns        381±0.5ns     0.46  bench_array_coercion.ArrayCoercionSmall.time_asarray(5)
-         890±3ns          402±3ns     0.45  bench_array_coercion.ArrayCoercionSmall.time_asarray(1)
-         892±2ns          401±4ns     0.45  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(1)
-       837±0.5ns        377±0.6ns     0.45  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(5)
-        1.12±0μs          463±7ns     0.41  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray(1)
-        1.07±0μs          444±3ns     0.41  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray(5)
-         434±3ns        179±0.6ns     0.41  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(array([5]))
-       436±0.9ns        178±0.2ns     0.41  bench_array_coercion.ArrayCoercionSmall.time_asarray(array([5]))
-       445±0.7ns        177±0.6ns     0.40  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray(array([5]))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@seberg seberg force-pushed the splitup-faster-argparsing-infrastructure branch from fc5cb20 to 7a8d798 Compare March 11, 2021 22:13
@seberg
Member Author

seberg commented Mar 11, 2021

One note here: the codecov failures are almost all related to errors that are only relevant when adding new functions using npy_argparse. I think the only exception is the integer parsing case, but I am pretty sure those are covered eventually in the later PRs.

@mattip
Member

mattip commented Mar 17, 2021

> I think the only exception is the integer parsing case, but I am pretty sure those are covered eventually in the later PRs.

Later PRs or later changesets in this PR? Coverage still claims the conversion errors are not hit.

This is a fast argument parser (an original version also supported
dictionary unpacking for the args, kwargs call style) which supports
the new FASTCALL and VECTORCALL conventions of CPython. Fastcall
is supported by all Python versions we support.

This drastically reduces the overhead of methods when keyword
arguments are passed.
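
For readers unfamiliar with the convention, the following is a minimal pure-Python model of how a `METH_FASTCALL | METH_KEYWORDS` function receives its arguments: positional and keyword values arrive in one flat array, `nargs` counts the positional ones, and `kwnames` is a tuple naming the trailing values. It is only a sketch and not the C code added in this PR; `parse_fastcall`, `demo`, and the parameter names are hypothetical, and missing-argument checks and converters are left out.

```python
def parse_fastcall(funcname, names, max_positional, args, nargs, kwnames=None):
    """Model of fastcall-style parsing (hypothetical helper, not NumPy API).

    names          : all accepted parameter names, in positional order
    max_positional : how many of them may be passed positionally
    args           : flat sequence, positionals first, then keyword values
    nargs          : number of positional values at the front of args
    kwnames        : tuple of keyword names aligned with args[nargs:], or None
    """
    if nargs > max_positional:
        raise TypeError(
            f"{funcname}() takes at most {max_positional} positional "
            f"arguments ({nargs} given)")

    # Positional values bind to the leading names, in order.
    parsed = dict(zip(names, args[:nargs]))

    # Keyword values sit after the positionals, in the order of kwnames.
    for offset, key in enumerate(kwnames or ()):
        if key not in names:
            raise TypeError(
                f"{funcname}() got an unexpected keyword argument '{key}'")
        if key in parsed:
            raise TypeError(
                f"{funcname}() got multiple values for argument '{key}'")
        parsed[key] = args[nargs + offset]
    return parsed


# Models the call demo(1, dtype="f8", order="C").
parsed = parse_fastcall(
    "demo", ("a", "dtype", "order"), 2, (1, "f8", "C"), 1, ("dtype", "order"))
print(parsed)  # {'a': 1, 'dtype': 'f8', 'order': 'C'}
```

In this layout the caller does not have to build a keyword dictionary for each call, which is presumably where much of the saving for keyword-heavy calls comes from.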
@seberg seberg force-pushed the splitup-faster-argparsing-infrastructure branch from 7a8d798 to b83847c Compare March 18, 2021 01:47
@seberg
Member Author

seberg commented Mar 18, 2021

I expect that most of this gets tested in the ufunc/np.array cleanup PR, but I added some explicit test cases that should cover most error-paths now (with the exception of the "programmer error" paths).

EDIT: Tried close/reopen, since codecov seemed to ignore the complete file now...

seberg added 2 commits March 17, 2021 21:01
Array methods are the easiest target for the new parser. They
do not require any larger improvements and most functions in
multiarray are wrapped. Similarly the printing functions
are not dispatched through `__array_function__` and thus have the
most to gain.
@seberg seberg force-pushed the splitup-faster-argparsing-infrastructure branch from b83847c to 9a45332 Compare March 18, 2021 02:02
@seberg seberg closed this Mar 18, 2021
@seberg seberg reopened this Mar 18, 2021

def test_invalid_integers():
    with pytest.raises(TypeError,
            match="integer argument expected, got float"):
Member

Is this test running? It should be triggering code coverage of the new functions (the error message comes from PyArray_PythonPyIntFromInt) but coverage claims otherwise. I wonder what is going on?

Member Author

Definitely gets run. It confuses me as well; the best idea I have is that the inclusion/compilation both with _multiarray_tests and with _multiarray_umath confuses the coverage (it now seems to report no coverage at all)? Which would mean that mem_overlap coverage should also be off, and that sounds a bit familiar.

@mattip mattip merged commit 267d49f into numpy:main Mar 18, 2021
@mattip
Member

mattip commented Mar 18, 2021

Thanks @seberg

@seberg
Member Author

seberg commented Mar 18, 2021

Oh wow, thanks, let me do some of the rebasing due to this, and probably write brief docs for the developers corner.

@seberg seberg deleted the splitup-faster-argparsing-infrastructure branch March 18, 2021 17:05