Skip to content

MAINT,API: ignore and NULL fasttake/fastputmask ArrFuncs slots#14942

Merged
mattip merged 3 commits intonumpy:masterfrom
seberg:clean-fasttake
Jan 16, 2020
Merged

MAINT,API: ignore and NULL fasttake/fastputmask ArrFuncs slots#14942
mattip merged 3 commits intonumpy:masterfrom
seberg:clean-fasttake

Conversation

@seberg
Copy link
Member

@seberg seberg commented Nov 19, 2019

Always NULL the fasttake slot, and instead move the optimized versions
by doing itemsize specialization in the Take function itself.


I still need to check that there is now speed regression.

@seberg seberg changed the title MAINT,API: Do not use fasttake slot of ArrFuncs and null it MAINT,API: Do not use fasttake slot of ArrFuncs and NULL it Nov 19, 2019
@seberg seberg force-pushed the clean-fasttake branch 3 times, most recently from 857b0cd to 8d7cad0 Compare November 19, 2019 22:11
@seberg
Copy link
Member Author

seberg commented Nov 19, 2019

Think I addressed those, will squash in a bit. Maybe it is due to debug build, but now changed it to split up on chunk (because why not, it just speeds up certain cases). And it seems memmove is much faster even for 3 bools (which miss the specialized loops), due to memmove rather than a loop for 0 to 3. So overall, it seems like this is a pretty big speed win actually, plus it makes the error message cleaner...

@seberg
Copy link
Member Author

seberg commented Nov 19, 2019

I want to do the same thing for fastput, planning to do a new PR for that though right now.

@seberg seberg changed the title MAINT,API: Do not use fasttake slot of ArrFuncs and NULL it MAINT,API: Do not use fasttake/fastputmask slots of ArrFuncs and NULL them Nov 20, 2019
@seberg seberg changed the title MAINT,API: Do not use fasttake/fastputmask slots of ArrFuncs and NULL them MAINT,API: NULL fasttake/fastputmask slots of ArrFuncs and ignore them Nov 20, 2019
@seberg seberg changed the title MAINT,API: NULL fasttake/fastputmask slots of ArrFuncs and ignore them MAINT,API: ignore and NULL fasttake/fastputmask slots of ArrFuncs Nov 20, 2019
@seberg seberg changed the title MAINT,API: ignore and NULL fasttake/fastputmask slots of ArrFuncs MAINT,API: ignore and NULL fasttake/fastputmask ArrFuncs slots Nov 20, 2019
@seberg
Copy link
Member Author

seberg commented Nov 20, 2019

The current timings are below. Basically, take is slower for 3 elements (which is too bad, since it was a strength of the old take compared to advanced indexing. That could probably be resolved by putting in the loop over the inner-size instead of the large chunk memmove.

For some reason complex256 is optimized much better in take (maybe alignment? maybe just compiler fluctuations). fastputmask is a bit slower, not sure why, but usually it is not a huge amount. And it is not a very common function.

Details
       before           after         ratio
     [01123f61]       [f75bf201]
     <clean-fasttake~3>                 
+      6.42±0.1μs      10.3±0.05μs     1.60  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex256')
+      3.62±0.1μs      5.64±0.03μs     1.56  bench_itemselection.PutMask.time_dense(False, 'complex256')
+      6.16±0.1μs       9.46±0.1μs     1.54  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex256')
+      6.73±0.1μs      10.3±0.07μs     1.53  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex256')
+      4.00±0.2μs      5.97±0.03μs     1.49  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex256')
+     4.14±0.08μs      6.02±0.05μs     1.46  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex256')
+     3.84±0.05μs      5.53±0.03μs     1.44  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex256')
+     6.18±0.09μs      7.94±0.02μs     1.28  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int16')
+      8.39±0.2μs       10.2±0.3μs     1.21  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex256')
       6.14±0.9μs      7.40±0.02μs    ~1.20  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float16')
       8.37±0.4μs       10.1±0.2μs    ~1.20  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'complex256')
+     5.87±0.04μs      6.97±0.02μs     1.19  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int32')
+     6.06±0.09μs       7.10±0.2μs     1.17  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex256')
+     6.04±0.05μs      7.04±0.04μs     1.17  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex64')
         8.92±2μs       10.4±0.2μs    ~1.16  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex256')
+     6.16±0.06μs      7.09±0.07μs     1.15  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex256')
+      7.04±0.1μs      7.95±0.02μs     1.13  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float16')
+     6.38±0.09μs       7.16±0.1μs     1.12  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex256')
       6.29±0.2μs      7.06±0.02μs    ~1.12  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int64')
+     2.05±0.01μs         2.28±0μs     1.11  bench_itemselection.PutMask.time_dense(True, 'float16')
+        2.64±0μs      2.93±0.01μs     1.11  bench_itemselection.PutMask.time_dense(False, 'int64')
+        2.06±0μs         2.28±0μs     1.11  bench_itemselection.PutMask.time_dense(True, 'float32')
+        2.10±0μs      2.31±0.01μs     1.10  bench_itemselection.PutMask.time_dense(True, 'int64')
+     2.06±0.01μs         2.27±0μs     1.10  bench_itemselection.PutMask.time_dense(True, 'int16')
      2.65±0.03μs      2.90±0.01μs     1.10  bench_itemselection.PutMask.time_dense(False, 'complex64')
      2.58±0.01μs      2.80±0.01μs     1.08  bench_itemselection.PutMask.time_dense(False, 'int16')
      2.58±0.01μs      2.80±0.01μs     1.08  bench_itemselection.PutMask.time_dense(False, 'float16')
       2.69±0.2μs      2.90±0.02μs     1.08  bench_itemselection.PutMask.time_dense(False, 'float64')
      2.60±0.01μs         2.81±0μs     1.08  bench_itemselection.PutMask.time_dense(False, 'float32')
      2.60±0.01μs         2.80±0μs     1.08  bench_itemselection.PutMask.time_dense(False, 'int32')
      4.91±0.04μs      5.28±0.05μs     1.07  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex128')
       5.18±0.2μs      5.55±0.05μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex128')
      6.06±0.06μs      6.47±0.04μs     1.07  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'complex64')
       6.13±0.2μs      6.55±0.06μs     1.07  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float64')
       7.04±0.4μs      7.39±0.02μs     1.05  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float16')
      7.05±0.03μs      7.39±0.02μs     1.05  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int16')
      3.32±0.03μs      3.47±0.03μs     1.05  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex128')
       6.14±0.6μs      6.40±0.04μs     1.04  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float32')
         2.56±0μs      2.62±0.01μs     1.02  bench_itemselection.PutMask.time_dense(True, 'complex128')
       6.31±0.2μs      6.44±0.08μs     1.02  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int32')
         2.33±0μs      2.37±0.01μs     1.02  bench_itemselection.PutMask.time_sparse(False, 'complex256')
      2.96±0.04μs      3.02±0.05μs     1.02  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float16')
      2.99±0.03μs      3.04±0.03μs     1.02  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int16')
      4.33±0.03μs      4.35±0.02μs     1.00  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int16')
      4.36±0.02μs      4.37±0.03μs     1.00  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float16')
       7.03±0.1μs      7.04±0.07μs     1.00  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float64')
      4.43±0.04μs      4.43±0.04μs     1.00  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex64')
       4.83±0.2μs      4.80±0.02μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int16')
      3.20±0.03μs      3.18±0.01μs     0.99  bench_itemselection.PutMask.time_dense(False, 'complex128')
      3.23±0.02μs      3.20±0.03μs     0.99  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float16')
      4.43±0.02μs      4.39±0.02μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int64')
      4.84±0.02μs      4.80±0.01μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float16')
      4.98±0.02μs      4.92±0.04μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int64')
      4.97±0.05μs      4.91±0.02μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float64')
      3.27±0.02μs      3.23±0.02μs     0.99  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int16')
         2.33±0μs      2.30±0.01μs     0.99  bench_itemselection.PutMask.time_dense(True, 'float64')
      4.39±0.02μs      4.34±0.03μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float32')
      3.08±0.04μs      3.04±0.01μs     0.99  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex64')
      4.41±0.03μs      4.35±0.02μs     0.99  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float64')
      2.31±0.01μs      2.28±0.01μs     0.99  bench_itemselection.PutMask.time_dense(True, 'int32')
         2.31±0μs      2.27±0.01μs     0.98  bench_itemselection.PutMask.time_sparse(False, 'float64')
      4.92±0.02μs      4.84±0.04μs     0.98  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float32')
      4.96±0.04μs      4.86±0.01μs     0.98  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex64')
      2.35±0.01μs      2.30±0.01μs     0.98  bench_itemselection.PutMask.time_dense(True, 'complex64')
      4.95±0.04μs      4.84±0.03μs     0.98  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int32')
      3.35±0.03μs      3.26±0.04μs     0.97  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float64')
      3.09±0.02μs      3.00±0.03μs     0.97  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int64')
      2.34±0.01μs         2.27±0μs     0.97  bench_itemselection.PutMask.time_sparse(False, 'int16')
      3.35±0.03μs      3.23±0.02μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float32')
      3.33±0.03μs      3.21±0.03μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int32')
      3.50±0.02μs      3.37±0.08μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex128')
      3.12±0.03μs      3.00±0.05μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float64')
      3.38±0.04μs      3.25±0.02μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex64')
      3.38±0.02μs      3.24±0.03μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int64')
      3.08±0.03μs      2.96±0.03μs     0.96  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float32')
      7.25±0.09μs      6.90±0.04μs     0.95  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float32')
         7.83±2μs      7.41±0.08μs     0.95  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int16')
      5.35±0.04μs      5.04±0.03μs     0.94  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex128')
      3.17±0.02μs      2.94±0.02μs     0.93  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int32')
      4.67±0.03μs      4.32±0.03μs     0.93  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int32')
      3.34±0.03μs      3.07±0.04μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex128')
-     3.05±0.04μs      2.75±0.04μs     0.90  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float64')
-     6.77±0.05μs      6.04±0.04μs     0.89  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex128')
-     4.98±0.05μs      4.44±0.07μs     0.89  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex128')
-     3.09±0.02μs      2.75±0.03μs     0.89  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int64')
-     2.98±0.03μs      2.65±0.05μs     0.89  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int16')
-     3.10±0.02μs      2.75±0.02μs     0.89  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex64')
-     4.39±0.04μs      3.89±0.02μs     0.89  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float64')
         7.42±2μs      6.57±0.04μs    ~0.88  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex64')
-     2.96±0.04μs         2.62±0μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float16')
-     3.05±0.01μs      2.69±0.02μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float32')
-        2.29±0μs      2.00±0.01μs     0.87  bench_itemselection.PutMask.time_sparse(True, 'complex128')
-     4.44±0.02μs      3.86±0.03μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex64')
-     2.30±0.01μs      2.00±0.02μs     0.87  bench_itemselection.PutMask.time_sparse(True, 'complex64')
-     4.31±0.03μs      3.75±0.05μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int16')
-     4.51±0.04μs      3.92±0.07μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int64')
-     4.33±0.09μs      3.76±0.02μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float16')
      7.26±0.05μs       6.27±0.1μs    ~0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'complex128')
-     2.31±0.02μs         1.99±0μs     0.86  bench_itemselection.PutMask.time_sparse(True, 'int16')
-     4.42±0.06μs      3.82±0.02μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float32')
-     7.01±0.07μs      6.04±0.06μs     0.86  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex128')
-     3.08±0.01μs      2.65±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int32')
-     2.31±0.03μs         1.99±0μs     0.86  bench_itemselection.PutMask.time_sparse(True, 'int32')
-     2.33±0.04μs         1.99±0μs     0.86  bench_itemselection.PutMask.time_sparse(True, 'float64')
-      4.49±0.2μs      3.83±0.02μs     0.85  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int32')
         7.61±2μs      6.47±0.02μs    ~0.85  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int32')
-     2.38±0.01μs      2.00±0.01μs     0.84  bench_itemselection.PutMask.time_sparse(True, 'longfloat')
-        2.80±0μs      2.27±0.01μs     0.81  bench_itemselection.PutMask.time_sparse(False, 'int32')
-        8.10±1μs      6.51±0.05μs     0.80  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float64')
-        7.04±0μs      5.64±0.04μs     0.80  bench_itemselection.PutMask.time_dense(True, 'complex256')
-        8.41±1μs      6.55±0.04μs     0.78  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int64')
-     2.99±0.02μs      2.27±0.01μs     0.76  bench_itemselection.PutMask.time_sparse(False, 'longfloat')
-        8.54±2μs      6.47±0.02μs     0.76  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float32')
-     3.02±0.01μs         2.28±0μs     0.75  bench_itemselection.PutMask.time_sparse(False, 'int64')
-     3.02±0.01μs         2.27±0μs     0.75  bench_itemselection.PutMask.time_sparse(False, 'complex64')
-     3.02±0.01μs      2.27±0.01μs     0.75  bench_itemselection.PutMask.time_sparse(False, 'complex128')
-      8.72±0.9μs      6.54±0.04μs     0.75  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int64')
-     3.05±0.02μs      2.29±0.01μs     0.75  bench_itemselection.PutMask.time_sparse(False, 'float16')
         8.67±2μs      6.45±0.02μs    ~0.74  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex128')
-     3.07±0.01μs      2.28±0.01μs     0.74  bench_itemselection.PutMask.time_sparse(False, 'float32')
-      5.08±0.2μs      3.45±0.06μs     0.68  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex64')
-     4.97±0.04μs      3.34±0.02μs     0.67  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float64')
-     3.10±0.02μs         2.08±0μs     0.67  bench_itemselection.PutMask.time_sparse(True, 'complex256')
-     9.09±0.05μs      6.05±0.03μs     0.67  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'longfloat')
-     3.01±0.01μs         1.99±0μs     0.66  bench_itemselection.PutMask.time_sparse(True, 'float16')
-      9.16±0.7μs      6.04±0.07μs     0.66  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'longfloat')
-     3.03±0.01μs      1.99±0.01μs     0.66  bench_itemselection.PutMask.time_sparse(True, 'float32')
-     5.21±0.05μs      3.41±0.03μs     0.65  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex64')
-      3.06±0.2μs         1.99±0μs     0.65  bench_itemselection.PutMask.time_sparse(True, 'int64')
-     5.36±0.04μs      3.44±0.04μs     0.64  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int64')
-     5.38±0.02μs      3.43±0.04μs     0.64  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'longfloat')
-     5.57±0.07μs      3.49±0.05μs     0.63  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int64')
-     5.21±0.03μs      3.25±0.03μs     0.62  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int32')
-     4.98±0.04μs      3.09±0.07μs     0.62  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float64')
-     5.65±0.02μs      3.46±0.03μs     0.61  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float64')
-     5.33±0.03μs      3.22±0.02μs     0.60  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int16')
-     4.99±0.05μs      3.00±0.04μs     0.60  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int16')
-     5.09±0.04μs      3.03±0.01μs     0.60  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int32')
-     5.05±0.04μs      2.98±0.03μs     0.59  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex64')
-     4.45±0.01μs      2.63±0.01μs     0.59  bench_itemselection.PutMask.time_dense(True, 'longfloat')
-     9.01±0.06μs      5.28±0.03μs     0.59  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'longfloat')
-        11.8±2μs      6.84±0.03μs     0.58  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex128')
-      12.0±0.1μs      6.77±0.06μs     0.56  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'longfloat')
-     4.96±0.01μs      2.70±0.03μs     0.54  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int32')
-     5.01±0.03μs      2.72±0.02μs     0.54  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float32')
-     4.98±0.02μs      2.66±0.03μs     0.53  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float16')
-     5.87±0.03μs      3.07±0.03μs     0.52  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'longfloat')
-     5.91±0.05μs      3.07±0.04μs     0.52  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int64')
-     10.7±0.06μs      5.52±0.03μs     0.52  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'longfloat')
-     12.3±0.06μs      6.28±0.08μs     0.51  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'longfloat')
-     6.52±0.04μs      3.23±0.04μs     0.50  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float16')
-     6.53±0.03μs      3.19±0.02μs     0.49  bench_itemselection.PutMask.time_dense(False, 'longfloat')
-     6.88±0.03μs      3.35±0.04μs     0.49  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'longfloat')
-     6.97±0.04μs      3.29±0.02μs     0.47  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float32')
-      6.39±0.1μs      2.95±0.01μs     0.46  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float16')
-     9.98±0.04μs      4.44±0.04μs     0.44  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'longfloat')
-      6.39±0.4μs      2.68±0.04μs     0.42  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int16')
-      12.0±0.2μs      5.00±0.04μs     0.42  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'longfloat')
-        17.3±4μs      6.49±0.03μs     0.38  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'longfloat')
-        8.42±1μs      3.03±0.03μs     0.36  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float32')

@charris charris added 03 - Maintenance component: numpy._core 63 - C API Changes or additions to the C API. Mailing list should usually be notified. labels Nov 20, 2019
@seberg seberg force-pushed the clean-fasttake branch 2 times, most recently from f02ee2f to 6e32571 Compare November 20, 2019 23:04
@seberg
Copy link
Member Author

seberg commented Nov 20, 2019

OK, I hope this is more or less converging. There is a slightly slowdown for certain takes, but I am not too concerned, and it could be adapted later if necessary. The complex256 slowdown is somewhat annoying, but I do not think it really matters much and may well just get magically fixed with other compilers or so.

@seberg
Copy link
Member Author

seberg commented Nov 21, 2019

Hmmm, the azure error seems real. While float96 should not be optimized, something goes horribly wrong. Does memmove have alignment requirements, or is this a compiler bug with the -O3 optimization level? Grrr....

@seberg
Copy link
Member Author

seberg commented Nov 21, 2019

Oooops, it was a stupid typo... The issue was only that none of the tests hit the default path, except on the 32bit (actually linux) platform.

seberg added 2 commits January 7, 2020 12:19
Always NULL the fasttake slot, and instead move the optimized versions
by doing itemsize specialization in the Take function itself.
This removes the use of the fastputmask ArrFuncs slot from within
NumPy and always leaves it NULL.
@seberg seberg added triage review Issue/PR to be discussed at the next triage meeting triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Jan 15, 2020
@seberg
Copy link
Member Author

seberg commented Jan 15, 2020

From the call: It sounds like nobody is too concerned about ignoring these slots and a slight difference in speed trade-offs (and I guess if someone notices slowdowns we can optimize). Technical review awaiting of course.

Fasttake and fastputmask slots are deprecated and NULL'ed
---------------------------------------------------------
The fasttake and fastputmask slots are now never used internally.
They will always set to NULL for builtin dtypes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"They will always be NULL" or "They will always be set to NULL ..."

@mattip
Copy link
Member

mattip commented Jan 15, 2020

In PyArray_RegisterDataType, maybe error out if the fields are not NULL?

@seberg
Copy link
Member Author

seberg commented Jan 15, 2020

Sounds like a good idea, although maybe a warning (forever) that it is just ignored is just as good? Removes any possible breaking, and at some point we have to deprecate that whole function anyway...

@mattip
Copy link
Member

mattip commented Jan 16, 2020

any possible breaking

Up to you. I would argue that

  • anyone taking the effort to write a function that is then ignored might appreciate getting an error rather than a warning
  • it is hard to find a non-existent call to their function in the code base, leading to a hard-to-debug situation where their code is never called by NumPy but may be called by some third party package

When we remove the functions the error will go away too.

return -1;
}

if (test_deprecated_arrfuncs_members(f) < 0) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put into its own helper in anticiapation of more functions joining. In particular fastclip is already deprecated at execution time, and should be added here as well now (as well as get the same documentation updates). Planning to do that once this is in.

@seberg
Copy link
Member Author

seberg commented Jan 16, 2020

Testing this seems a bit annoying, so I only tested all the cases manually right now... I can try to think of a way to test it reasonably...

@seberg
Copy link
Member Author

seberg commented Jan 16, 2020

I opted for warning, the alternative is to use the clip warning at the place where the calculation happens, but I do not think that makes sense here: chances are you are not actually doing that calculation, and the results should never change. You are right for someone who writes a dtype, I am thinking of the possibility of someone only using one.

As mentioned above: I think clip should be deprecated in both places (here would be enough I suppose, but I do not feel like removing the other one), I will do that as a followup.

@mattip mattip merged commit feb0794 into numpy:master Jan 16, 2020
@mattip
Copy link
Member

mattip commented Jan 16, 2020

Thanks @seberg

@seberg seberg deleted the clean-fasttake branch January 16, 2020 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

03 - Maintenance 63 - C API Changes or additions to the C API. Mailing list should usually be notified. component: numpy._core triaged Issue/PR that was discussed in a triage meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants