ENH: Introduce tracer for enabled CPU targets on each optimized function #24420
charris merged 3 commits into numpy:main
Conversation
(force-pushed 2579bd9 to 105feaa)
(force-pushed 46544dd to b4361f0)
This looks pretty cool! Should make it a lot easier to figure out both what's available and what's being used.
We may want to put this elsewhere,
How significant is the runtime overhead?
Sounds good to me.
Regarding sorting, the concern is that it would require refactoring the runtime dispatch so that dispatching happens only once, during the load of NumPy, similar to argmax and argmin. Refer to: link. Alternatively, this could be handled via C++ static initialization to guarantee only one call, which may affect small arrays. See: link. In general, it definitely impacts the speed of loading the numpy module: for each runtime dispatch it makes several Python API calls to insert an entry into the global dict cpu_targets_info. Refer to: link.
Looks good. Could you document its use somewhere (I'm not sure where a good place would be) and add a release note. Maybe we need a new document section on tracing and performance tracking.
Agreed, for now I'm going to add it under https://numpy.org/doc/stable/reference/simd/build-options.html#runtime-dispatch.
This is definitely a good step in the right direction: it tells which features will be used with which dtypes to determine runtime dispatching. But it is not the whole story, since additionally there are choices made due to memory overlap (rare) or strides (contiguous/non-contiguous) or shape (too small for BLAS/SIMD, square arrays and 1d arrays use a different path through matmul). I was dreaming of a decorator that would report exactly which loop was used in a particular ufunc call. Maybe that is too hard to do. It would also be nice to get a report when BLAS is used, which might have helped debug #24512.
We touched upon this in the community meeting today -
(force-pushed 17316e9 to d451f0b)
This pull request aims to track the enabled CPU targets, rather than delve into debugging inner SIMD branches. The compiler retains the ability to optimize scalar operations by utilizing native instructions compatible with the enabled targets; for instance, native FMA operations matter for tracking precision loss/gain. However, the capability to track such branches could be introduced by providing a specialized build option, such as
To precisely identify the SIMD branches taken based on the arguments, generating a backtrace would be required; not too hard, but again not suitable for a release build.
Yes, that's possible, but is it already covered within
(force-pushed d451f0b to 96909b9)
Needs a rebase.
separated header: this should be removed once we drop support for distutils
…tion
This update introduces a tracer mechanism that enables tracking of the enabled targets
for each optimized function in the NumPy library. With this enhancement,
it becomes possible to precisely monitor the enabled CPU dispatch
targets for the dispatched functions.
A new function named `opt_func_info` has been added to the new `numpy.lib.introspect` module,
offering this tracing capability. This function allows you to retrieve information
about the enabled targets based on function names and data-type signatures.
Here's an example of how to use it:
```python
>>> import json, numpy
>>> func_info = numpy.lib.introspect.opt_func_info(func_name='add|abs', signature='float64|complex64')
>>> print(json.dumps(func_info, indent=2))
{
"absolute": {
"dd": {
"current": "SSE41",
"available": "SSE41 baseline(SSE SSE2 SSE3)"
},
"Ff": {
"current": "FMA3__AVX2",
"available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
},
"Dd": {
"current": "FMA3__AVX2",
"available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
}
},
"add": {
"ddd": {
"current": "FMA3__AVX2",
"available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
},
"FFF": {
"current": "FMA3__AVX2",
"available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
}
}
}
```
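The mapping returned above is a plain nested dict, so it can be post-processed with ordinary dict operations. The sketch below hardcodes a fragment of the output shown (sample data copied from the example, not a NumPy API) to pull out the "current" target per (function, signature) pair.

```python
# Sketch: summarize the "current" target per (function, signature) pair
# from an opt_func_info-style mapping. The sample data below is copied
# from the example output above.
func_info = {
    "add": {
        "ddd": {"current": "FMA3__AVX2",
                "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"},
        "FFF": {"current": "FMA3__AVX2",
                "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"},
    },
}

# Flatten the nested structure into {(name, signature): current_target}.
current = {
    (name, sig): info["current"]
    for name, sigs in func_info.items()
    for sig, info in sigs.items()
}
```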
For tracer utilization, remember to invoke the new `NPY_CPU_DISPATCH_TRACE()`
macro either before or after employing `NPY_CPU_DISPATCH_CALL()` for dispatching.
For more clarification, please refer to the header
`numpy/core/src/common/npy_cpu_dispatch.h`.
As part of this solution, a new dictionary, `__cpu_targets_info__`, has been introduced within
the `numpy.core._multiarray_umath` module. This dictionary contains relevant data
about enabled targets for each optimized function.
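Given that the example passes `func_name='add|abs'`, the name and signature arguments appear to be regular-expression patterns. Under that assumption, `opt_func_info` could plausibly be a thin regex filter over a `__cpu_targets_info__`-style dict; the following is a speculative sketch (names and structure are assumptions based on the example output), not NumPy's actual implementation.

```python
import re

# Speculative, simplified stand-in for a __cpu_targets_info__-style dict.
targets_info = {
    "absolute": {"dd": {"current": "SSE41"}},
    "add": {"ddd": {"current": "FMA3__AVX2"}},
    "sqrt": {"dd": {"current": "SSE41"}},
}

def opt_func_info_sketch(info, func_name=".*"):
    # Keep only the functions whose name matches the regex pattern,
    # mirroring how func_name='add|abs' selects both "add" and "absolute".
    pattern = re.compile(func_name)
    return {name: sigs for name, sigs in info.items() if pattern.match(name)}
```

A real implementation would also filter on the dtype signature, but the regex-over-names idea is the part the example output makes visible.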
As of now, the tracing mechanism covers ufunc-based functions, `argmax`, and `argmin`.
However, functions like sorting operations may require refactoring due to the
tracer's associated cost.
It's noteworthy that the tracer should be called only once during the initialization
of Python C functions to avoid performance regressions.
(force-pushed 96909b9 to 0972e6a)
done
Thanks Sayed. The backport of this looks to be tricky due to all the file renames in main, so it might not make it into 1.26.
This is a new feature, so it probably wasn't right to backport it to 1.26.0 anyway - we were aiming for zero changes beyond the build system swap and bug fixes.
I'll leave this for 2.0.0. |
SIMD: Introduce tracer for enabled CPU targets on each optimized function
This update introduces a tracer mechanism that enables tracking of the enabled targets
for each optimized function in the NumPy library. With this enhancement,
it becomes possible to precisely monitor the enabled CPU dispatch
targets for the dispatched functions.
A new function named `opt_func_info` has been added to the `numpy.lib.utils` module,
offering this tracing capability. This function allows you to retrieve information
about the enabled targets based on function names and data-type signatures.
For tracer utilization, remember to invoke the new `NPY_CPU_DISPATCH_TRACE()`
macro either before or after employing `NPY_CPU_DISPATCH_CALL()` for dispatching.
For more clarification, please refer to the header
`numpy/core/src/common/npy_cpu_dispatch.h`.
As part of this solution, a new dictionary, `__cpu_targets_info__`, has been introduced within
the `numpy.core._multiarray_umath` module. This dictionary contains relevant data
about enabled targets for each optimized function.
As of now, the tracing mechanism covers ufunc-based functions, `argmax`, and `argmin`.
However, functions like sorting operations may require refactoring due to the
tracer's associated cost.
It's noteworthy that the tracer should be called only once during the initialization
of Python C functions to avoid performance regressions.