
ENH: Introduce tracer for enabled CPU targets on each optimized function#24420

Merged
charris merged 3 commits into numpy:main from seiko2plus:cpu_targets_tracer
Sep 5, 2023

Conversation

@seiko2plus
Member

@seiko2plus seiko2plus commented Aug 14, 2023

SIMD: Introduce tracer for enabled CPU targets on each optimized function

This update introduces a tracer mechanism that enables tracking of the enabled targets
for each optimized function in the NumPy library. With this enhancement,
it becomes possible to precisely monitor the enabled CPU dispatch
targets for the dispatched functions.

A new function named opt_func_info has been added to the numpy.lib.utils module,
offering this tracing capability. This function allows you to retrieve information
about the enabled targets based on function names and data type signatures.

Here's an example of how to use it:

>>> func_info = numpy.lib.utils.opt_func_info(func_name='add|abs', signature='float64|complex64')
>>> print(json.dumps(func_info, indent=2))
{
  "absolute": {
    "dd": {
      "current": "SSE41",
      "available": "SSE41 baseline(SSE SSE2 SSE3)"
    },
    "Ff": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "Dd": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  },
  "add": {
    "ddd": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "FFF": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  }
}
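The signature keys above ("dd", "Ff", "FFF", ...) are NumPy dtype character codes, one per operand (inputs followed by the output, as in ufunc type signatures). A quick way to decode them, sketched here assuming any recent NumPy install:

```python
import numpy as np

# Each character in a signature key is a NumPy dtype character code;
# np.dtype() can decode them, e.g. "Ff" is a complex64 -> float32 loop.
decoded = {sig: " ".join(np.dtype(c).name for c in sig)
           for sig in ("dd", "Ff", "Dd", "ddd", "FFF")}
for sig, names in decoded.items():
    print(sig, "->", names)
```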

To use the tracer, invoke the new NPY_CPU_DISPATCH_TRACE()
macro either before or after NPY_CPU_DISPATCH_CALL() when dispatching.
For more details, refer to the header
numpy/core/src/common/npy_cpu_dispatch.h.

As part of this solution, a new dictionary, __cpu_targets_info__, has been introduced within
the numpy.core._multiarray_umath module. This dictionary contains relevant data
about enabled targets for each optimized function.
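Since __cpu_targets_info__ is a plain dictionary, a query helper like opt_func_info amounts to regex filtering over its keys. A minimal sketch of that idea; the nested dict and filter_by_name below are hand-written stand-ins (values copied from the example output above), not NumPy's actual data or API:

```python
import json
import re

# Stand-in for numpy.core._multiarray_umath.__cpu_targets_info__;
# the entries are copied from the example output, not a live build.
cpu_targets_info = {
    "absolute": {"dd": {"current": "SSE41",
                        "available": "SSE41 baseline(SSE SSE2 SSE3)"}},
    "add": {"ddd": {"current": "FMA3__AVX2",
                    "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"}},
}

def filter_by_name(info, func_pattern):
    """Keep only the functions whose name matches the regex pattern."""
    pat = re.compile(func_pattern)
    return {name: sigs for name, sigs in info.items() if pat.search(name)}

print(json.dumps(filter_by_name(cpu_targets_info, "add|abs"), indent=2))
```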

As of now, the tracing mechanism covers ufunc-based functions, argmax, and argmin.
However, functions like sorting operations may require refactoring due to the
tracer's associated cost.
It's noteworthy that the tracer should be called only once, during the initialization
of Python C functions, to avoid performance regressions.

@seiko2plus seiko2plus added the 09 - Backport-Candidate (PRs tagged should be backported) and component: SIMD (Issues in SIMD (fast instruction sets) code or machinery) labels Aug 15, 2023
@seiko2plus seiko2plus force-pushed the cpu_targets_tracer branch 2 times, most recently from 46544dd to b4361f0 Compare August 15, 2023 00:11
@seiko2plus seiko2plus marked this pull request as ready for review August 15, 2023 00:11
@rgommers
Member

This looks pretty cool! Should make it a lot easier to figure out both what's available and what's being used.

A new function named opt_func_info has been added to the numpy.lib.utils module,

We may want to put this elsewhere, lib.utils won't stay in 2.0 as a public name I think. It could be lib.introspect perhaps? We wanted something like that IIRC.

However, functions like sorting operations may require refactoring due to the tracer's associated cost.

How significant is the runtime overhead?

@seiko2plus
Member Author

We may want to put this elsewhere, lib.utils won't stay in 2.0 as a public name I think. It could be lib.introspect perhaps? We wanted something like that IIRC.

Sounds good to me.

How significant is the runtime overhead?

The concern with sorting necessitates refactoring the runtime dispatch so that dispatching happens only when NumPy is loaded, similar to argmax and argmin. Refer to: link.

However, this can be handled via C++ static initialization to guarantee only one call, though the extra check may affect performance on small arrays. See: link.
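The one-call guarantee can be sketched as follows; this is a Python analogue of a C++ function-local static initializer, and traced_dispatch is an illustrative name, not NumPy's API:

```python
def traced_dispatch(impl, trace):
    """Wrap a dispatched implementation so the trace hook runs exactly
    once, on the first call -- mimicking a C++ function-local static
    initializer, which also runs its initializer only once."""
    state = {"traced": False}
    def wrapper(*args, **kwargs):
        if not state["traced"]:
            trace()                 # one-time cost, paid on first call only
            state["traced"] = True
        return impl(*args, **kwargs)
    return wrapper

calls = []
sort = traced_dispatch(sorted, lambda: calls.append("traced"))
sort([3, 1, 2])
sort([5, 4])
print(calls)  # the trace hook fired only once
```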

In general, it definitely impacts the speed of loading the numpy module. It makes several Python API calls to insert an entry into the global dict cpu_targets_info for each runtime dispatch. Refer to: link.

@charris
Member

charris commented Aug 23, 2023

Looks good. Could you document its use somewhere (I'm not sure where a good place would be) and add a release note. Maybe we need a new document section on tracing and performance tracking.

@seiko2plus seiko2plus added the 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) label Aug 24, 2023
@seiko2plus
Member Author

Maybe we need a new document section on tracing and performance tracking.

Agreed, for now I'm going to add it under https://numpy.org/doc/stable/reference/simd/build-options.html#runtime-dispatch.

@mattip
Member

mattip commented Aug 30, 2023

This is definitely a good step in the right direction: it tells which features will be used with which dtypes to determine runtime dispatching. But it is not the whole story, since additionally there are choices made due to memory overlap (rare) or strides (contiguous/non-contiguous) or shape (too small for BLAS/SIMD, square arrays and 1d arrays use a different path through matmul). I was dreaming of a decorator that would report exactly which loop was used in a particular ufunc call. Maybe that is too hard to do.

It would also be nice to get a report when BLAS is used, which might have helped debug #24512.

@rgommers
Member

We touched upon this in the community meeting today - np.lib.introspect seemed good to everyone, so we can go ahead with that here.

@rgommers rgommers added this to the 2.0.0 release milestone Aug 30, 2023
@seiko2plus seiko2plus force-pushed the cpu_targets_tracer branch 4 times, most recently from 17316e9 to d451f0b Compare August 31, 2023 05:22
@seiko2plus
Member Author

seiko2plus commented Aug 31, 2023

But it is not the whole story, since additionally there are choices made due to memory overlap (rare) or strides (contiguous/non-contiguous) or shape (too small for BLAS/SIMD, square arrays and 1d arrays use a different path through matmul).

This pull request is aimed at tracking the enabled CPU targets, rather than delving into debugging inner SIMD branches. The compiler retains the ability to optimize scalar operations by utilizing native instructions compatible with the enabled targets. For instance, FMA native operations matter for tracking precision loss/gain.

However, the capability to track such branches could be added via a specialized build option, such as -Dtrack-simd-regions. Nonetheless, this option would not be suitable for release builds due to performance regressions.

I was dreaming of a decorator that would report exactly which loop was used in a particular ufunc call. Maybe that is too hard to do.

To precisely identify the SIMD branches taken for given arguments, generating a backtrace would be required; not too hard, but again not suitable for release builds.

It would also be nice to get a report when BLAS is used, which might have helped debug #24512.

Yes, that's possible, but is it already covered by show_config()?

@charris
Member

charris commented Sep 2, 2023

Need rebase.

separated header

  This should be removed once we drop the support of distutils
…tion

@seiko2plus
Member Author

Need rebase.

done

@charris charris removed the 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) label Sep 4, 2023
@charris charris merged commit 5ffeef1 into numpy:main Sep 5, 2023
@charris
Member

charris commented Sep 5, 2023

Thanks Sayed. The backport of this looks to be tricky due to all the file renames in main, so it might not make it into 1.26.

@charris charris removed the 09 - Backport-Candidate (PRs tagged should be backported) label Sep 5, 2023
@charris charris changed the title SIMD: Introduce tracer for enabled CPU targets on each optimized function MAINT: Introduce tracer for enabled CPU targets on each optimized function Sep 5, 2023
@charris charris changed the title MAINT: Introduce tracer for enabled CPU targets on each optimized function ENH: Introduce tracer for enabled CPU targets on each optimized function Sep 5, 2023
@rgommers
Member

rgommers commented Sep 5, 2023

This is a new feature, so it probably wasn't right to backport it to 1.26.0 anyway - we were aiming for zero changes beyond build system swap and bug fixes.

@charris
Member

charris commented Sep 5, 2023

I'll leave this for 2.0.0.


Labels

01 - Enhancement · component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
