
ENH: Introduce tracer for enabled CPU targets on each optimized function#24420

Merged
charris merged 3 commits into numpy:main from seiko2plus:cpu_targets_tracer
Sep 5, 2023

Conversation

@seiko2plus
Member

@seiko2plus seiko2plus commented Aug 14, 2023

SIMD: Introduce tracer for enabled CPU targets on each optimized function

This update introduces a tracer mechanism that enables tracking of the enabled targets
for each optimized function in the NumPy library. With this enhancement,
it becomes possible to precisely monitor the enabled CPU dispatch
targets for the dispatched functions.

A new function named opt_func_info has been added to the numpy.lib.utils module,
offering this tracing capability. This function allows you to retrieve information
about the enabled targets based on function names and data type signatures.

Here's an example of how to use it:

>>> func_info = numpy.lib.utils.opt_func_info(func_name='add|abs', signature='float64|complex64')
>>> print(json.dumps(func_info, indent=2))
{
  "absolute": {
    "dd": {
      "current": "SSE41",
      "available": "SSE41 baseline(SSE SSE2 SSE3)"
    },
    "Ff": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "Dd": {
      "current": "FMA3__AVX2",
      "available": "AVX512F FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  },
  "add": {
    "ddd": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    },
    "FFF": {
      "current": "FMA3__AVX2",
      "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"
    }
  }
}
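The signature keys above ("dd", "Ff", "FFF", ...) are NumPy dtype character codes, one per operand (inputs followed by the output, as in ufunc type signatures). A quick way to decode them, sketched here assuming any recent NumPy install:

```python
import numpy as np

# Each character in a signature key is a NumPy dtype character code;
# np.dtype() can decode them, e.g. "Ff" is a complex64 -> float32 loop.
decoded = {sig: " ".join(np.dtype(c).name for c in sig)
           for sig in ("dd", "Ff", "Dd", "ddd", "FFF")}
for sig, names in decoded.items():
    print(sig, "->", names)
```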

To use the tracer, invoke the new NPY_CPU_DISPATCH_TRACE()
macro either before or after NPY_CPU_DISPATCH_CALL() when dispatching.
For more details, refer to the header
numpy/core/src/common/npy_cpu_dispatch.h.

As part of this solution, a new dictionary, __cpu_targets_info__, has been introduced within
the numpy.core._multiarray_umath module. This dictionary contains relevant data
about enabled targets for each optimized function.
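Since __cpu_targets_info__ is a plain dictionary, a query helper like opt_func_info amounts to regex filtering over its keys. A minimal sketch of that idea; the nested dict and filter_by_name below are hand-written stand-ins (values copied from the example output above), not NumPy's actual data or API:

```python
import json
import re

# Stand-in for numpy.core._multiarray_umath.__cpu_targets_info__;
# the entries are copied from the example output, not a live build.
cpu_targets_info = {
    "absolute": {"dd": {"current": "SSE41",
                        "available": "SSE41 baseline(SSE SSE2 SSE3)"}},
    "add": {"ddd": {"current": "FMA3__AVX2",
                    "available": "FMA3__AVX2 baseline(SSE SSE2 SSE3)"}},
}

def filter_by_name(info, func_pattern):
    """Keep only the functions whose name matches the regex pattern."""
    pat = re.compile(func_pattern)
    return {name: sigs for name, sigs in info.items() if pat.search(name)}

print(json.dumps(filter_by_name(cpu_targets_info, "add|abs"), indent=2))
```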

As of now, the tracing mechanism covers ufunc-based functions, argmax, and argmin.
However, functions like sorting operations may require refactoring due to the
tracer's associated cost.
It's noteworthy that the tracer should be called only once, during the initialization
of Python C functions, to avoid performance regressions.

@seiko2plus seiko2plus added the 09 - Backport-Candidate (PRs tagged should be backported) and component: SIMD (Issues in SIMD (fast instruction sets) code or machinery) labels Aug 15, 2023
@seiko2plus seiko2plus force-pushed the cpu_targets_tracer branch 2 times, most recently from 46544dd to b4361f0 Compare August 15, 2023 00:11
@seiko2plus seiko2plus marked this pull request as ready for review August 15, 2023 00:11
@rgommers
Member

This looks pretty cool! Should make it a lot easier to figure out both what's available and what's being used.

A new function named opt_func_info has been added to the numpy.lib.utils module,

We may want to put this elsewhere, lib.utils won't stay in 2.0 as a public name I think. It could be lib.introspect perhaps? We wanted something like that IIRC.

However, functions like sorting operations may require refactoring due to the tracer's associated cost.

How significant is the runtime overhead?

@seiko2plus
Member Author

We may want to put this elsewhere, lib.utils won't stay in 2.0 as a public name I think. It could be lib.introspect perhaps? We wanted something like that IIRC.

Sounds good to me.

How significant is the runtime overhead?

The concern with sorting necessitates refactoring the runtime dispatch so that dispatching happens only when NumPy is loaded, similar to argmax and argmin. Refer to: link.

However, this can be handled via C++ static initialization to guarantee only one call, though the extra check may affect performance on small arrays. See: link.
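The one-call guarantee can be sketched as follows; this is a Python analogue of a C++ function-local static initializer, and traced_dispatch is an illustrative name, not NumPy's API:

```python
def traced_dispatch(impl, trace):
    """Wrap a dispatched implementation so the trace hook runs exactly
    once, on the first call -- mimicking a C++ function-local static
    initializer, which also runs its initializer only once."""
    state = {"traced": False}
    def wrapper(*args, **kwargs):
        if not state["traced"]:
            trace()                 # one-time cost, paid on first call only
            state["traced"] = True
        return impl(*args, **kwargs)
    return wrapper

calls = []
sort = traced_dispatch(sorted, lambda: calls.append("traced"))
sort([3, 1, 2])
sort([5, 4])
print(calls)  # the trace hook fired only once
```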

In general, it definitely impacts the speed of loading the numpy module. It makes several Python API calls to insert an entry into the global dict cpu_targets_info for each runtime dispatch. Refer to: link.

@charris
Member

charris commented Aug 23, 2023

Looks good. Could you document its use somewhere (I'm not sure where a good place would be) and add a release note. Maybe we need a new document section on tracing and performance tracking.

@seiko2plus seiko2plus added the 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) label Aug 24, 2023
@seiko2plus
Member Author

Maybe we need a new document section on tracing and performance tracking.

Agreed, for now I'm going to add it under https://numpy.org/doc/stable/reference/simd/build-options.html#runtime-dispatch.

@mattip
Member

mattip commented Aug 30, 2023

This is definitely a good step in the right direction: it tells which features will be used with which dtypes to determine runtime dispatching. But it is not the whole story, since additionally there are choices made due to memory overlap (rare) or strides (contiguous/non-contiguous) or shape (too small for BLAS/SIMD, square arrays and 1d arrays use a different path through matmul). I was dreaming of a decorator that would report exactly which loop was used in a particular ufunc call. Maybe that is too hard to do.

It would also be nice to get a report when BLAS is used, which might have helped debug #24512.

@rgommers
Member

We touched upon this in the community meeting today - np.lib.introspect seemed good to everyone, so we can go ahead with that here.

@rgommers rgommers added this to the 2.0.0 release milestone Aug 30, 2023
@seiko2plus seiko2plus force-pushed the cpu_targets_tracer branch 4 times, most recently from 17316e9 to d451f0b Compare August 31, 2023 05:22
@seiko2plus
Member Author

seiko2plus commented Aug 31, 2023

But it is not the whole story, since additionally there are choices made due to memory overlap (rare) or strides (contiguous/non-contiguous) or shape (too small for BLAS/SIMD, square arrays and 1d arrays use a different path through matmul).

This pull request is aimed at tracking the enabled CPU targets, rather than delving into debugging inner SIMD branches. The compiler retains the ability to optimize scalar operations by utilizing native instructions compatible with the enabled targets. For instance, FMA native operations matter for tracking precision loss/gain.

However, the capability to track such branches could be added via a specialized build option, such as -Dtrack-simd-regions. Nonetheless, this option would not be suitable for release builds due to performance regressions.

I was dreaming of a decorator that would report exactly which loop was used in a particular ufunc call. Maybe that is too hard to do.

To precisely identify the SIMD branches taken for given arguments, generating a backtrace would be required; not too hard, but again not suitable for release builds.

It would also be nice to get a report when BLAS is used, which might have helped debug #24512.

Yes, that's possible, but is it already covered by show_config()?

@charris
Member

charris commented Sep 2, 2023

Need rebase.

separated header

  This should be removed once we drop the support of distutils
…tion

@seiko2plus
Member Author

Need rebase.

done

@charris charris removed the 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) label Sep 4, 2023
@charris charris merged commit 5ffeef1 into numpy:main Sep 5, 2023
@charris
Member

charris commented Sep 5, 2023

Thanks Sayed. The backport of this looks to be tricky due to all the file renames in main, so it might not make it into 1.26.

@charris charris removed the 09 - Backport-Candidate (PRs tagged should be backported) label Sep 5, 2023
@charris charris changed the title SIMD: Introduce tracer for enabled CPU targets on each optimized function MAINT: Introduce tracer for enabled CPU targets on each optimized function Sep 5, 2023
@charris charris changed the title MAINT: Introduce tracer for enabled CPU targets on each optimized function ENH: Introduce tracer for enabled CPU targets on each optimized function Sep 5, 2023
@rgommers
Member

rgommers commented Sep 5, 2023

This is a new feature, so it probably wasn't right to backport it to 1.26.0 anyway - we were aiming for zero changes beyond build system swap and bug fixes.

@charris
Member

charris commented Sep 5, 2023

I'll leave this for 2.0.0.


Labels

01 - Enhancement · component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
