MAINT: Use fused-multiply-add for complex numbers calculations by yairchu · Pull Request #26956 · numpy/numpy

yairchu · 2024-07-16T11:13:47Z

Background:

Depending on the compiler, plain C code may or may not use fused multiply-add (fma). For example, numpy-1.26.4 on macOS did not use fma.
However, many calculations in numpy are not plain C code: they use SIMD intrinsics and deliberately use fma, without depending on the compiler to choose it.
When fma is not used in the plain C loops, numpy becomes inconsistent. Sometimes even the exact same deterministic code, executed twice in the same program, may produce different results each time (BUG: Compiler-options-dependent bug in np.square for complex numbers affecting numpy-1.26.4 on macOS on ARM #26940).

This PR makes basic calculations for complex numbers always use fused multiply-add, via the fma functions from <math.h>. It also resolves an additional minor off-by-one bug which contributed to the inconsistency in #26940.
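To illustrate why fusing matters for complex arithmetic, here is a minimal pure-Python sketch. It is not NumPy's loop code: `exact_fma` is a hypothetical helper that emulates a correctly rounded fma with exact rationals, and the input value is chosen only for illustration. It computes the real part of z*z for z = a + a*j, which is mathematically a*a - a*a.

```python
from fractions import Fraction


def exact_fma(x: float, y: float, z: float) -> float:
    """Emulate fma: compute x*y + z exactly, then round once to a double.

    Hypothetical helper for illustration; C code would call fma() from
    <math.h> instead.
    """
    return float(Fraction(x) * Fraction(y) + Fraction(z))


def square_real_unfused(re: float, im: float) -> float:
    # Two separately rounded products, then a rounded subtraction.
    return re * re - im * im


def square_real_fused(re: float, im: float) -> float:
    # One rounded product, then a multiply-add with a single final rounding.
    return exact_fma(re, re, -(im * im))


a = 0.1
unfused = square_real_unfused(a, a)
fused = square_real_fused(a, a)
print(unfused)  # 0.0: both products round identically and cancel
print(fused)    # tiny nonzero value: the rounding error of a*a
```

Neither answer is "wrong" on its own terms, but a loop that fuses and a loop that does not will disagree on inputs like this, which is the kind of inconsistency described above.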

yairchu · 2024-07-16T11:32:13Z

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused? (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438)
And so my new tests fail in those.

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

Mousius · 2024-07-17T09:23:45Z

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused? (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438) And so my new tests fail in those.

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I think that'd be the appropriate action to take. The baseline build of NumPy targets the minimum supported processor, so it may not have FMA instructions. We detect if we can run a more optimal version and dynamically dispatch it.

You might be able to use np.__config__.__cpu_baseline__ or np.__config__.__cpu_features__ to gate the test case when no hardware FMA instruction is available on x86?

seberg · 2024-07-17T10:12:20Z

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I don't know. This seems to be distinct from capabilities (if the test fails, the SIMD code is in use and uses FMA), so it is probably a compiler decision not to fuse it, for speed.

We should be able to ensure FMA is used if SIMD uses it. I guess speed may be mostly irrelevant (the slow loop is rarely, if ever, used), but I don't know.
A best effort to just write fma and let the compiler do whatever it likes to do could be done, but doesn't give any promises to you as a user.

yairchu · 2024-07-17T20:05:46Z

You're right that a best effort to use fma gives no complete cross-platform promise, but I suppose that having the tests explicitly list the platforms for which there isn't a promise will at least make this list explicit, and will guard against regressions on the other platforms?

As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

r-devulap · 2024-07-18T20:19:25Z

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused?

My understanding was that fma and fmaf should compute an accurate result irrespective of what hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

It might be slower, but I think computing the value for a scalar would hardly show up on perf let alone be a bottleneck for a program (unless someone is looping over an array, in which case I argue they are using array processing incorrectly).

yairchu · 2024-07-19T09:52:52Z

@r-devulap

My understanding was that fma and fmaf should compute an accurate result irrespective of what hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

If that's the case then perhaps those CI tests failed due to the scalar version being more accurate than the SIMD one!

From @seberg's comment I infer that whether fma does what cppreference says is platform dependent:

A best effort to just write fma and let the compiler do whatever it likes to do

Looking at the test output to figure out which is fma's behaviour currently doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

Would it generally be a good idea for assert_equal to output at higher precision when the actual and desired are formatted the same? For now I also added another commit to do that, which may fit in another PR but included here at least to help make more sense from CI results.
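The "isn't equal to itself" symptom can be reproduced with a minimal pure-Python sketch. It does not use NumPy or assert_equal; it takes one real component from the message above and its neighbouring double (via `math.nextafter`, Python 3.9+) as assumed stand-ins for the two results, and shows that seven significant digits hide a one-ULP difference while `repr` and `float.hex` expose it.

```python
import math

actual = 2.018506e-13
# Neighbouring double, one unit in the last place above `actual`.
desired = math.nextafter(actual, 1.0)

short_actual = f"{actual:.6e}"
short_desired = f"{desired:.6e}"

print(actual == desired)              # False
print(short_actual == short_desired)  # True: identical at 7 significant digits
print(repr(actual), repr(desired))    # shortest round-trip reprs differ
print(actual.hex(), desired.hex())    # exact bit patterns differ
```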

Mousius · 2024-07-20T13:59:20Z

@r-devulap

My understanding was that fma and fmaf should compute an accurate result irrespective of what hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

Thanks, @r-devulap. My assumption was that it would fall back to a naive implementation, but this indicates otherwise.

If that's the case then perhaps those CI tests failed due to the scalar version being more accurate than the SIMD one!
From @seberg's comment I infer that whether fma does what cppreference says is platform dependent:

A best effort to just write fma and let the compiler do whatever it likes to do

Looking at the test output to figure out which is fma's behaviour currently doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

Would it generally be a good idea for assert_equal to output at higher precision when the actual and desired are formatted the same? For now I also added another commit to do that, which may fit in another PR but included here at least to help make more sense from CI results.

Given what @r-devulap has said, I think they're both correct. I'm guessing using assert_array_max_ulp with a maxulp of 0.5 or 1 would work, as the two implementations will have slightly different roundings but still be considered correct.
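As a rough sketch of what a ULP-based comparison measures, the following pure-Python code counts how many representable doubles separate two values. It only illustrates the idea and is not NumPy's implementation of assert_array_max_ulp or assert_array_almost_equal_nulp; `ordered_bits` and `ulp_distance` are hypothetical helper names.

```python
import math
import struct


def ordered_bits(x: float) -> int:
    """Map a finite double to an integer that increases monotonically with x."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles have the sign bit set; mirror them below zero so that
    # integer ordering matches floating-point ordering.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles separating a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


x = 0.1 * 0.1
y = math.nextafter(x, 1.0)   # next double above x (Python 3.9+)

print(ulp_distance(x, x))    # 0
print(ulp_distance(x, y))    # 1
print(x == y)                # False: exact equality rejects a 1-ULP difference
```

Because the distance is an integer count, a tolerance of 1 accepts two results that differ only by a single final rounding, while exact equality does not.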

Mousius · 2024-08-01T11:56:07Z

@yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

…ltiplication
The test would only fail under the following conditions:
* replace the added assert_array_almost_equal_nulp with assert_equal
* compile numpy with "-ffp-contract=off" flags for clang (or equivalent)
* which was the case for numpy<2 on macOS

…6940 and numpy#26740)
* numpy#26940 already fixed in same PR by nomemoverlap fix (failure depended on existence of two problems)
* For numpy#26740 only consistency within numpy is fixed. Regular complex in Python may still have different results
Currently, fused-multiply-add is typically used everywhere in numpy, depending on the compiler. But some compilers may choose not to use fma, as is the case in numpy-1.26.4 (and any <2 I believe) on macOS on ARM. This commit uses the fma functions from math.h instead of relying on the compiler to decide to use them.

yairchu · 2024-08-05T20:41:49Z

@yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

@Mousius done!

The currently failing test appears to be an unrelated CI failure in test_randomstate.py. I am assuming it's not related to the complex numbers change, as I also see the azure-pipeline numpy.numpy actions fail for the most recent commit builds too.

r-devulap · 2024-10-16T17:45:19Z

@yairchu the 32-bit test still fails.

jorenham · 2026-01-08T17:14:08Z

This seems to have stranded, so I'm going to close it. We can re-open if you still want to work on this @yairchu

yairchu · 2026-01-08T20:10:41Z

@jorenham as this is fixing a real bug, I would still want to work on this, but unfortunately not in the near future. Perhaps I could look again in a few months and understand what was blocking this fix and address it.

jorenham · 2026-01-08T20:34:52Z

@jorenham as this is fixing a real bug, I would still want to work on this, but unfortunately not in the near future. Perhaps I could look again in a few months and understand what was blocking this fix and address it.

I'm glad to hear that; reopened

yairchu mentioned this pull request Jul 16, 2024

BUG: Compiler-options-dependent bug in np.square for complex numbers affecting numpy-1.26.4 on macOS on ARM #26940

Open

yairchu force-pushed the complex-fma branch from cec694f to e964047 Compare July 16, 2024 11:28

melissawm added the component: SIMD label Jul 16, 2024

yairchu force-pushed the complex-fma branch from e964047 to 6c77fd8 Compare July 16, 2024 15:30

charris changed the title ~~Always use fused-multiply-add for complex calculations~~ MAINT: Always use fused-multiply-add for complex calculations Jul 17, 2024

charris added the 03 - Maintenance label Jul 17, 2024

Mousius reviewed Jul 17, 2024

Comment thread numpy/_core/src/umath/loops_utils.h.src Outdated

yairchu mentioned this pull request Jul 17, 2024

BUG: Off by one in memory overlap check #26972

Merged

yairchu force-pushed the complex-fma branch from 6c77fd8 to c844ddd Compare July 17, 2024 19:50

yairchu force-pushed the complex-fma branch 3 times, most recently from 63fc730 to b04949e Compare July 19, 2024 20:46

r-devulap requested changes Jul 20, 2024

Comment thread numpy/_core/tests/test_regression.py Outdated

yairchu force-pushed the complex-fma branch from b04949e to 1c3ae49 Compare July 22, 2024 19:15

yairchu requested a review from r-devulap July 22, 2024 19:17

yairchu changed the title ~~MAINT: Always use fused-multiply-add for complex calculations~~ MAINT: Use fused-multiply-add for complex numbers calculations Jul 27, 2024

charris mentioned this pull request Jul 30, 2024

BUG: Off by one in memory overlap check #27077

Merged

yairchu force-pushed the complex-fma branch from 1c3ae49 to 74e2a4c Compare July 30, 2024 07:35

yairchu added 2 commits August 5, 2024 23:15

yairchu force-pushed the complex-fma branch from 74e2a4c to ca22bcd Compare August 5, 2024 20:15

Merge branch 'main' into complex-fma

6913d6d

jorenham closed this Jan 8, 2026

github-project-automation Bot moved this from Awaiting a code review to Completed in NumPy first-time contributor PRs Jan 8, 2026

jorenham reopened this Jan 8, 2026

jorenham added the 55 - Needs work label Jan 8, 2026

7 participants
