Skip to content

MAINT: Use fused-multiply-add for complex numbers calculations#26956

Open
yairchu wants to merge 3 commits into
numpy:mainfrom
yairchu:complex-fma
Open

MAINT: Use fused-multiply-add for complex numbers calculations#26956
yairchu wants to merge 3 commits into
numpy:mainfrom
yairchu:complex-fma

Conversation

@yairchu

@yairchu yairchu commented Jul 16, 2024

Copy link
Copy Markdown
Contributor

Background:

  • Depending on the compiler, plain C code may either use fused-multiply-add or not. For example numpy-1.26.4 on macOS did not use fma
  • However many calculations in numpy are not plain C code and use SIMD intrinsics, and deliberately use fma without depending on the compiler to choose that
  • When fma is not used in plain C loops there are inconsistencies in numpy. Sometimes even the exact same deterministic code executed twice in the same program may produce different results each time (BUG: Compiler-options-dependent bug in np.square for complex numbers affecting numpy-1.26.4 on macOS on ARM #26940)

This PR makes basic calculations for complex numbers always use fused-multiply-add, using the fma functions from <math.h>, and it also resolves an additional minor off-by-one bug which contributed to the inconsistency #26940

@yairchu

yairchu commented Jul 16, 2024

Copy link
Copy Markdown
Contributor Author

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused? (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438)
And so my new tests fail in those.

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

@melissawm melissawm added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 16, 2024
@charris charris changed the title Always use fused-multiply-add for complex calculations MAINT: Always use fused-multiply-add for complex calculations Jul 17, 2024
@Mousius

Mousius commented Jul 17, 2024

Copy link
Copy Markdown
Member

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused? (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438) And so my new tests fail in those.

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I think that'd be the appropriate action to take. The baseline build of NumPy targets the minimum supported processor, so may not have FMA instructions. We detect if we can run a more optimal version and dynamically dispatch it.

If no hardware instruction is available, you might be able to use np.__config__.__cpu_baseline__ or np.__config__.__cpu_features__ to gate the test case if there's no instruction available on x86?

Comment thread numpy/_core/src/umath/loops_utils.h.src Outdated
@seberg

seberg commented Jul 17, 2024

Copy link
Copy Markdown
Member

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I don't know. Since this seems to be distinct from capabilities (if the test fails, the SIMD code is used and uses FMA), so it is a compiler decision to not fuse it for speed probably.

We should be able to ensure FMA is used if SIMD uses it, I guess speed may be mostly irrelevant (the slow loop is relatively rarely never used), but dunno.
A best effort to just write fma and let the compiler do whatever it likes to do could be done, but doesn't give any promises to you as a user.

@yairchu

yairchu commented Jul 17, 2024

Copy link
Copy Markdown
Contributor Author

You're right that no complete cross-platform promise is gained by best effort to use fma, but I suppose that having the tests explicitly list for which platforms there isn't a promise will at least make this list explicit, and will guard against regressing in the other platforms?

As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

@r-devulap

Copy link
Copy Markdown
Member

Looking at the CI it appears that on some environments fma doesn't really perform fma and falls back to non-fused?

My understanding was that fma and fmaf should compute an accurate results irrespective of what hardware it is run on. From https://en.cppreference.com/w/c/numeric/math/fma Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type. Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

It might be slower, but I think computing the value for a scalar would hardly show up on perf let alone be a bottleneck for a program (unless someone is looping over an array, in which case I argue they are using array processing incorrectly).

@yairchu

yairchu commented Jul 19, 2024

Copy link
Copy Markdown
Contributor Author

@r-devulap

My understanding was that fma and fmaf should compute an accurate results irrespective of what hardware it is run on. From https://en.cppreference.com/w/c/numeric/math/fma Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type. Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

If that's the case then perhaps those CI tests failed due to the scalar version being more accurate than the SIMD one!

From @seberg's comment I infer that whether fma does what cppreference says is platform dependent:

A best effort to just write fma and let the compiler do whatever it likes to do

Looking at the test output to figure out which is fma's behaviour currently doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

Would it generally be a good idea for assert_equal to output at higher precision when the actual and desired are formatted the same? For now I also added another commit to do that, which may fit in another PR but included here at least to help make more sense from CI results.

@yairchu yairchu force-pushed the complex-fma branch 3 times, most recently from 63fc730 to b04949e Compare July 19, 2024 20:46
@Mousius

Mousius commented Jul 20, 2024

Copy link
Copy Markdown
Member

@r-devulap

My understanding was that fma and fmaf should compute an accurate results irrespective of what hardware it is run on. From https://en.cppreference.com/w/c/numeric/math/fma Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type. Shouldn't that mean the test should pass everywhere? Curious to know why that CI failed.

Thanks, @r-devulap. My assumption was that it would fall back to a naive implementation, but this indicates otherwise.

If that's the case then perhaps those CI tests failed due to the scalar version being more accurate than the SIMD one!
From @seberg's comment I infer that whether fma does what cppreference says is platform dependent:

A best effort to just write fma and let the compiler do whatever it likes to do

Looking at the test output to figure out which is fma's behaviour currently doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

Would it generally be a good idea for assert_equal to output at higher precision when the actual and desired are formatted the same? For now I also added another commit to do that, which may fit in another PR but included here at least to help make more sense from CI results.

Given what @r-devulap has said, I think they're both correct. I'm guessing using assert_array_max_ulp with a maxulp of 0.5 or 1 would work, as the two implementations will have slightly different roundings but still be considered correct.

Comment thread numpy/_core/tests/test_regression.py Outdated
@yairchu yairchu requested a review from r-devulap July 22, 2024 19:17
@yairchu yairchu changed the title MAINT: Always use fused-multiply-add for complex calculations MAINT: Use fused-multiply-add for complex numbers calculations Jul 27, 2024
@Mousius

Mousius commented Aug 1, 2024

Copy link
Copy Markdown
Member

@yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

yairchu added 2 commits August 5, 2024 23:15
…ltiplication

The test would only fail under the following conditions:
* replace the added assert_array_almost_equal_nulp with assert_equal
* compile numpy with "-ffp-contract=off" flags for clang (or equiavlent)
  * which was the case for numpy<2 on macOS
…6940 and and numpy#26740)

* numpy#26940 already fixed in same PR by nomemoverlap fix (failure depended on existence of two problems)
* For numpy#26740 only consistency within numpy is fixed. Regular complex in Python may still have different results

Currently, fused-multipliy-add is typically used everywhere in numpy, depending on the compiler.
But some compilers may choose not to use fma,
as is the case in numpy-1.26.4 (and any <2 I believe) on macOS on ARM.

This commit uses the fma functions from math.h instead of relying on the compiler to decide to use them.
@yairchu

yairchu commented Aug 5, 2024

Copy link
Copy Markdown
Contributor Author

@yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

@Mousius done!

The currently failing test appears to be an unrelated CI failure for test_randomstate.py. Am assuming it's not related to the complex numbers change as I also see azure-pipeline numpy.numpy actions fail for most recent commit builds too.

@r-devulap

Copy link
Copy Markdown
Member

@yairchu the 32-bit test still fails.

@jorenham

jorenham commented Jan 8, 2026

Copy link
Copy Markdown
Member

This seems to have stranded, so I'm going to close it. We can re-open if you still want to work on this @yairchu

@jorenham jorenham closed this Jan 8, 2026
@github-project-automation github-project-automation Bot moved this from Awaiting a code review to Completed in NumPy first-time contributor PRs Jan 8, 2026
@yairchu

yairchu commented Jan 8, 2026

Copy link
Copy Markdown
Contributor Author

@jorenham as this is fixing a real bug, I would still want to work on this, but unfortunately not in the near future. Perhaps I could look again in a few months and understand what was blocking this fix and address it.

@jorenham

jorenham commented Jan 8, 2026

Copy link
Copy Markdown
Member

@jorenham as this is fixing a real bug, I would still want to work on this, but unfortunately not in the near future. Perhaps I could look again in a few months and understand what was blocking this fix and address it.

I'm glad to hear that; reopened

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

03 - Maintenance 55 - Needs work component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Projects

Development

Successfully merging this pull request may close these issues.

7 participants