
ENH: add C99 Annex G recoveries for complex multiplication and division following CPython#30806

Draft
ikrommyd wants to merge 11 commits into numpy:main from ikrommyd:complex-div-and-mul-annex-g

Conversation

@ikrommyd
Contributor

@ikrommyd ikrommyd commented Feb 8, 2026

Adds C99 Annex G recoveries to complex multiplication and division that would otherwise result in nan + nanj
Closes #30805
Closes #26560
Closes #17425
Closes #30470

See relevant CPython PRs that did these updates for CPython:
python/cpython#120287 and python/cpython#119457

Because multiplication also has a SIMD implementation, we have to adapt that as well; otherwise you'd get different answers depending on whether your numpy build has SIMD enabled, or on whether you are working with complex64/complex128 vs complex256, since I only see SIMD implementations for fp32 and fp64.
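For readers unfamiliar with the recovery, here is a rough Python sketch of the multiplication recovery from the C99 Annex G (G.5.1) example code, which is the same algorithm the referenced CPython PRs implement in C. The function name and structure are illustrative only, not the actual code in this PR:

```python
import math

def mul_with_recovery(a, b, c, d):
    """(a + bj) * (c + dj) with the C99 Annex G nan+nanj recovery.

    Illustrative sketch; the real change lives in numpy's C ufunc loops.
    """
    ac, bd, ad, bc = a * c, b * d, a * d, b * c
    x = ac - bd  # naive real part
    y = ad + bc  # naive imaginary part
    if math.isnan(x) and math.isnan(y):  # the unlikely recovery branch
        recalc = False
        if math.isinf(a) or math.isinf(b):  # first operand is infinite
            # "Box" the infinity: keep its direction, drop the magnitude.
            a = math.copysign(1.0 if math.isinf(a) else 0.0, a)
            b = math.copysign(1.0 if math.isinf(b) else 0.0, b)
            if math.isnan(c):
                c = math.copysign(0.0, c)
            if math.isnan(d):
                d = math.copysign(0.0, d)
            recalc = True
        if math.isinf(c) or math.isinf(d):  # second operand is infinite
            c = math.copysign(1.0 if math.isinf(c) else 0.0, c)
            d = math.copysign(1.0 if math.isinf(d) else 0.0, d)
            if math.isnan(a):
                a = math.copysign(0.0, a)
            if math.isnan(b):
                b = math.copysign(0.0, b)
            recalc = True
        if not recalc and (math.isinf(ac) or math.isinf(bd)
                           or math.isinf(ad) or math.isinf(bc)):
            # A partial product overflowed to infinity: zero out any NaNs.
            a = math.copysign(0.0, a) if math.isnan(a) else a
            b = math.copysign(0.0, b) if math.isnan(b) else b
            c = math.copysign(0.0, c) if math.isnan(c) else c
            d = math.copysign(0.0, d) if math.isnan(d) else d
            recalc = True
        if recalc:
            x = math.inf * (a * c - b * d)
            y = math.inf * (a * d + b * c)
    return (x, y)
```

For example, (1e300 + 1j) * (nan + infj), which is nan + nanj under the naive formula, recovers to -inf + infj.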

@ikrommyd
Contributor Author

ikrommyd commented Feb 8, 2026

Opening it as draft for starters because there are decisions to be made from the maintainers here and it also definitely needs benchmarking since we're touching ufunc loops.

I'd also appreciate some help on the SIMD implementations, as I am completely clueless about SIMD beyond knowing what it is, so I asked Claude for help on that part :). The rest is all implemented manually, although there may be typos, because parsing .c.src files with my eyes is a bit difficult.

@charris
Member

charris commented Feb 9, 2026

@r-devulap for the SIMD.

@ikrommyd
Contributor Author

ikrommyd commented Feb 9, 2026

What I'm debating with myself here is whether performance or "correctness" is more worth it, meaning compliance with the C99 Annex G standard and consistency with CPython, because the extra if branch in the ufunc loops (marked as NPY_UNLIKELY) will have a performance cost. If maximum performance is the target, then being inconsistent with CPython and C99 Annex G may be just fine, I guess. In that case the issues this PR fixes should be closed as "won't fix", since I don't think you can have it both ways.

@ikrommyd
Contributor Author

ikrommyd commented Feb 9, 2026

The more I think about this, and about what we'd have to do to make multiplication consistent with other operations that are effectively multiplication, like square and power(x, 2), the more I believe it's best to only do this for division and ignore multiplication and powers altogether. I think the performance hit that multiplications would take is probably not worth it. Happy to discuss this in a triage or community meeting though.
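For reference, the division recovery is structured similarly. Below is an illustrative Python sketch following the C99 Annex G example code, with hypothetical names; it omits the scalbn() scaling that the real implementation uses to keep c*c + d*d from overflowing, and it hoists the zero-denominator case because Python raises ZeroDivisionError where C would just produce NaNs:

```python
import math

def div_with_recovery(a, b, c, d):
    """(a + bj) / (c + dj) with the C99 Annex G recoveries.

    Sketch only: no scaling against overflow in the denominator, and the
    zero-denominator case is handled up front (Python raises on x / 0.0).
    """
    denom = c * c + d * d
    if denom == 0.0:
        if math.isnan(a) and math.isnan(b):
            return (math.nan, math.nan)
        # nonzero / zero: a signed infinity
        return (math.copysign(math.inf, c) * a, math.copysign(math.inf, c) * b)
    x = (a * c + b * d) / denom  # naive real part
    y = (b * c - a * d) / denom  # naive imaginary part
    if math.isnan(x) and math.isnan(y):  # the unlikely recovery branch
        if (math.isinf(a) or math.isinf(b)) and math.isfinite(c) and math.isfinite(d):
            # infinite / finite: infinity in the right direction
            a = math.copysign(1.0 if math.isinf(a) else 0.0, a)
            b = math.copysign(1.0 if math.isinf(b) else 0.0, b)
            x = math.inf * (a * c + b * d)
            y = math.inf * (b * c - a * d)
        elif (math.isinf(c) or math.isinf(d)) and math.isfinite(a) and math.isfinite(b):
            # finite / infinite: a signed zero
            c = math.copysign(1.0 if math.isinf(c) else 0.0, c)
            d = math.copysign(1.0 if math.isinf(d) else 0.0, d)
            x = 0.0 * (a * c + b * d)
            y = 0.0 * (b * c - a * d)
    return (x, y)
```

With these recoveries, (1 + 1j) / (inf + infj) yields 0 instead of nan + nanj, and (inf + nanj) / (2**1000 + 2**-1000j) yields inf - infj.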

@ngoldbaum
Member

ngoldbaum commented Feb 10, 2026

@WarrenWeckesser @skirpichev ping in case you're interested in this, since you commented on related issues.

@skirpichev

I realize that this is a draft, but for the final version I suggest splitting the division and multiplication parts.

What i'm debating with myself here is whether performance or "correctness" and compliance with the C99 Annex G standard and consistency with CPython is more worth it because the extra if branch in the ufunc loops (marked as NPY_UNLIKELY) will have a performance cost.

For both multiplication and division this will cost an additional isnan() check in the "normal" case. I would appreciate benchmarks that show its effect. This is something that was missing in my PR (#26595).

@ikrommyd
Contributor Author

ikrommyd commented Feb 11, 2026

Yeah, I have done some local benchmarks and have also used the existing ufunc benchmarks in the benchmarks folder. I want to write a proper asv benchmark; for local benchmarks I've just been using a single script. For multiplication, the extra isnan check is comparable in cost to the actual multiplication, because multiplication is very fast, and I've seen a significant loss of performance for multiplication even in the nominal case. For division, because it's a more complicated operation, the loss isn't that bad in the nominal case; it can be like 5%, maybe 10%. However, all that changes if you actually have many values to correct in an array. I've been trying different percentages of "bad" (nan + nanj) values to correct. For multiplication, if for example 50% of your array is bad, the performance hit is massive. I will return here once I have a proper asv benchmark and post results for both my x86 Linux and arm64 Mac machines.

PS. For some existing ufunc benchmarks, I'm somehow getting a 20% improvement between this PR branch and main (consistent across both my machines). I can't explain that yet.

@ngoldbaum
Member

The asv ufunc benchmarks are noisy. If you compare the same commit against itself, you'll see random improvements and performance decreases.

@ikrommyd
Contributor Author

ikrommyd commented Feb 11, 2026

When I compare main to main or the PR branch to itself, asv doesn't report "significant performance changes" though. When I compare the PR branch with main, it does.

Anyway, until I have an appropriate new benchmark, I've been using this somewhat naive script:
https://gist.github.com/ikrommyd/102de766bd88b551cb7dc489bb10d9be (courtesy of Claude, because I'm very lazy when my programming speed is limited by my typing speed).
It makes arrays with different percentages of "bad" numbers to correct and runs the affected ufuncs for different array sizes.
Here are the results for numpy compiled with spin build --clean -- -Dcpu-baseline=none -Dcpu-dispatch=none with Python 3.14.3.
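For the record, a minimal sketch of that kind of benchmark (this is not the actual gist; the function name, sizes, and the choice of inf + nanj as the "bad" value are illustrative):

```python
import time
import numpy as np

def bench_multiply(n=100_000, bad_fraction=0.05, repeats=5):
    """Time np.multiply on complex arrays where a given fraction of the
    elements would produce nan + nanj under the naive complex product."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    # inf + nanj times a finite value is nan + nanj without the recovery,
    # so these elements exercise the recovery path.
    a[: int(n * bad_fraction)] = complex(np.inf, np.nan)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.multiply(a, b)
        best = min(best, time.perf_counter() - t0)
    return best
```

Running this for each array size and bad fraction of interest on both branches and comparing the minima gives tables like the ones below.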
MAIN:

======================================================================
  complex128
======================================================================

  [clean]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.430      5.070      3.866      3.641      3.098     10.431     10.353      4.053
    10,000        (100x)        1.935      1.991      0.670      0.755      0.444      3.086      3.164      0.431
    100,000       (10x)         4.871      4.069      0.737      0.714      0.427      3.065      3.131      0.458
    1,000,000     (1x)          5.130      4.678      0.773      0.698      0.445      3.067      3.106      0.442

  [1% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.565      4.968      3.886      3.942      3.095     10.426     10.330      4.178
    10,000        (100x)        1.955      2.037      0.707      0.743      0.441      3.142      3.245      0.447
    100,000       (10x)         4.990      4.131      0.742      0.697      0.429      3.061      3.213      0.432
    1,000,000     (1x)          5.141      4.620      0.699      0.699      0.434      3.119      3.207      0.424

  [5% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.309      5.028      4.453      4.188      3.592     11.000     10.970      4.657
    10,000        (100x)        1.926      2.109      0.663      0.725      0.427      3.202      3.203      0.436
    100,000       (10x)         4.956      4.192      0.764      0.723      0.430      3.183      3.185      0.441
    1,000,000     (1x)          5.149      4.670      0.699      0.691      0.432      3.106      3.184      0.413

  [10% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.307      4.957      4.428      4.472      3.580     10.760     10.639      4.665
    10,000        (100x)        1.941      2.170      0.695      0.736      0.439      3.149      3.232      0.451
    100,000       (10x)         4.962      4.279      0.739      0.721      0.428      3.171      3.244      0.436
    1,000,000     (1x)          5.171      4.712      0.703      0.703      0.433      3.159      3.216      0.463

  [25% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.562      5.220      4.431      4.496      3.611     10.703     10.806      4.610
    10,000        (100x)        1.921      2.391      0.676      0.759      0.430      3.310      3.311      0.453
    100,000       (10x)         4.688      4.310      0.767      0.724      0.414      3.276      3.429      0.487
    1,000,000     (1x)          5.194      4.804      0.739      0.697      0.426      3.311      3.441      0.436

  [50% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.464      6.723      4.418      4.456      3.612     10.960     10.844      4.600
    10,000        (100x)        1.912      2.656      0.715      0.759      0.449      3.326      3.334      0.438
    100,000       (10x)         5.046      4.464      0.789      0.725      0.394      3.411      3.552      0.457
    1,000,000     (1x)          5.302      5.051      0.739      0.699      0.440      3.423      3.568      0.439

vs PR branch:

======================================================================
  complex128
======================================================================

  [clean]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.571      5.871      3.968      3.978      3.430     11.109     11.038      4.422
    10,000        (100x)        1.937      2.993      0.920      0.922      0.760      3.865      3.947      0.811
    100,000       (10x)         4.732      3.946      0.910      0.886      0.742      3.719      3.839      0.788
    1,000,000     (1x)          5.105      4.611      0.899      0.879      0.748      3.758      3.850      0.742

  [1% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.436      5.881      4.079      4.096      3.493     11.396     11.204      4.549
    10,000        (100x)        1.941      3.003      0.948      1.013      0.848      4.119      4.147      0.856
    100,000       (10x)         4.856      3.881      1.014      0.985      0.879      3.989      4.115      0.902
    1,000,000     (1x)          5.406      4.793      1.131      1.037      0.896      4.059      4.131      0.909

  [5% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.512      6.013      4.631      4.566      4.035     11.901     11.902      5.011
    10,000        (100x)        1.978      3.104      1.014      1.022      0.845      4.108      4.160      0.853
    100,000       (10x)         6.094      4.849      1.202      1.113      0.943      4.354      4.453      0.992
    1,000,000     (1x)          6.473      5.464      1.730      1.462      1.237      4.457      4.595      1.249

  [10% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.611      6.019      4.784      4.663      4.123     12.028     11.850      5.107
    10,000        (100x)        2.016      3.142      1.086      1.027      0.829      4.144      4.197      0.883
    100,000       (10x)         7.541      5.990      2.102      1.533      1.220      4.811      4.804      1.223
    1,000,000     (1x)          7.810      6.320      2.567      2.029      1.797      4.990      4.988      1.798

  [25% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      5.721      6.279      5.006      4.868      4.272     12.447     12.462      5.206
    10,000        (100x)        2.138      3.372      1.347      1.229      0.977      4.311      4.380      0.986
    100,000       (10x)        11.681      8.649      5.040      3.452      3.081      6.682      6.685      3.065
    1,000,000     (1x)         11.924      8.886      5.339      3.806      3.493      6.915      6.893      3.497

  [50% bad]
    n             (x)          divide reciprocal   multiply        a*a     square      power  float_pow         **
    100           (10000x)      6.007      7.976      5.310      5.031      4.310     12.333     12.155      5.255
    10,000        (100x)        2.305      3.600      1.890      1.443      1.111      4.618      4.754      1.120
    100,000       (10x)        14.957     12.396      9.054      6.859      6.275     10.364     10.428      6.283
    1,000,000     (1x)         15.017     12.586      9.228      7.047      6.506     10.496     10.514      6.500

This is on my AMD Ryzen 7 7800X3D.

@skirpichev

skirpichev commented Feb 11, 2026

Results look noisy when I try to compare the "clean" cases. Sometimes (in the same column) the PR is even faster (e.g. divide). I suggest using pyperf for benchmarks.

@ikrommyd
Contributor Author

ikrommyd commented Feb 11, 2026

Results look noisy when I try to compare the "clean" cases. Sometimes (in the same column) the PR is even faster (e.g. divide). I suggest using pyperf for benchmarks.

Yeah, I need to do better benchmarking, but division sometimes being faster is also something I saw with asv. Anyway, I think it's evident that multiplication takes a big hit, especially if you have a lot of bad numbers to correct.

@ikrommyd
Contributor Author

ikrommyd commented Feb 11, 2026

Follow up from the community meeting

-Ofast and -ffast-math disable these recoveries for C++; otherwise they are there (this may be implementation dependent though).

❯ g++ main.cpp -o main -O0; ./main
(1+1j)/(inf+infj) = (0,0)
(inf+nanj)/(2^1000+2^-1000j) = (inf,-inf)
(1e300+1j)*(nan+infj) = (-inf,inf)

❯ g++ main.cpp -o main -O1; ./main
(1+1j)/(inf+infj) = (0,0)
(inf+nanj)/(2^1000+2^-1000j) = (inf,-inf)
(1e300+1j)*(nan+infj) = (-inf,inf)

❯ g++ main.cpp -o main -O2; ./main
(1+1j)/(inf+infj) = (0,0)
(inf+nanj)/(2^1000+2^-1000j) = (inf,-inf)
(1e300+1j)*(nan+infj) = (-inf,inf)

❯ g++ main.cpp -o main -O3; ./main
(1+1j)/(inf+infj) = (0,0)
(inf+nanj)/(2^1000+2^-1000j) = (inf,-inf)
(1e300+1j)*(nan+infj) = (-inf,inf)

❯ g++ main.cpp -o main -Ofast; ./main
(1+1j)/(inf+infj) = (nan,nan)
(inf+nanj)/(2^1000+2^-1000j) = (nan,nan)
(1e300+1j)*(nan+infj) = (nan,nan)

❯ g++ main.cpp -o main -ffast-math; ./main
(1+1j)/(inf+infj) = (-nan,-nan)
(inf+nanj)/(2^1000+2^-1000j) = (nan,nan)
(1e300+1j)*(nan+infj) = (nan,nan)

main.cpp:
#include <complex>
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double inf = std::numeric_limits<double>::infinity();
    double mynan = std::numeric_limits<double>::quiet_NaN();

    std::complex<double> a(1, 1);
    std::complex<double> b(inf, inf);
    std::cout << "(1+1j)/(inf+infj) = " << a/b << std::endl;

    std::complex<double> d(inf, mynan);
    std::complex<double> e(std::pow(2,1000), std::pow(2,-1000));
    std::cout << "(inf+nanj)/(2^1000+2^-1000j) = " << d/e << std::endl;

    std::complex<double> z(1e300, 1);
    std::complex<double> c(mynan, inf);
    std::cout << "(1e300+1j)*(nan+infj) = " << z*c << std::endl;

    return 0;
}

cc @seberg @mattip

EDIT(seberg): Godbolt link: https://godbolt.org/z/5dvqqYeWh

@seberg
Member

seberg commented Feb 11, 2026

Yeah I need to do better benchmarking but division being better sometimes is also something that I saw with asv.

Yeah, the benchmarks can change rather randomly unfortunately. It might be a good indicator if the change is similar across float and double for example.

If C++ implements these, I am tempted to take a small performance hit for it (although there have been very few real complaints about it). It seems like one of those things that is nice to get right, but in practice almost nobody notices :).

@ikrommyd
Contributor Author

ikrommyd commented Feb 11, 2026

Yeah, I'd think what we actually want here is a slim-to-none performance hit in the nominal case (no nan + nanj to correct) and a small performance hit in a realistic case where some values are nan + nanj and need correction (I don't know what that realistic case might be; 1% of entries? 5%?).

@skirpichev

Yeah I'd think what we actually want here is a slim to none performance hit only in the nominal case (no nan + nanj to correct)

But for multiplication I see a difference of up to 16% in the "clean" case. That looks like too much to me for one isnan() check vs 6 arithmetic ops.

-Ofast and -ffast-math disable these recoveries for C++

Not a surprise; those options exist to break the standard.

In clang, the implementation of complex multiplication and division is controlled by -fcomplex-arithmetic; IIRC, -ffast-math implies something like -fcomplex-arithmetic=basic, i.e. primitive algorithms (no Smith's division, for example) and no recovery of NaNs.
