Description
This Stack Overflow question appears to have identified two real scaling bottlenecks inside NumPy.
I built Python 3.15t from source using pyenv:
$ CFLAGS=-O2 pyenv install 3.15t-dev -k -g
...
$ pyenv global 3.15t-dev-debug
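Before building NumPy, it can be worth confirming that the interpreter picked up by pyenv really is a free-threaded build running with the GIL disabled. The following is a minimal sketch using only the standard library; the helper name `describe_runtime` is hypothetical and not part of the original reproduction steps.

```python
import sys
import sysconfig
import tracemalloc


def describe_runtime():
    """Report interpreter properties relevant to a free-threading benchmark.

    `describe_runtime` is a hypothetical helper for illustration only.
    """
    # sys._is_gil_enabled() exists on Python 3.13+; on older versions the
    # GIL is always enabled, so fall back to True.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    return {
        # Py_GIL_DISABLED is 1 for free-threaded ("t") builds.
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled": bool(gil_check()) if gil_check is not None else True,
        # True only if tracemalloc tracing was started explicitly.
        "tracemalloc_tracing": tracemalloc.is_tracing(),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

On a free-threaded build one would expect `free_threaded_build` to be true; `gil_enabled` can still be true if an extension module without free-threading support caused the GIL to be re-enabled at import time.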
Then, in an up-to-date clone of the NumPy repo:
python -m pip install . -v -Cbuilddir=build -C'setup-args=-Dbuildtype=debugoptimized'
Finally, run this script, which must be named mtmp.py: https://gist.github.com/ngoldbaum/ff6428e2510e991247eba4c809dcc8ac. It is identical to what was posted on Stack Overflow, but with some code commented out.
On my M3 MacBook Pro, I get the following stdout when running the script:
Inner loops 10, multithreading time: 6.68 sec, result sum: 717434683.1879175
Inner loops 10, multiprocessing time: 4.86 sec, result sum: 717434683.1879175
If I run the script like so:
PYTHONPERFSUPPORT=1 samply record $(pyenv which python) mtmp.py
Then I get a profile like this: https://share.firefox.dev/3Lde1Lm. I see two different scaling bottlenecks. First, there seems to be a contended mutex inside tracemalloc.
Second, the critical section I added to the array creation routines in #29394 is causing some contention in this case.
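For anyone who wants a smaller test case than the linked gist, the following is a rough sketch of a thread-scaling check built around repeated small-array creation. It is not the script from the gist or from Stack Overflow; the function names, array size, and iteration counts are assumptions chosen for illustration, and it has not been confirmed to hit the same two bottlenecks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def create_arrays(n_iter):
    # Repeatedly allocate a small array so that the array creation path
    # dominates the runtime. The size 8 is an arbitrary choice.
    total = 0.0
    for _ in range(n_iter):
        total += float(np.ones(8).sum())
    return total


def time_threads(n_threads, n_iter):
    # Each thread does the same fixed amount of work, so with ideal scaling
    # the elapsed time stays roughly flat as n_threads grows.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(create_arrays, [n_iter] * n_threads))
    return time.perf_counter() - start, sum(results)


if __name__ == "__main__":
    for n_threads in (1, 2, 4, 8):
        elapsed, total = time_threads(n_threads, 200_000)
        print(f"threads={n_threads} time={elapsed:.2f} sec sum={total}")
```

If the elapsed time grows noticeably with the thread count on a free-threaded build, that would suggest contention somewhere in the allocation path, and profiling as above would be needed to say where.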