
ENH: enable alloc cache on free-threading builds #30499

Merged
seberg merged 13 commits into numpy:main from kumaraditya303:alloc
Jan 5, 2026

Conversation

@kumaraditya303

@kumaraditya303 kumaraditya303 commented Dec 22, 2025

Contributor

This PR enables the alloc cache on free-threading builds by making it thread-local. I verified it locally by running all NumPy tests under pytest-run-parallel, and all passed.

cc @ngoldbaum

@kumaraditya303 kumaraditya303 marked this pull request as ready for review December 22, 2025 11:54
@seberg

seberg commented Dec 22, 2025

Member

Hmmmm, if we do this, don't we leak the cache when threads get destroyed? We could tie cleanup somehow to the Python thread object, but that would add a bit of complexity.

@kumaraditya303

Contributor Author

> Hmmmm, if we do this, don't we leak the cache when threads get destroyed? We could tie cleanup somehow to the Python thread object, but that would add a bit of complexity.

Yes, any cached object would not get cleaned up if the thread gets destroyed. Tracking and cleaning that up is complex because if we do it, then it needs to happen before Python is finalized so we cannot use things like C++ destructors.

@ngoldbaum

Member

> Yes, any cached object would not get cleaned up if the thread gets destroyed. Tracking and cleaning that up is complex because if we do it, then it needs to happen before Python is finalized so we cannot use things like C++ destructors

Then I think that rules out making this cache thread-local. Otherwise we'd leak items in the cache whenever threads are cleaned up.

@kumaraditya303

Contributor Author

> Then I think that rules out making this cache thread-local. Otherwise we'd leak items in the cache whenever threads are cleaned up.

I think that's only a real problem if you create many short-lived threads, though. A more realistic application would use something like a thread pool to reuse threads, where this cache would help performance and all the memory would be freed at exit, as it currently is.

Also, as far as I can see, NumPy currently does not clear its freelists during GC or at exit, so it holds on to the memory in them the entire time; this issue is present on default builds as well. (CPython clears all freelists during GC.)

@kumaraditya303

kumaraditya303 commented Dec 23, 2025

Contributor Author

I'll see if I can make this cache thread-safe while keeping it global. That would still have the current issue of holding memory until process exit, though.

@kumaraditya303 kumaraditya303 marked this pull request as draft December 23, 2025 18:20
@ngoldbaum

ngoldbaum commented Dec 23, 2025

Member

Sebastian may have opinions too, but IMO I'd rather not allow a possibly unbounded memory leak like that, incurred for every new thread. I think short-lived worker threads are reasonably common.

@seberg

seberg commented Dec 23, 2025

Member

I agree, that sounds like an "I know that I am not doing this" thing. So I am not in favor of defaulting to something that will leak slowly for some (even uncommon) very long-running programs (you may be able to convince me, but I think I would need input from people who work in that space).
I suppose an opt-in would work, but I don't think that is worthwhile either really.

FWIW, I suspect we could do this type of thing by tying it to the thread with weakrefs, just like threading.local() does. Dunno if it's worthwhile if it is just this, but it may be OK if there are more such things.

@kumaraditya303

Contributor Author

I have now changed the implementation to use a C++ destructor to clear the cache at thread exit. I had missed the fact that this cache only stores free memory blocks, so no PyObjects or incref/decref calls are involved. While it is not safe to call incref/decref etc. in a C++ global destructor, because Python may already be finalized by that time, it is safe to call free to release the memory.
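
To make the approach concrete, here is a minimal sketch of a thread-local cache of raw memory blocks whose destructor releases them at thread exit. It is not the code from alloc.cpp: `BlockCache`, `cached_alloc`, `cached_free`, `kMaxCached`, and `kBlockSize` are hypothetical names, and a single fixed block size is assumed for brevity.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical names throughout; this is not the code from alloc.cpp.
constexpr std::size_t kMaxCached = 4;    // per-thread capacity (assumed)
constexpr std::size_t kBlockSize = 64;   // single size class (assumed)

struct BlockCache {
    void *blocks[kMaxCached];
    std::size_t count = 0;

    // Runs at thread exit for every thread that constructed the cache.
    // It only calls std::free, so it does not need a live interpreter.
    ~BlockCache() {
        while (count > 0) {
            std::free(blocks[--count]);
        }
    }
};

static thread_local BlockCache tls_cache;

// Hand out a cached block if this thread has one, else fall back to malloc.
void *cached_alloc() {
    BlockCache &c = tls_cache;
    if (c.count > 0) {
        return c.blocks[--c.count];   // reuse a block freed on this thread
    }
    return std::malloc(kBlockSize);
}

// Keep the block for later reuse on this thread, or free it if the cache is full.
void cached_free(void *p) {
    BlockCache &c = tls_cache;
    if (c.count < kMaxCached) {
        c.blocks[c.count++] = p;
    } else {
        std::free(p);
    }
}
```

Because the destructor only calls `std::free`, it touches no interpreter state, which is the property the comment above relies on.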

@kumaraditya303 kumaraditya303 marked this pull request as ready for review December 26, 2025 07:28

@ngoldbaum ngoldbaum left a comment

Member


LGTM! Would appreciate a review from @seberg as well before merging.

Comment thread numpy/_core/src/multiarray/alloc.cpp Outdated
}
} cache_destructor;

static thread_local cache_destructor tls_cache_destructor;

Member


If anyone is curious, here is the relevant bit of the C++ spec: https://timsong-cpp.github.io/cppwp/n4140/basic.start.term. Assuming the thread exits gracefully, this will be invoked on thread exit.

@kumaraditya303 kumaraditya303 added the 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) label Dec 30, 2025
@kumaraditya303 kumaraditya303 requested a review from seberg January 4, 2026 13:17
@seberg

seberg commented Jan 5, 2026

Member

Yeah, this should be good, and I assume that the per-thread handling is completely fine (maybe even good). I also don't think that the freeing should ever do anything (even in the future) that requires the Python interpreter to be alive.

So let's give this a shot, thanks @kumaraditya303.

@seberg seberg merged commit 83e2486 into numpy:main Jan 5, 2026
75 checks passed
@kumaraditya303 kumaraditya303 deleted the alloc branch January 5, 2026 09:50
@seberg

seberg commented Jan 18, 2026

Member

This may sound silly, but is there a nice way to confirm that the deallocator actually runs? I have a new computer and ran some tests in valgrind (I actually should disable the cache for that).
Anyway, it reports lost allocations that fit the bill for being cached here, so it seems to me like the destructor may not be called, but the code and C++ logic feels like it should be...

@charris

charris commented Jan 18, 2026

Member

Does the new computer run valgrind faster than the previous one? (asking the important questions)

@seberg

seberg commented Jan 19, 2026

Member

I haven't done it in a while, but I ran 16 processes with a manual per-module xdist split and it finished comfortably overnight (I honestly don't know how long it took).
So yeah, but I am not sure how much of that xdist stuff I did before.

EDIT: but to be clear, I suspect that leak sanitizer is likely much faster and the better path, it just needs a bit of work (I haven't looked into it exactly). valgrind is slow, but it is pretty hands-off (e.g. it works with stock Python and doesn't need many suppressions).

@kumaraditya303

Contributor Author

> This may sound silly, but is there a nice way to confirm that the deallocator actually runs? I have a new computer and ran some tests in valgrind (I actually should disable the cache for that).

I had tried adding print statements and it does indeed work. Can you share steps to reproduce the leak?
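
One hedged way to check this outside NumPy, in the spirit of the print-statement test, is to count destructor calls across worker threads instead of printing. The names below (`CountingGuard`, `g_destroyed`, `count_thread_exit_destructors`) are hypothetical, and the sketch assumes each worker actually uses the thread-local object before it exits.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical instrumentation, not part of NumPy.
static std::atomic<int> g_destroyed{0};

struct CountingGuard {
    int touched = 0;
    // Record every destructor call so the caller can count them afterwards.
    ~CountingGuard() { g_destroyed.fetch_add(1, std::memory_order_relaxed); }
};

static thread_local CountingGuard tls_guard;

// Spawns n threads that each use the thread-local object, joins them,
// and returns how many destructors ran on those threads.
int count_thread_exit_destructors(int n) {
    int before = g_destroyed.load();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        // Writing to the object is an odr-use, so each thread constructs it.
        workers.emplace_back([] { tls_guard.touched = 1; });
    }
    for (auto &t : workers) {
        t.join();
    }
    return g_destroyed.load() - before;
}
```

If the function returns fewer than `n`, the destructor did not run on some thread with that compiler and platform.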

@seberg

seberg commented Jan 19, 2026

Member

Hmmm, yeah, that works on my mac, but doesn't seem to work on linux with gcc!?

@seberg

seberg commented Jan 19, 2026

Member

Hmmmm, this might be related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61991

EDIT: OTOH, looking more than 5 seconds... it seems a bit ancient.

@kumaraditya303

Contributor Author

Can you try with a recent compiler on Linux, like clang 19 or 20? I tried on Linux and it works for me.

@seberg

seberg commented Jan 19, 2026

Member

@kumaraditya303 I tried with gcc 13 and 15.2.1, so I assume gcc is at fault here. I haven't figured out a solution yet. Using the struct in some function makes the destructor get called once at process shutdown (and not at thread shutdown?! I.e. test_multithreading.py should spit this out a lot with -- -s).
Even adding a non-trivial constructor or removing the static didn't help?
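
One general C++ rule that may be worth ruling out here, though nothing in this thread establishes it as the cause: an object with thread storage duration only has to be initialized before its first odr-use on a given thread, and it is destroyed at thread exit only if it was constructed. Below is a minimal sketch, with hypothetical names (`CacheCleanup`, `cleanup_for_this_thread`, `use_cache`), that routes every cache user through a function-local `thread_local` so that each such thread constructs the object.

```cpp
#include <atomic>
#include <thread>

// Hypothetical counter, only here so the effect can be observed.
static std::atomic<int> g_cleanups{0};

struct CacheCleanup {
    ~CacheCleanup() { g_cleanups.fetch_add(1); }
    void touch() {}   // gives callers something to call on the object
};

// Function-local thread_local: constructed the first time each thread
// calls this function, and destroyed when that thread exits.
CacheCleanup &cleanup_for_this_thread() {
    static thread_local CacheCleanup guard;
    return guard;
}

// Stand-in for an allocation path that every cache user goes through.
void use_cache() {
    cleanup_for_this_thread().touch();
}

// Runs use_cache on one worker thread and reports how many cleanups ran.
int cleanups_after_one_worker() {
    int before = g_cleanups.load();
    std::thread worker(use_cache);
    worker.join();
    return g_cleanups.load() - before;
}
```

In this arrangement a thread that never calls `use_cache` never constructs the guard, so it has nothing to destroy, while every thread that does call it gets exactly one cleanup at exit.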


Labels

01 - Enhancement 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)


4 participants