ENH: enable alloc cache on free-threading builds#30499
Conversation
|
Hmmmm, if we do this, don't we leak the cache when threads get destroyed? We could tie cleanup somehow to the Python thread object, but that would add a bit of complexity. |
Yes, any cached object would not get cleaned up if the thread gets destroyed. Tracking and cleaning that up is complex because if we do it, then it needs to happen before Python is finalized so we cannot use things like C++ destructors. |
Then I think that rules out making this cache thread-local. Otherwise we'd leak items in the cache whenever threads are cleaned up. |
I think that's only a real problem if you create many short lived threads though. In a more realistic application, it would use something like a thread pool to reuse thread where this cache would help in performance and all the memory would be freed at exit like it currently does. Also as far as I see, numpy currently does not clears freelists during GC or at exit so it holds on to memory in freelists the entire time, this issue is present on default builds as well. (CPython clears all freelists during the GC) |
|
I'll see if I can make this cache thread safe while keeping it global. That would still have the current issue of holding memory till process exit though. |
|
Sebastian may have opinions too but IMO I'd rather not allow a possibly unbounded memory leak like that that is incurred for every new thread. I think short-lived worker threads are reasonably common. |
|
I agree, that sounds like "I know that I am not doing this" thing. So I am not in favor of defaulting to something that will leak slowly for some (even uncommon) very long running programs (you may be able to convince me, but I think I would need input form people who may work in that space). FWIW, I suspect we could do this type of thing by tying it to the thread with weakrefs, just like |
4e573cf to
db32836
Compare
|
I have changed the implementation now to use C++ destructor to clear the cache at thread exit. I had missed the fact that this cache only stores free memory blocks and no PyObjects or incref/decrefs are involved. While it is not safe to call incref/decref etc in a C++ global destructor because Python can be finalized by that time, it is safe to call |
| } | ||
| } cache_destructor; | ||
|
|
||
| static thread_local cache_destructor tls_cache_destructor; |
There was a problem hiding this comment.
if anyone is curious, the relevant bit of the C++ spec: https://timsong-cpp.github.io/cppwp/n4140/basic.start.term. Assuming the thread exits gracefully, this will be invoked on thread exit.
|
Yeah, this should be good and I assume that the per-thread handling is completely fine (maybe even good). I also don't think that the free'ing should ever do anything (even in the future) that requires the Python interpreter to be alive. So let's give this a shot, thanks @kumaraditya303. |
|
This may sound silly, but is there a nice way to confirm that the deallocator actually runs? I have a new computer and ran some tests in valgrind (I actually should disable the cache for that). |
|
Does the new computer run valgrind faster than the previous one? (asking the important questions) |
|
I didn't do in a while, but I ran 16 processes and did a manual per-module xdist and it finished comfortably over night (I honestly don't know how long). EDIT: but to be clear, I suspect that leak sanitizer is likely much faster and the better path, just that needs a bit of work (I haven't looked into it exactly). |
I had tried adding print statements and it indeed works, can you share steps to reproduce the leak? |
|
Hmmm, yeah, that works on my mac, but doesn't seem to work on linux with gcc!? |
|
Hmmmm, this might be related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61991 EDIT: OTOH, looking more than 5 seconds... it seems a bit ancient. |
|
Can you try with recent compiler on linux like clang 19 or 20? I tried on linux and it works for me |
|
@kumaraditya303 I tried with gcc 13 and 15.2.1, I assume it is gcc at fault here. I haven't figured out a solution yet. Using the struct in some function makes the destructor get called once at process shutdown (and not at thread shutdown?! -- i.e. |
This reverts commit 83e2486.
This reverts commit 83e2486.
This PR enables the alloc cache of free-threading builds by making it thread local. Verified it locally by running all numpy tests under pytest-run-parallel and all passed.
cc @ngoldbaum