Skip to content

ENH: spatial: ensure thread-safety for KDTree#20655

Closed
andfoy wants to merge 9 commits intoscipy:mainfrom
andfoy:test_kdtree_concurrency
Closed

ENH: spatial: ensure thread-safety for KDTree#20655
andfoy wants to merge 9 commits intoscipy:mainfrom
andfoy:test_kdtree_concurrency

Conversation

@andfoy
Copy link
Copy Markdown
Contributor

@andfoy andfoy commented May 6, 2024

Reference issue

See gh-20669

What does this implement/fix?

This PR adds a quick check to ensure that concurrent calls to KDTree do not cause any unexpected errors. This is specially handy in the context of No-GIL CPython.

Additional information

In principle, concurrent access to KDTree is allowed out-of-the-box, since there are no methods that can mutate the state of the tree besides the constructor.

@github-actions github-actions bot added scipy.spatial maintenance Items related to regular maintenance tasks labels May 6, 2024
@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

I am a bit uncertain if property .tree is thread-safe or not. It might be sufficiently serialized by the GIL though. It can mutate the inner state.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

In context of No-GIL CPython, there are several things that are not thread-safe. For example:

  • Calling query methods or pickling before __init__ has completed.
  • Property .tree
  • Serialization (depickling), I think
  • Rouge thread making call to __init__ (even extra calls cause problems, even now)

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 7, 2024

Thanks for the pointers @sturlamolden. I have the following questions and comments:

  1. Just to clarify, should we be locking the access to .tree before it gets initialized? If I understood correctly, the potential race condition is that T1 creates a KDTree instance with many points so the construction is deferred, then other thread T2 is created with a reference to the same tree as T1 and tries to call any method while the construction takes place in T1? I don't know if this is possible under No-GIL Python, maybe @ngoldbaum has more ideas about this kind of scenario?
  2. Is there a mechanism by which the tree state gets modified during a query call? I took a peek at some of the function definitions, and while concurrent calls would be sharing access to the tree values, they are not being modified on any capacity, unless there is some detail that we are overlooking here?
  3. Serialization is a good point regarding concurrency, however, since there's not a method that is able to add new points or modify the tree (assuming that 2 holds), there shouldn't be any issue regarding state storage. The only problem would be multiple threads trying to write to the same file, which should be (in my opinion) responsibility of the end user

@sturlamolden
Copy link
Copy Markdown
Contributor

The issue with .tree is that it is an expensive structure to create, and was added for the purpose of introspection and testing, i.e. to verify that a kd-tree is created as expected or for education (show students how it partitions space and data). Since it is normally not needed, its creation is deferred to when someone actually asks for it.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

The more serious issue is that __init__ is serialized by the GIL. Also it needs to complete before query functions are called. If we do not have the GIL, we need to serialize it in such a way that is has completed before queries are allowed. However, we should not serialize query functions, only have them wait for the __init__ to finish, so I guess we could set a threading.Event upon finishing __init__ and have query functions wait for it. Normally it will be set before they are called. We must also use a threading.Lock within __init__. It can also be used to serialize access to .tree.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 7, 2024

Now I understand everything, so, basically we should lock on tree, and also add an atomic flag that signals if the tree structure is ready to use.

@sturlamolden
Copy link
Copy Markdown
Contributor

__init__ is a callable Python function. So it must be serialized by a lock and it must set an atomic flag upon completion.

.tree must wait for the flag and also be internally serialized by an rlock.

Query functions must wait for the atomic flag set by __init__.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 7, 2024

Thanks for the helpful discussion @sturlamolden!

@sturlamolden
Copy link
Copy Markdown
Contributor

No problem. When I wrote the code I depended on the GIL for serialization. For example it ensures that __init__ has completed before queries can be called.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

Actually it is already broken as __init__ now calls Python functions instead of the NumPy C API directly, so context switch on the GIL is possible while running __init__. It is hard to create though, but it could result in a race conditon or segfault. Which means it needs the serialization described for No-GIL CPython.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 7, 2024

Great! Then this PR will fix the concurrency issues in KDTree, let me add the changes and update the title

@andfoy andfoy changed the title TST: Add test to check for thread-safety in KDTree ENH: Ensure thread-safety for KDTree May 7, 2024
@andfoy andfoy changed the title ENH: Ensure thread-safety for KDTree ENH: spatial: Ensure thread-safety for KDTree May 7, 2024
@andfoy andfoy changed the title ENH: spatial: Ensure thread-safety for KDTree ENH: spatial: ensure thread-safety for KDTree May 7, 2024
@lucascolley lucascolley added enhancement A new feature or improvement free-threading Items related to supporting free-threaded (a.k.a. "no-GIL") builds of CPython and removed maintenance Items related to regular maintenance tasks labels May 7, 2024
@sturlamolden
Copy link
Copy Markdown
Contributor

Note that my comments refer to cKDTree. KDTree is a Python wrapper on top of it so it will be necessary to fix both.

@ngoldbaum
Copy link
Copy Markdown
Contributor

I don't know if this is possible under No-GIL Python, maybe @ngoldbaum has more ideas about this kind of scenario?

Yes, in free-threaded python when the GIL is disabled then something needs to explicitly lock access to prevent data races. If the data race happens in python-level code, then a threading.Lock or other python synchronization primitive is needed.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

We already make use of the threading module in cKDTree´s Cython code so it is not just for Python code. The problems I mentioned happens in Cython level code, which should have been protected by the GIL.

@ngoldbaum
Copy link
Copy Markdown
Contributor

Right, C-level code will need to use C-level locking primitives.

@sturlamolden
Copy link
Copy Markdown
Contributor

Use the primitives in threading like the rest of cKDTree, not cpython.pythread.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented May 7, 2024

Also you cannot use a bool as an atomic flag. It must be an atomic variable, which a bool is not. Either you use platform-dependet C++ code for atomic read and write on a std::atomic<bool>, or you use threading.Event which takes care of these details. Basically to the latter, because we do not need that kind of microoptimizations here. Also if you use atomic read and write, you must provide your own spinlocks in the query functions, which I am certain we do not want to mess with. Just use threading.Event.

@andfoy andfoy force-pushed the test_kdtree_concurrency branch from 254971c to 14a0751 Compare May 30, 2024 16:49
Copy link
Copy Markdown
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locking of cKDTree.tree looks good to go. I don't think the added test currently tests the right thing though - it calls .query_ball_point, but that doesn't call .tree right? It seems to me that a second test, or a second for-loop in the same test, should try to directly access .tree from inside threads.

It would also be good to add a summary of how the added test fails under free-threaded CPython if the cKDTree code does not get a lock (what's the failure mode, is the test failing 100% of the time or intermittently).

@andfoy andfoy force-pushed the test_kdtree_concurrency branch from 14a0751 to 41f8698 Compare July 9, 2024 20:33
@rgommers
Copy link
Copy Markdown
Member

@andfoy is this ready again from your perspective?

@sturlamolden
Copy link
Copy Markdown
Contributor

I think we need a lock in all the query methods as well.
with self._three_lock: pass
should suffice.

The pickle methods also need serialization.

Currently it only puts a lock on the initialization, which is far from sufficient.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented Jul 11, 2024

@rgommers I need to take @sturlamolden considerations into account, also the added test needs some fixing/debug

@andfoy andfoy force-pushed the test_kdtree_concurrency branch from bf189a5 to c9a80e0 Compare January 24, 2025 15:51
@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented Jan 24, 2025

I undertook a detailed look throughout all the query methods in C++ and I couldn't find any instruction that could mutate the tree state, therefore those methods should not require any locking.

@sturlamolden
Copy link
Copy Markdown
Contributor

I undertook a detailed look throughout all the query methods in C++ and I couldn't find any instruction that could mutate the tree state, therefore those methods should not require any locking.

The query methods do not mutate the tree state themselves, but oher methods may while the queries are running. This can cause segfaults or errorneous results. The queries muct therefore lock the kd-tree for self-preservation.

Your comment indicate a lack of understanding of concurrent systems. If you do not understand something this simple, please do not contribute code that uses or controls concurrency. And I am sorry for using direct language, I am Scandinavian.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented Jan 24, 2025

@sturlamolden, sorry, but you shouldn't assume the level of knowledge of contributors, neither impose conditions about who should contribute to which topics and not. For your information, I dedicated part of my studies to study concurrency, parallelism and distribution.

I'm sorry, but being Scandinavian doesn't exclude you from respecting people.

I'm going to apply the changes, given that you know the codebase better.

Just for clarity, the comment was made assuming that users do not interleave mutating calls from queries to the tree, which delegates concurrency handling to them as opposed to the library trying to anticipate use-cases, given that locking causes performance drops under parallelism.

@rgommers
Copy link
Copy Markdown
Member

@sturlamolden please don't tell others not to contribute, your comment seems a little insulting for no reason. You're effectively one of the few maintainers of KDTree, so if you wanted to do the work yourself instead of reviewing someone else's work, it'd be fine to request that. But absent such a request: this work to make KDTree thread-safe needs doing, and I appreciate @andfoy working on it (he has also made a number of previous thread-safety related contributions that almost all got merged already and are working well).

The query methods do not mutate the tree state themselves, but other methods may while the queries are running.

It would be helpful to identify those methods. The docs may be incomplete; there are only 6 documented methods and they all look safe.

If there are public methods (or direct attribute access) to mutate the data structure, then the optimal approach is probably somewhere in the middle: some sort of locking is needed, but it is possible to run several query methods in parallel when those don't mutate themselves. This can be done with a "multiple readers single writer" lock for example, e.g. std::shared_mutex.

@sturlamolden
Copy link
Copy Markdown
Contributor

sturlamolden commented Jan 27, 2025

If there are public methods (or direct attribute access) to mutate the data structure, then the optimal approach is probably somewhere in the middle: some sort of locking is needed, but it is possible to run several query methods in parallel when those don't mutate themselves. This can be done with a "multiple readers single writer" lock for example, e.g. std::shared_mutex.

That is in effect what I suggested for the query functions.

with self._three_lock: pass

will wait for the writer (the setup functions) to finish, while not blocking out readers (the query functions). This must be the first thing the query functions do.

@crusaderky
Copy link
Copy Markdown
Contributor

Hello, any update on this?

@ngoldbaum
Copy link
Copy Markdown
Contributor

ngoldbaum commented Mar 12, 2026

I opened #24799 which supersedes this PR. See my PR description for why I didn't start with the code here.

@rgommers
Copy link
Copy Markdown
Member

Superseded by gh-24799, so closing. Thanks @andfoy & reviewers.

@rgommers rgommers closed this Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement A new feature or improvement free-threading Items related to supporting free-threaded (a.k.a. "no-GIL") builds of CPython scipy.spatial

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants