Skip to content

ENH: interpolate: fix concurrency issues throughout#20671

Merged
ev-br merged 17 commits intoscipy:mainfrom
andfoy:check_nogil_interpolate
Jul 16, 2024
Merged

ENH: interpolate: fix concurrency issues throughout#20671
ev-br merged 17 commits intoscipy:mainfrom
andfoy:check_nogil_interpolate

Conversation

@andfoy
Copy link
Copy Markdown
Contributor

@andfoy andfoy commented May 8, 2024

Reference issue

See gh-20669

What does this implement/fix?

This PR will add concurrency tests for the main interpolation classes in SciPy. This is done in order to potentially discover any concurrency, race conditions or deadlocks that may appear when using multithreaded Python.

Additional information

  • BSpline: State is initialized in constructor, it shouldn't have any concurrency issues at first glance.
  • PPoly (and family): State (aka polynomial coefficients) can mutate under extend method, all other calls depend on the state
  • RbfInterpolator: There are no functions that can mutate an instance state, thus is thread-safe, as long KDTree is safe: ENH: spatial: ensure thread-safety for KDTree #20655
  • Rbf: There are no functions that can mutate an instance state.
  • NdBSpline: State is initialized in constructor, no concurrency issues found
  • NdPPoly: State is initialized in constructor, no concurrency issues found
  • NearestNDInterpolator: State is initialized in constructor, no mutation methods are available. It depends on KDTree being thread-safe: ENH: spatial: ensure thread-safety for KDTree #20655
  • RegularGridInterpolator: A wrapped NdBSpline instance can mutate during the call, introducing unwanted side effects on a concurrent scenario. Fixed on 7eb6d59
  • LinearNDInterpolator/CloughTocherInterpolator: The NDInterpolatorBase does not have any state mutation methods. Since they depend on QHull access, concurrent calls will be serialized ENH: spatial: serialize concurrent calls to QHull #20619
  • UnivariateSpline/_BivariateSplineBase (and family): According to Mark fitpack sources as recursive #16053 (comment), all FORTRAN code that is declared with the recursive keyword is reentrant and thread safe, as long as there are not any variables declared with the save keyword. The port of FITPACK used by SciPy is not using save, and all routines are declared as recursive. Classes that use the FITPACK backend base classes (aka UnivariateSpline and _BivariateSplineBase) do not have any state-mutating public function different from the constructor, so they should be thread-safe in principle.

@github-actions github-actions bot added scipy.interpolate maintenance Items related to regular maintenance tasks labels May 8, 2024
@andfoy andfoy changed the title TST: Add concurrency tests throughout scipy.interpolate ENH: Fix concurrency issues throughout scipy.interpolate May 14, 2024
Comment on lines +1016 to +787
with self._c_lock:
pass

Copy link
Copy Markdown
Contributor Author

@andfoy andfoy May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an optimistic-lock; it expects that multiple threads to be calling non-mutating functions most of the time, allowing them to run concurrently. It behaves like a turnstile, allowing threads that arrived first to pass, and then blocks if another thread calls extend, until it finishes updating the polynomial coefficients.

While this solution may yield dirty-reads for threads that execute before extend, it allows for better concurrency by not serializing each call that depends on the coefficients, which would slowdown the overall execution time when multiple threads are calling instructions different from extend.

An alternative would be to signal in the docs that the API is not thread-safe with respect to extend and let the user handle the concurrency explicitly when a mixture of extend and other calls happen

Copy link
Copy Markdown
Member

@ev-br ev-br Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this comment, and most certainly a comment is needed not only on GH but also in code.

Also, if it's something to do with the extend method, can the lock move to extend? If not, why it's only in __call__ but not in other functions (.derivative, .integrate etc)

Copy link
Copy Markdown
Member

@ev-br ev-br Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one still confuses me, I've to admit. Let me try rephrasing my confusion:

  1. This with-stanza acquires and releases the lock, so that if two threads arrive simultaneously, one arrives to this line and the other is doing something else, that other one is stopped and waits until the lock is released -- -is this correct? If it is, why is it needed, why is it useful and how does this help anything?

  2. The very need for this is because of existence of .extend method which mutates the class state, is this correct? If correct,cannot one lock the .extend internals instead?

  3. What is this pattern, is it something known/standard or is it something you conjured out of thin air?

  4. Can you actually show a scenario where this is needed? And explain how it works in this scenario?
    Previous discussion seems to suggest something rather esoteric, with threaded use of reads and .extend --- a concrete scenario would really help.

Copy link
Copy Markdown
Contributor Author

@andfoy andfoy Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the first comment, this lock acts like a "turnstile with a block"; imagine n persons entering an amusement park through a turnstile, once they are in, they are independent of each other, despite entering in a serialized fashion. Now, imagine there's a service mode that the turnstile can enter in which it blocks while it goes maintenance, during this time no people can enter the park, until the service mode gets released.

This is the same scenario posed here, what's behind the wall is the code that can be run by each thread independent of each other, assuming that the polynomial coefficients haven't changed at all. What the so-called "service mode" in the analogy does is changing the coefficients while stopping everyone else from accessing them.

When the polynomial undergoes modifications (via extend), any thread that has already got in, will get a dirty read if they are trying to access the polynomial coefficients, that's why is called optimistic in principle, you'd expect that the number of calls to extend be way much lower than the calls to the actual interpolation function. In contrast, a pessimistic approach would be to block and serialize any call to the interpolation function, given that at any time a call to extend would lead to dirty reads for any thread.

While the pessimistic approach guarantees clean reads and consistency, it comes at the expense of performance, since the calls performed by n threads would be effectively serialized. The optimistic approach tries to balance performance at the expense of potential dirty reads that are expected to be low, since the number of calls to extend is expected to occur in a non-concurrent fashion or in few occurrences.

This is more of a thought experiment that can be an exotic scenario in the practice, that's why I posed the question about signaling in the docs that calls to extend should not be interleaved with concurrent calls to any of the interpolation subclasses that inherit from _PPolyBase, since it is not thread-safe.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, the turnstile analogy I get, thank you. That the implementation looks bizarre can be attributed to me not being used to this, so can discount that :-).

What I'm not sure about is the status of this:

  • this is a best-effort trick, to remedy some of free-threading usages, correct?
  • (if yes) is there a way to estimate how exotic should a user load be for this trick to not work?

Let's consider an example: https://docs.scipy.org/doc/scipy/tutorial/interpolate/extrapolation_examples.html#cubicspline-extend-the-boundary-conditions

  • if we keep land this PR as is, how would the example need changing?
  • if we land the PR without this turnstile, how would the example need changing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutation goes against the guidelines declared on https://github.com/Quansight-Labs/free-threaded-compatibility?tab=readme-ov-file#suggested-plan-of-attack, also SciPy should not anticipate exotic use cases like the one that initially motivated this whole "optimistic lock" approach.

The resolution in this case was to document the extend function and mark it as not thread-safe when used with interleaved interpolation methods.

@andfoy andfoy force-pushed the check_nogil_interpolate branch from 061aa6f to 7edd025 Compare May 14, 2024 21:27
@Kai-Striega
Copy link
Copy Markdown
Member

@andfoy this looks like a good initiative. Let me know when you require someone to review.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 16, 2024

Thanks @Kai-Striega, I'll expect to complete the overview by next week, however, feel free to comment and review on any of the current changes

@ev-br
Copy link
Copy Markdown
Member

ev-br commented May 16, 2024

The case of PPoly.extend is a curious one indeed.

  • the method mutates the state of self;
  • the only way to get into trouble in the nogil world is for a user to explicitly use multithreading on the python level.
  • the user load should be quite specific, maybe even contrived (N threads __call__ and (N+1)-th thread calls .extend, something like this)

Question: who's responsible for locking, 1) the library or 2) a user? More specifically, if there's a threading issue in user code, where the bug is: in the library or in the userland?

If 1), then every library in the ecosystem needs to add locking to all mutating methods. Thus any threading violation is valid bug report which we ought to fix.
If 2), it would be great to have a piece of documentation to point users to, ideally that docs contains a proper threading incantation with how to lock ("copy-paste this snippet and add level of indentation to your code").

ISTM there are two questions here, in fact. The first one is "who's responsible for locking" and the answer is, I strongly suspect is "it depends".
The second question is what is the default responsibility. If a method is mutating and is not explicitly documented as thread-safe or thread-unsafe, which one it is? I think explicit documentation is for exceptional cases, so the default is rather important to establish at least SciPy-wide, maybe even ecosystem-wide.

@lucascolley lucascolley changed the title ENH: Fix concurrency issues throughout scipy.interpolate ENH: interpolate: fix concurrency issues throughout May 17, 2024
@rgommers
Copy link
Copy Markdown
Member

ISTM there are two questions here, in fact. The first one is "who's responsible for locking" and the answer is, I strongly suspect is "it depends".

Not really - the library is responsible. Wherever we do something that isn't thread-safe, it requires a lock. That was already true before free-threaded CPython (e.g. we had a lock around messagestream in scipy._lib); the difference now is that a lot more code is potentially unsafe because the GIL no longer provides implicit locking for calls that go via a Python layer.

The second question is what is the default responsibility. If a method is mutating and is not explicitly documented as thread-safe or thread-unsafe,

For explicitly mutating methods (which should be very rare in SciPy), then of course the user has to be careful - doing that in multiple threads is not going to happen in a predefined order. I'm not sure why you're asking though - can you give an example of a SciPy function/method that is relevant here?

@ev-br
Copy link
Copy Markdown
Member

ev-br commented May 22, 2024

not sure why you're asking though - can you give an example of a SciPy function/method that is relevant here?

The original comment ( motivated by an offline discussion with @andfoy ) was about PPoly.extend. But in fact it's simpler:
PPoly.c is a public attribute, so a user may modify it in one thread and evaluate in another.

Basically, this PR adds locks all around. This guards against all sorts of contrived scenarios. And I'm talking about exactly that: what are the lengths we want to go to guard against contrived usage.

@ev-br
Copy link
Copy Markdown
Member

ev-br commented May 22, 2024

Now that it's testing threaded workloads, it'd be great to pick up tests cases from
#14162 and #11828

This is basically to double-check these issues do not make a comeback in a nogil world :-). This will also give us more confidence in general f2py wrapper fortran parts in various submodules.

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented May 22, 2024

The Alpine test segfault is fixed if scipy-openblas is used. I'll skip the test until #20585 gets merged

@rgommers
Copy link
Copy Markdown
Member

I'll skip the test until #20585 gets merged

That's done now at long last:)

@andfoy andfoy force-pushed the check_nogil_interpolate branch from faec9f3 to 4c7fb19 Compare May 29, 2024 23:58
@lucascolley lucascolley added enhancement A new feature or improvement and removed maintenance Items related to regular maintenance tasks labels May 30, 2024
@andfoy andfoy force-pushed the check_nogil_interpolate branch from 46ef6a3 to a593b37 Compare May 30, 2024 16:48
@rgommers rgommers added this to the 1.15.0 milestone Jun 2, 2024
@rgommers
Copy link
Copy Markdown
Member

rgommers commented Jun 2, 2024

@andfoy is this still draft, or ready to go?

@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented Jun 3, 2024

This is missing a minor test before review

@andfoy andfoy force-pushed the check_nogil_interpolate branch from a593b37 to 8a35122 Compare June 4, 2024 21:15
@lucascolley lucascolley added the free-threading Items related to supporting free-threaded (a.k.a. "no-GIL") builds of CPython label Jun 8, 2024
@andfoy andfoy marked this pull request as ready for review June 11, 2024 17:47
@andfoy
Copy link
Copy Markdown
Contributor Author

andfoy commented Jun 18, 2024

@ev-br, this one is ready for review!

Copy link
Copy Markdown
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very thorough.
However as long as we're to maintain this, there's a bunch of threadsafety noob questions, below.

Comment on lines +1016 to +787
with self._c_lock:
pass

Copy link
Copy Markdown
Member

@ev-br ev-br Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this comment, and most certainly a comment is needed not only on GH but also in code.

Also, if it's something to do with the extend method, can the lock move to extend? If not, why it's only in __call__ but not in other functions (.derivative, .integrate etc)

@andfoy andfoy force-pushed the check_nogil_interpolate branch from 3d323b1 to 7b37d77 Compare June 25, 2024 18:06
Copy link
Copy Markdown
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Am flagging it as "changes requested" partly to avoid this slipping in everybody's TODO lists)

FWIW, I'm still confused by the "optimistic lock" stanza. Would really appreciate some explanation. And if we're to maintain it, some explanations are really needed to stay code comments.

Also would be great to document some about the use of the test helper --- which looks great, just needs some more docs.

Comment on lines +1016 to +787
with self._c_lock:
pass

Copy link
Copy Markdown
Member

@ev-br ev-br Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one still confuses me, I've to admit. Let me try rephrasing my confusion:

  1. This with-stanza acquires and releases the lock, so that if two threads arrive simultaneously, one arrives to this line and the other is doing something else, that other one is stopped and waits until the lock is released -- -is this correct? If it is, why is it needed, why is it useful and how does this help anything?

  2. The very need for this is because of existence of .extend method which mutates the class state, is this correct? If correct,cannot one lock the .extend internals instead?

  3. What is this pattern, is it something known/standard or is it something you conjured out of thin air?

  4. Can you actually show a scenario where this is needed? And explain how it works in this scenario?
    Previous discussion seems to suggest something rather esoteric, with threaded use of reads and .extend --- a concrete scenario would really help.

@andfoy andfoy force-pushed the check_nogil_interpolate branch 2 times, most recently from 98b0f70 to 9ea9c0b Compare July 9, 2024 20:09
@ev-br
Copy link
Copy Markdown
Member

ev-br commented Jul 10, 2024

This now LGTM, modulo the last small ask: could you squash the commits somewhat? Esp the debugging back-and-forth on alpine etc.

@andfoy andfoy force-pushed the check_nogil_interpolate branch from 9ea9c0b to 3140dca Compare July 10, 2024 17:04
@ev-br
Copy link
Copy Markdown
Member

ev-br commented Jul 11, 2024

Can you also remove the change to cobyqa? I think a git submodule update --init would fix it.

@andfoy andfoy force-pushed the check_nogil_interpolate branch from 4f4c078 to 268e03a Compare July 15, 2024 17:47
Copy link
Copy Markdown
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now!
Doc build failures are unrelated, #21195
Thank you @andfoy , great to see the free-threading test suite and fixes.

@ev-br ev-br merged commit b497017 into scipy:main Jul 16, 2024
@andfoy andfoy deleted the check_nogil_interpolate branch July 16, 2024 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement A new feature or improvement free-threading Items related to supporting free-threaded (a.k.a. "no-GIL") builds of CPython scipy.interpolate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants