Skip to content

Fixup a refnanny-related performance regression#5685

Merged
scoder merged 10 commits intocython:masterfrom
da-woods:refnanny-nogil
Sep 29, 2023
Merged

Fixup a refnanny-related performance regression#5685
scoder merged 10 commits intocython:masterfrom
da-woods:refnanny-nogil

Conversation

@da-woods
Copy link
Copy Markdown
Contributor

@da-woods da-woods commented Sep 6, 2023

Originally reported in scikit-learn/scikit-learn#27086

Essentially:

  • I'd made functions raising an exception require refnanny even since prange/parallel block exception handling required it
  • I actually don't think this is sufficient and that there's other ways of raising an exception within a parallel block.
  • The upshot is that any function with a call to a function with a checked exception now requires refnanny.
  • For functions with object arguments (i.e. any cdef class method) that leads to the GIL always being acquired around the refnanny setup (even when refnanny is disabled).

I've fixed it by avoiding using refnanny in parallel blocks unless we know it's available. I think it's too hard to tell reliably ahead of time (at least for me).

I think the have_object_args and has_with_gil_block criteria for acquiring the GIL around function definitions might be excessive, and possibly unnecessary some of the time. It's possibly mainly reassigned object args. This might be worth refactoring but I don't think it's necessary to fix this particular performance regression.

Closes #5700

See scikit-learn/scikit-learn#27086

Essentially:
* I'd made functions raising an exception require refnanny even
  since prange/parallel block exception handling required it
* I actually don't think this is sufficient and that there's other
  ways of raising an exception within a parallel block.
* The upshot is that *any* function with a call to a function with
  a checked exception now requires refnanny.
* For functions with object arguments (i.e. any cdef class method)
  that leads to the GIL always being acquired around the refnanny
  setup (even when refnanny is disabled).

I've fixed it by avoiding using refnanny in parallel blocks
unless we know it's available. I think it's too hard to tell
reliably ahead of time (at least for me).

I think the `have_object_args and has_with_gil_block` criteria
for acquiring the GIL around function definitions might be
excessive, and possibly unnecessary some of the time. It's possibly
mainly reassigned object args. This might be worth refactoring
but I don't think it's necessary to fix this particular
performance regression.
@scoder
Copy link
Copy Markdown
Contributor

scoder commented Sep 7, 2023

I haven't seen the resulting code, but if this fixes the problem at hand, it's probably ok for the time being.

If the issue is acquiring the GIL uselessly at call time just for the (unused) refnanny, then

  1. __Pyx_RefNannySetupContext() can acquire the GIL itself at need, it just needs 1 as second argument
  2. we might be able to generate the refnanny setup and the entry acquire-GIL/release-GIL dance in retrospect after generating the whole function, when we know if we actually needed it along the way or not.

@da-woods
Copy link
Copy Markdown
Contributor Author

da-woods commented Sep 8, 2023

we might be able to generate the refnanny setup and the entry acquire-GIL/release-GIL dance in retrospect after generating the whole function, when we know if we actually needed it along the way or not.

Yes - that's probably the way to do it for both of them. Unless there's another release imminent I'll try to do that rather than merging this right now.

__Pyx_RefNannySetupContext() can acquire the GIL itself at need, it just needs 1 as second argument

Yes - and that'd presumably be a non-op without the refnanny so be reasonably OK. I think the issue was a combination of refnanny + entry-require/release code that wasn't actually needed.

@djhoese
Copy link
Copy Markdown

djhoese commented Sep 8, 2023

Sorry for commenting on this pull request, but I was working on a MCVE so I could file a bug report for what I think this PR is trying to fix. I don't know enough about the ref nanny, but it sounds like if I have a class method, then even if it is nogil the ref nanny will still need to be created and will usually (?) need the GIL to be created?

I have a decently simple method that accepts a ton of memory views and doesn't return anything (void), but it does call other nogil methods of the class. With just this basic description, does this sound like a place where the ref nanny is needed and should acquire the GIL. The generated C has this at the top of the declaration:

static void __pyx_fuse_1__pyx_f_12geotiepoints_19_modis_interpolator_17MODISInterpolator__get_atrack_ascan(struct __pyx_obj_12geotiepoints_19_modis_interpolator_MODISInterpolator *__pyx_v_self, __Pyx_memviewslice __pyx_v_satz, __Pyx_memviewslice __pyx_v_x, __Pyx_memviewslice __pyx_v_y, __Pyx_memviewslice __pyx_v_c_exp_coarse, __Pyx_memviewslice __pyx_v_c_ali_coarse, __Pyx_memviewslice __pyx_v_c_exp_fine, __Pyx_memviewslice __pyx_v_c_ali_fine, __Pyx_memviewslice __pyx_v_a_track, __Pyx_memviewslice __pyx_v_a_scan) {
  __Pyx_memviewslice __pyx_v_satz_a_view = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_satz_b_view = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_c_exp_view2 = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_c_ali_view2 = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_RefNannyDeclarations
  __Pyx_memviewslice __pyx_t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  #ifdef WITH_THREAD
  PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
  #endif
  __Pyx_RefNannySetupContext("__pyx_fuse_1_get_atrack_ascan", 0);
  #ifdef WITH_THREAD
  __Pyx_PyGILState_Release(__pyx_gilstate_save);
  #endif

If this GIL acquire is expected, then if I extracted this method to be outside of the class would you expect the GIL acquire and ref nanny to be unnecessary?

@da-woods
Copy link
Copy Markdown
Contributor Author

da-woods commented Sep 8, 2023

@djhoese that looks like exactly the same bug and should be fixed by this PR. Thanks for your efforts towards preparing a good bug report though, even if not needed

@djhoese
Copy link
Copy Markdown

djhoese commented Sep 8, 2023

Ok so this is a bug. Good because I'm having trouble recreating it in a simple example.

@djhoese
Copy link
Copy Markdown

djhoese commented Sep 8, 2023

FYI I tried installing this PR and running my code through it and the Ensure/Release around the __Pyx_RefNannySetupContext is still there.

@da-woods
Copy link
Copy Markdown
Contributor Author

da-woods commented Sep 8, 2023

FYI I tried installing this PR and running my code through it and the Ensure/Release around the __Pyx_RefNannySetupContext is still there.

I'm going to have a bit of a rewrite of this PR with a slightly more intelligent approach. So possibly wait for that before spending too much time filing a separate bug.

@djhoese
Copy link
Copy Markdown

djhoese commented Sep 8, 2023

Yeah I've been walking through the Node.py code and figuring out what the difference is between my test code that doesn't reproduce the issue and the realworld code that does. Currently getting lost in if self is considered a python argument and if that is different in my test code.

@scoder
Copy link
Copy Markdown
Contributor

scoder commented Sep 9, 2023

We have neither a ticket nor a reproducing test, though. I think both would help.

@djhoese
Copy link
Copy Markdown

djhoese commented Sep 9, 2023

Ok I tracked it down quite a bit. Here's the example:

# cython: language_level=3, boundscheck=False, cdivision=True, wraparound=False, initializedcheck=False, nonecheck=False
cimport cython
import numpy as np
cimport numpy as np

np.import_array()

def test_func():
    cdef np.ndarray[float, ndim=2] arr = np.zeros((5, 5), dtype=np.float32)
    cdef float[:, ::1] arr_view = arr
    t = Test(5.0)
    t.call_me(arr_view)


cdef class Test:

    cdef float _a

    def __cinit__(self, float a):
        self._a = a

    cdef void call_me(self, float[:, ::1] my_arr) noexcept:
        with nogil:
            self._call_me(my_arr)

    cdef void _call_me(self, float[:, ::1] my_arr) noexcept nogil:
        cdef Py_ssize_t idx
        cdef float[:, :] my_arr2 = _get_upper_left_corner(my_arr)
        for idx in range(my_arr2.shape[0]):
            my_arr2[idx, 0] = self._a


cdef inline float[:, :] _get_upper_left_corner(float[:, ::1] arr) noexcept nogil:
    return arr[:3, :3]

This is a version from my other recent issues like #5670 and the original SO question here: https://stackoverflow.com/questions/76972950/cython-returned-memoryview-is-always-considered-uninitialized/76974615

The result of those discussions was that declaring a C function that returns a memory view as noexcept was pointless because there would always be some exception handling on the error case of slicing/clearing/initializing the memoryviews. However, in the above example if you remove noexcept from _get_upper_left_corner then the function is marked as having exception handling so it hits this block:

if (func_type.exception_value is not None or
func_type.exception_check and func_type.exception_check != '+'):
if env.nogil:
# need to generate refnanny for testing because of exception check
env.has_with_gil_block = True

Setting has_with_gil_block which makes acquire_gil_for_var_decls_only True here:

https://github.com/da-woods/cython/blob/81db672ab7ab6510ef6f761f1ce481d731b8b583/Cython/Compiler/Nodes.py#L2066-L2068

I'm still not entirely sure why this generates the C code exactly as it does (GIL ensure/release only wrapping a refnanny), but my guess is that it is ensuring the GIL for variable declarations but I don't have any variables that need it here. It thinks that with the with gil (that doesn't actually exist) and the self argument as a Python object that it needs to declare variables with the GIL...I think. About here in the code:

https://github.com/da-woods/cython/blob/81db672ab7ab6510ef6f761f1ce481d731b8b583/Cython/Compiler/Nodes.py#L2066-L2079

Then it does the refnanny because it is a nogil function and thinks it has a with gil block internally. Then when it is done "declaring variables" and the ref nanny it releases the GIL. There just weren't actually any variables that needed to be declared. At least that's my guess.

Comment on lines +2193 to +2194
if type.refcounting_tracked_by_refnanny and self.funcstate:
self.funcstate.needs_refnanny = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this code multiplication is awful. Why not at least move this into PyObjectType? In the methods there, you could just always say code.funcstate.needs_refnanny = nanny/True, without further conditions.

That said, if these refcounting code bits are really the only point where we can take this decision, then the granularity is ok. But the question remains if we need to decide it here, at the very bottom of the code generation. Isn't it enough to check the types of named variables and temp variables after generating the function? Or, if we have local (parallel?) temp scoping for some reason, let the code funcstate remember whether it created any Python object temps along the way, and check for that flag.

You probably put a lot of thoughts and effort into this, so, could you explain why keeping the decision in the function generation code (rather than the low-level refcounting code) can't work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this code multiplication is awful. Why not at least move this into PyObjectType? In the methods there, you could just always say code.funcstate.needs_refnanny = nanny/True, without further conditions.

That would be better - I'd missed that as an option. I'll do that

But the question remains if we need to decide it here, at the very bottom of the code generation. Isn't it enough to check the types of named variables and temp variables after generating the function? Or, if we have local (parallel?) temp scoping for some reason, let the code funcstate remember whether it created any Python object temps along the way, and check for that flag.

That's what we did before, and it mostly works. I think we'd mainly need to add a flag that things creating local Python variables can set. Where our current code fails a little is for Python object arguments that aren't reassigned (e.g. self quite frequently) which don't need reference counting, but get considered like they do. The nice thing about just tracking all decref/gotref/incref/etc is that it should give the right answer.

Comment on lines +1386 to +1387
if code.funcstate:
code.funcstate.needs_refnanny = code.funcstate.needs_refnanny or nanny
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if code.funcstate:
code.funcstate.needs_refnanny = code.funcstate.needs_refnanny or nanny
if code.funcstate and nanny:
code.funcstate.needs_refnanny = True

Copy link
Copy Markdown
Contributor Author

@da-woods da-woods Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

funcstate looks to be unset in the module deallocation functions. nanny is set to false though, so testing it first is enough to not need the funcstate check I think.


def generate_incref(self, code, cname, nanny):
if nanny:
if code.funcstate:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case do we not have a funcstate? Maybe we should get rid of that case instead?
I can't think of a place where we generated increfs outside of user/wrapper functions or the module init function.

Comment on lines +42 to +44
// Note - I'm using a 5 second timeout here to avoid the test
// being "deadlock forever". This could occasionally give false
// failures if things are running really slowly
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be annoying. I dislike uncontrolled concurrency in tests for pretty much this reason. Threads are one of those tools that programmers easily consider appropriate and that rarely are. Especially if test failures necessarily make the code run into timeouts, rather than failing timely.

Some people, when confronted with a problem, say "I know, I'll use threads". And now problems two have they. :)

Can't we use a C code validation check that errors if it finds any refnanny usage before the module init function, or something like that? That seems doable with a regex. (Which, admittedly, is the other tool which gives you two problems from one for free…)

We could also extend the C code validation to allow start and end markers, to match a specific function or section. sed uses /startregex/,/endregex/ command, for example, where start and end are optional. For our use case, we could use something like /startregex/,/endregex/ innerregex, and fail if the inner regex is (not) found, as well as if start or end are not found (because we always expect those to be present either way).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved C code validation is implemented in 09bb9e1

Pattern format is /start/:/end/pattern, or something like /start/ : /end/ pattern if you prefer adding whitespace. Both start and end are optional.

Copy link
Copy Markdown
Contributor Author

@da-woods da-woods Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a C code validation test (but not today...).

I'd like to leave this test in though - I think it is the "right" test and making sure the actual problem doesn't recur.. I think the only thing that is non-deterministic is the slight chance it takes more than 5 seconds. But realistically that's a huge overestimate of the amount of time it needs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some C code tests. I think we should skip the more complicated patterns on annotated HTML though - they seem almost impossible to make work there

@scoder scoder merged commit b622247 into cython:master Sep 29, 2023
@scoder
Copy link
Copy Markdown
Contributor

scoder commented Sep 29, 2023

Thanks for fixing this. It's sad to have performance regressions in Cython 3.0 compared to 0.29.x that keep people from adopting it.

@da-woods da-woods deleted the refnanny-nogil branch September 29, 2023 10:20
@scoder
Copy link
Copy Markdown
Contributor

scoder commented Oct 1, 2023

There is a bug in the logic for closure functions. We generate the "ensure GIL" part (due to var_decls_definitely_need_gil), but not the "release GIL" part. You can see the unused variable '__pyx_gilstate_save' messages in the test logs. We incref the closure reference, so it's actually needed, but it's never cleaned up (unless lenv.nogil is set, which shouldn't be the case for closure functions).

@scoder
Copy link
Copy Markdown
Contributor

scoder commented Oct 2, 2023

There is a bug in the logic for closure functions.

Resolved in #5738

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Gil required but seems unnecessary

4 participants