Fixup a refnanny-related performance regression#5685
Conversation
See scikit-learn/scikit-learn#27086 Essentially: * I'd made functions raising an exception require refnanny even since prange/parallel block exception handling required it * I actually don't think this is sufficient and that there's other ways of raising an exception within a parallel block. * The upshot is that *any* function with a call to a function with a checked exception now requires refnanny. * For functions with object arguments (i.e. any cdef class method) that leads to the GIL always being acquired around the refnanny setup (even when refnanny is disabled). I've fixed it by avoiding using refnanny in parallel blocks unless we know it's available. I think it's too hard to tell reliably ahead of time (at least for me). I think the `have_object_args and has_with_gil_block` criteria for acquiring the GIL around function definitions might be excessive, and possibly unnecessary some of the time. It's possibly mainly reassigned object args. This might be worth refactoring but I don't think it's necessary to fix this particular performance regression.
|
I haven't seen the resulting code, but if this fixes the problem at hand, it's probably ok for the time being. If the issue is acquiring the GIL uselessly at call time just for the (unused) refnanny, then
|
Yes - that's probably the way to do it for both of them. Unless there's another release imminent I'll try to do that rather than merging this right now.
Yes - and that'd presumably be a non-op without the refnanny so be reasonably OK. I think the issue was a combination of refnanny + entry-require/release code that wasn't actually needed. |
|
Sorry for commenting on this pull request, but I was working on a MCVE so I could file a bug report for what I think this PR is trying to fix. I don't know enough about the ref nanny, but it sounds like if I have a class method, then even if it is I have a decently simple method that accepts a ton of memory views and doesn't return anything ( static void __pyx_fuse_1__pyx_f_12geotiepoints_19_modis_interpolator_17MODISInterpolator__get_atrack_ascan(struct __pyx_obj_12geotiepoints_19_modis_interpolator_MODISInterpolator *__pyx_v_self, __Pyx_memviewslice __pyx_v_satz, __Pyx_memviewslice __pyx_v_x, __Pyx_memviewslice __pyx_v_y, __Pyx_memviewslice __pyx_v_c_exp_coarse, __Pyx_memviewslice __pyx_v_c_ali_coarse, __Pyx_memviewslice __pyx_v_c_exp_fine, __Pyx_memviewslice __pyx_v_c_ali_fine, __Pyx_memviewslice __pyx_v_a_track, __Pyx_memviewslice __pyx_v_a_scan) {
__Pyx_memviewslice __pyx_v_satz_a_view = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_satz_b_view = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_c_exp_view2 = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_c_ali_view2 = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_RefNannyDeclarations
__Pyx_memviewslice __pyx_t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
int __pyx_lineno = 0;
const char *__pyx_filename = NULL;
int __pyx_clineno = 0;
#ifdef WITH_THREAD
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
#endif
__Pyx_RefNannySetupContext("__pyx_fuse_1_get_atrack_ascan", 0);
#ifdef WITH_THREAD
__Pyx_PyGILState_Release(__pyx_gilstate_save);
#endifIf this GIL acquire is expected, then if I extracted this method to be outside of the class would you expect the GIL acquire and ref nanny to be unnecessary? |
|
@djhoese that looks like exactly the same bug and should be fixed by this PR. Thanks for your efforts towards preparing a good bug report though, even if not needed |
|
Ok so this is a bug. Good because I'm having trouble recreating it in a simple example. |
|
FYI I tried installing this PR and running my code through it and the Ensure/Release around the |
I'm going to have a bit of a rewrite of this PR with a slightly more intelligent approach. So possibly wait for that before spending too much time filing a separate bug. |
|
Yeah I've been walking through the Node.py code and figuring out what the difference is between my test code that doesn't reproduce the issue and the realworld code that does. Currently getting lost in if |
|
We have neither a ticket nor a reproducing test, though. I think both would help. |
|
Ok I tracked it down quite a bit. Here's the example: This is a version from my other recent issues like #5670 and the original SO question here: https://stackoverflow.com/questions/76972950/cython-returned-memoryview-is-always-considered-uninitialized/76974615 The result of those discussions was that declaring a C function that returns a memory view as cython/Cython/Compiler/ExprNodes.py Lines 6345 to 6349 in dd62cba Setting I'm still not entirely sure why this generates the C code exactly as it does (GIL ensure/release only wrapping a refnanny), but my guess is that it is ensuring the GIL for variable declarations but I don't have any variables that need it here. It thinks that with the Then it does the refnanny because it is a |
Cython/Compiler/Code.py
Outdated
| if type.refcounting_tracked_by_refnanny and self.funcstate: | ||
| self.funcstate.needs_refnanny = True |
There was a problem hiding this comment.
Wow, this code multiplication is awful. Why not at least move this into PyObjectType? In the methods there, you could just always say code.funcstate.needs_refnanny = nanny/True, without further conditions.
That said, if these refcounting code bits are really the only point where we can take this decision, then the granularity is ok. But the question remains if we need to decide it here, at the very bottom of the code generation. Isn't it enough to check the types of named variables and temp variables after generating the function? Or, if we have local (parallel?) temp scoping for some reason, let the code funcstate remember whether it created any Python object temps along the way, and check for that flag.
You probably put a lot of thoughts and effort into this, so, could you explain why keeping the decision in the function generation code (rather than the low-level refcounting code) can't work?
There was a problem hiding this comment.
Wow, this code multiplication is awful. Why not at least move this into
PyObjectType? In the methods there, you could just always saycode.funcstate.needs_refnanny = nanny/True, without further conditions.
That would be better - I'd missed that as an option. I'll do that
But the question remains if we need to decide it here, at the very bottom of the code generation. Isn't it enough to check the types of named variables and temp variables after generating the function? Or, if we have local (parallel?) temp scoping for some reason, let the code funcstate remember whether it created any Python object temps along the way, and check for that flag.
That's what we did before, and it mostly works. I think we'd mainly need to add a flag that things creating local Python variables can set. Where our current code fails a little is for Python object arguments that aren't reassigned (e.g. self quite frequently) which don't need reference counting, but get considered like they do. The nice thing about just tracking all decref/gotref/incref/etc is that it should give the right answer.
Cython/Compiler/PyrexTypes.py
Outdated
| if code.funcstate: | ||
| code.funcstate.needs_refnanny = code.funcstate.needs_refnanny or nanny |
There was a problem hiding this comment.
| if code.funcstate: | |
| code.funcstate.needs_refnanny = code.funcstate.needs_refnanny or nanny | |
| if code.funcstate and nanny: | |
| code.funcstate.needs_refnanny = True |
There was a problem hiding this comment.
funcstate looks to be unset in the module deallocation functions. nanny is set to false though, so testing it first is enough to not need the funcstate check I think.
Cython/Compiler/PyrexTypes.py
Outdated
|
|
||
| def generate_incref(self, code, cname, nanny): | ||
| if nanny: | ||
| if code.funcstate: |
There was a problem hiding this comment.
In which case do we not have a funcstate? Maybe we should get rid of that case instead?
I can't think of a place where we generated increfs outside of user/wrapper functions or the module init function.
| // Note - I'm using a 5 second timeout here to avoid the test | ||
| // being "deadlock forever". This could occasionally give false | ||
| // failures if things are running really slowly |
There was a problem hiding this comment.
Yeah, that would be annoying. I dislike uncontrolled concurrency in tests for pretty much this reason. Threads are one of those tools that programmers easily consider appropriate and that rarely are. Especially if test failures necessarily make the code run into timeouts, rather than failing timely.
Some people, when confronted with a problem, say "I know, I'll use threads". And now problems two have they. :)
Can't we use a C code validation check that errors if it finds any refnanny usage before the module init function, or something like that? That seems doable with a regex. (Which, admittedly, is the other tool which gives you two problems from one for free…)
We could also extend the C code validation to allow start and end markers, to match a specific function or section. sed uses /startregex/,/endregex/ command, for example, where start and end are optional. For our use case, we could use something like /startregex/,/endregex/ innerregex, and fail if the inner regex is (not) found, as well as if start or end are not found (because we always expect those to be present either way).
There was a problem hiding this comment.
Improved C code validation is implemented in 09bb9e1
Pattern format is /start/:/end/pattern, or something like /start/ : /end/ pattern if you prefer adding whitespace. Both start and end are optional.
There was a problem hiding this comment.
I'll add a C code validation test (but not today...).
I'd like to leave this test in though - I think it is the "right" test and making sure the actual problem doesn't recur.. I think the only thing that is non-deterministic is the slight chance it takes more than 5 seconds. But realistically that's a huge overestimate of the amount of time it needs
There was a problem hiding this comment.
I've added some C code tests. I think we should skip the more complicated patterns on annotated HTML though - they seem almost impossible to make work there
|
Thanks for fixing this. It's sad to have performance regressions in Cython 3.0 compared to 0.29.x that keep people from adopting it. |
|
There is a bug in the logic for closure functions. We generate the "ensure GIL" part (due to |
Resolved in #5738 |
Originally reported in scikit-learn/scikit-learn#27086
Essentially:
I've fixed it by avoiding using refnanny in parallel blocks unless we know it's available. I think it's too hard to tell reliably ahead of time (at least for me).
I think the
have_object_args and has_with_gil_blockcriteria for acquiring the GIL around function definitions might be excessive, and possibly unnecessary some of the time. It's possibly mainly reassigned object args. This might be worth refactoring but I don't think it's necessary to fix this particular performance regression.Closes #5700