|
The signal caching behaviour is pretty suspicious for multicore as well, which relies on being able to interrupt another domain by poking young_limit. I wonder whether the new efficient access to runtime variables via |
|
(precheck 474 now running) |
|
Thanks! Indeed, |
|
Note that
That's something to discuss. I thought it was using... signals!
On RISC processors the caching makes the allocation sequence one instruction shorter, besides saving one load, of course. This is good for performance. I guess some benchmarking is in order. |
When the processor gives you 32 or more integer registers, I think it's worth (performance-wise) dedicating one register to the allocation limit. That's a judgment call, not supported by actual benchmarks. |
|
Can confirm this also fixes the signals_alloc failure on arm64 for me, and it didn't occur on arm32 before either (since that doesn't have a spare register for the allocation pointer). |
|
There seems to be a problem with INRIA CI's arm machines, but I would like to go ahead and merge this, based on PR review, confirmation by @avsm, and the fact that INRIA CI passes for ppc. |
At the moment, it just pokes young_limit, without actually sending a Unix signal. (Signals are slow and don't work on Windows). We could easily change this to send an actual signal, perhaps only on those ports that cache young_limit. Benchmarking needed!
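To make "poking young_limit" concrete, here is a minimal sketch in C of an allocator that reloads the limit from memory on every allocation, so a store from another thread is seen at the next allocation. The names `domain_state`, `domain_init`, `interrupt_domain` and `alloc_reloading` are hypothetical, the struct is not the real runtime layout, and using `UINTPTR_MAX` as the poked value is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-domain state (not the real runtime layout). */
typedef struct {
  uintptr_t young_start;          /* lowest address of the minor heap */
  uintptr_t young_ptr;            /* allocation pointer, grows downwards */
  _Atomic uintptr_t young_limit;  /* slow path is taken when young_ptr < limit */
  int slow_path_hits;             /* how often the slow path was entered */
} domain_state;

void domain_init(domain_state *d, uintptr_t start, uintptr_t end) {
  assert(end > start);
  d->young_start = start;
  d->young_ptr = end;
  atomic_store(&d->young_limit, start);
  d->slow_path_hits = 0;
}

/* Called by another domain: no Unix signal, just a store that makes the
   target's next limit check fail (assumed encoding: the largest address). */
void interrupt_domain(domain_state *d) {
  atomic_store_explicit(&d->young_limit, UINTPTR_MAX, memory_order_release);
}

/* Allocation that loads young_limit from memory every time.
   Returns 1 on the fast path, 0 if the slow path was taken. */
int alloc_reloading(domain_state *d, size_t bytes) {
  assert(bytes > 0);
  uintptr_t limit =
      atomic_load_explicit(&d->young_limit, memory_order_acquire);
  uintptr_t p = d->young_ptr - bytes;
  if (p < limit) {
    /* Slow path: a real runtime would run a minor GC or handle the
       pending interrupt here; the sketch only counts it and resets. */
    d->slow_path_hits++;
    atomic_store(&d->young_limit, d->young_start);
    return 0;
  }
  d->young_ptr = p;
  return 1;
}
```

With this shape the extra load per allocation is the cost under discussion; a port that keeps the limit in a register skips the load and so never observes the store.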
As an equally unsupported guess: the load, compare and branch of the allocation sequence are off the critical path, and the branch is easily predicted, so on an out-of-order processor they won't slow anything down. (The code size of the allocation sequence seems important, though.) |
|
Perhaps another possibility on the architectures that cache the young limit: rely on the polling point insertion that is already going to be used to ensure that non-allocating sections can still be interrupted. |
|
That doesn't help here: if the polling points are checking a value cached in a register, we'd need to somehow update that value when the domain is interrupted. Only signal handlers give us a way to poke another thread's registers. (Well, that or ptrace nonsense). |
|
A possible compromise: allocation and polling points reload the allocation limit from the state into the dedicated register by default, but allocations that are dominated (in CFG parlance) by another allocation or polling point don't, because the dedicated register already contains a fresh enough value for the allocation limit. Not sure this is worth the extra complexity, though. |
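A rough C sketch of that compromise, under stated assumptions: `limit_reg` stands in for the dedicated register, `alloc_fresh` models an allocation or polling point that reloads it, and `alloc_with_reg` models a dominated allocation that trusts the register. All names are hypothetical and the slow path is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mutator state; limit_reg models the dedicated register. */
typedef struct {
  _Atomic uintptr_t young_limit;  /* in memory; other domains may poke it */
  uintptr_t young_start;          /* lowest address of the minor heap */
  uintptr_t young_ptr;            /* allocation pointer, grows downwards */
  uintptr_t limit_reg;            /* cached copy of young_limit */
  int interrupts_handled;
} mutator;

void mutator_init(mutator *m, uintptr_t start, uintptr_t end) {
  assert(end > start);
  atomic_store(&m->young_limit, start);
  m->young_start = start;
  m->young_ptr = end;
  m->limit_reg = start;
  m->interrupts_handled = 0;
}

/* Another domain requests an interrupt (assumed encoding: largest address). */
void poke(mutator *m) {
  atomic_store_explicit(&m->young_limit, UINTPTR_MAX, memory_order_release);
}

/* Dominated allocation: compares against the register only, so it can
   miss a poke that happened since the last reload. */
bool alloc_with_reg(mutator *m, uintptr_t bytes) {
  uintptr_t p = m->young_ptr - bytes;
  if (p < m->limit_reg) {
    /* Slow path: stands in for GC or interrupt handling. */
    m->interrupts_handled++;
    atomic_store(&m->young_limit, m->young_start);
    m->limit_reg = m->young_start;
    return false;
  }
  m->young_ptr = p;
  return true;
}

/* Non-dominated allocation or polling point: reload, then check. */
bool alloc_fresh(mutator *m, uintptr_t bytes) {
  m->limit_reg =
      atomic_load_explicit(&m->young_limit, memory_order_acquire);
  return alloc_with_reg(m, bytes);
}
```

In this sketch a poke is delayed by at most the run of dominated allocations between two reloads, which is the bound the polling point insertion would have to guarantee.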
|
What I meant was really Xavier's suggestion. The polling points insertion code gives us logic to ensure that you don't go unbounded time without hitting such a point -- we can use that logic to make sure to check the global often enough and otherwise use a register. |
The PowerPC ports (at least) ignore signals arriving during a @noalloc C call until the next non-@noalloc C call because they cache young_limit in a register. See this comment from signals.c:
This means that #9802 broke the signals_alloc test on this architecture, by marking a signal-raising function noalloc, and checking that signals were handled quickly afterwards.
cc @nojb