Skip to content

Fix signals_alloc test on PPC#9814

Merged
nojb merged 1 commit intoocaml:trunkfrom
stedolan:fix-signals-alloc-test
Aug 2, 2020
Merged

Fix signals_alloc test on PPC#9814
nojb merged 1 commit intoocaml:trunkfrom
stedolan:fix-signals-alloc-test

Conversation

@stedolan
Copy link
Copy Markdown
Contributor

The PowerPC ports (at least) ignore signals arriving during a @noalloc C call until the next non-@noalloc C call because it caches young_limit in a register. See this comment from signals.c:

  /* When this function is called without [caml_c_call] (e.g., in
     [caml_modify]), this is only moderately effective on ports that cache
     [Caml_state->young_limit] in a register, so it may take a while before the
     register is reloaded from [Caml_state->young_limit]. */
  Caml_state->young_limit = Caml_state->young_alloc_end;

This means that #9802 broke the signals_alloc test on this architecture, by marking a signal-raising function noalloc, and checking that signals were handled quickly afterwards.

cc @nojb

@stedolan
Copy link
Copy Markdown
Contributor Author

The signal caching behaviour is pretty suspicious for multicore as well, which relies on being able to interrupt another domain by poking young_limit. I wonder whether the new efficient access to runtime variables via Caml_state makes this caching unnecessary?

@stedolan
Copy link
Copy Markdown
Contributor Author

(precheck 474 now running)

@nojb
Copy link
Copy Markdown
Contributor

nojb commented Jul 30, 2020

Thanks! Indeed, arm64 also caches the allocation pointer in a register, and so does riscv (where I imitated arm64).

@xavierleroy
Copy link
Copy Markdown
Contributor

Note that handle_signal in runtime/signals_nat.c updates the register that caches young_limit when it is safe to do so, i.e. when running OCaml compiled code. So, the problem is only with signals that arrive (or are generated) while running C code.

The signal caching behaviour is pretty suspicious for multicore as well, which relies on being able to interrupt another domain by poking young_limit.

That's something to discuss. I thought it was using... signals!

I wonder whether the new efficient access to runtime variables via Caml_state makes this caching unnecessary?

On RISC processors the caching makes the allocation sequence one instruction shorter, besides saving one load, of course. This is good for performance. I guess some benchmarking is in order.

@xavierleroy
Copy link
Copy Markdown
Contributor

Indeed, arm64 also caches the allocation pointer in a register, and so does riscv (where I imitated arm64).

When the processor gives you 32 or more integer registers, I think it's worth (performance wise) dedicating one register to the allocation limit. That's a judgment call, not supported by actual benchmarks.

Copy link
Copy Markdown
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted, I should have noticed that in #9802. This looks good to me.

@nojb nojb closed this Jul 31, 2020
@nojb nojb reopened this Jul 31, 2020
@nojb
Copy link
Copy Markdown
Contributor

nojb commented Jul 31, 2020

@avsm
Copy link
Copy Markdown
Member

avsm commented Aug 2, 2020

Can confirm this also fixes the signals_alloc failure on arm64 for me, and it didn't occur on arm32 before either (since that doesn't have a spare register for the allocation pointer).

@nojb
Copy link
Copy Markdown
Contributor

nojb commented Aug 2, 2020

There seems to be a problem with INRIA CI's arm machines, but I would like to go ahead and merge this, based on PR review, confirmation by @avsm, and the fact that INRIA CI passes for ppc.

@nojb nojb merged commit 0bf255c into ocaml:trunk Aug 2, 2020
@stedolan
Copy link
Copy Markdown
Contributor Author

stedolan commented Aug 3, 2020

The signal caching behaviour is pretty suspicious for multicore as well, which relies on being able to interrupt another domain by poking young_limit.

That's something to discuss. I thought it was using... signals!

At the moment, it just pokes young_limit, without actually sending a Unix signal. (Signals are slow and don't work on Windows). We could easily change this to send an actual signal, perhaps only on those ports that cache young_limit. Benchmarking needed!

I wonder whether the new efficient access to runtime variables via Caml_state makes this caching unnecessary?

On RISC processors the caching makes the allocation sequence one instruction shorter, besides saving one load, of course. This is good for performance. I guess some benchmarking is in order.

When the processor gives you 32 or more integer registers, I think it's worth (performance wise) dedicating one register to the allocation limit. That's a judgment call, not supported by actual benchmarks.

As an equally unsupported guess: the load, compare and branch of the allocation sequence are off the critical path, and the branch is easily predicted, so on an out-of-order processor they won't slow anything down. (The code size of the allocation sequence seems important, though)

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Aug 3, 2020

Perhaps another possibility on the architectures that cache the young limit: rely on the polling point insertion that is already going to be used to ensure that non-allocating sections can still be interupted.

@stedolan
Copy link
Copy Markdown
Contributor Author

stedolan commented Aug 3, 2020

That doesn't help here: if the polling points are checking a value cached in a register, we'd need to somehow update that value when the domain is interrupted. Only signal handlers give us a way to poke another thread's registers. (Well, that or ptrace nonsense).

@xavierleroy
Copy link
Copy Markdown
Contributor

A possible compromise: allocation and polling points reload the allocation limit from the state into the dedicated register by default, but allocations that are dominated (in CFG parlance) by another allocation or polling point don't, because the dedicated register already contains a fresh enough value for the allocation limit. Not sure this is worth the extra complexity, though.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Aug 3, 2020

What I meant was really Xavier's suggestion. The polling points insertion code gives us logic to ensure that you don't go unbounded time without hitting such a point -- we can use that logic to make sure to check the global often enough and otherwise use a register.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants