Skip to content

Simplifications/cleanups/clarifications from multicore review#10966

Merged
sadiqj merged 7 commits intoocaml:trunkfrom
gadmm:multicore_review_no_changes
Feb 10, 2022
Merged

Simplifications/cleanups/clarifications from multicore review#10966
sadiqj merged 7 commits intoocaml:trunkfrom
gadmm:multicore_review_no_changes

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented Jan 28, 2022

This PR is best read commit-by-commit. I grouped them together because they are all minor changes coming from my review of the multicore PR and are expected to introduce no behaviour change (and I think will interest @kayceesrk and @sadiqj).

  • As discussed, document caml_modify. Confusion stems from the fact that the PLDI paper describes a publication scheme based on the old concurrent collector design, so we document the difference with the paper. (As a side note, have the benchmarks from the paper been re-run with the release store instead of a relaxed store? If there is an impact then one could try optimizations based on the write barrier.)
  • Remove some dead code.
  • Make some currently-public APIs private.
  • Simplify/restore the APIs for signals/actions. Regarding the public API this is necessary work for 5.0, it prepares trunk for a 5.0 release in case Meta-issue on asynchronous action handling in multicore #10915 does not move fast enough. The rest is preliminary work to solving the issue.

I believe that this PR will not need too many discussions, but if you want me to split any part of it to a different PR do not hesitate to request it.

Each commit has a detailed commit log that gives more details.

@gadmm gadmm changed the title Multicore review no changes Simplifications/cleanups/clarifications from multicore review Jan 28, 2022
Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the attention you gave to the runtime system code. I read commit by commit. The first 5 look obviously correct, and the last 2 are a bit above my grade but I trust you that they do not change behavior yet. So, I'm approving this PR but would prefer to have a second review.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jan 30, 2022

I've started reading through and I can finish the review of by mid-week.

One thing I am not sure about is whether it's the right thing to reintroduce caml_process_pending_actions_exn. While the reimplementation complies with the _exn bit, it doesn't actually process pending actions - only signals. If there is code that has long-running C code that relies on that as a poll, it will compile but may perform terribly if there are multiple domains running.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

I would tend to agree with you regarding caml_process_pending_action_exn and reintroduce it in the API only commented-out. But polling for signals is barely just what is needed to avoid a big temporary rework of the Coq VM. (The other regression that affects the Coq VM is allocation calling finalisers right away, but they seem to avoid it by not using finalisers at all.) Right @silene?

However I will document the limitation.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

The 6th and 7th commit are best reviewed by comparing to pre-multicore trunk. It mainly reintroduces what was already there. Only for runtime/weak.c in the 7th I had to get creative, so you might want to proofread it twice.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jan 31, 2022

I would tend to agree with you regarding caml_process_pending_action_exn and reintroduce it in the API only commented-out. But polling for signals is barely just what is needed to avoid a big temporary rework of the Coq VM.

So my understanding from rocq-prover/rocq#15464 (comment) and the discussion afterward is that switching to caml_process_pending_signals_exn (which is effectively what actions would now do) is sufficient.

@silene
Copy link
Copy Markdown
Contributor

silene commented Jan 31, 2022

Right, there are no meaningful finalizers in Coq.

Basically, what Coq needs is an extremely cheap way of testing that a signal (namely sigint) is pending. By extremely cheap, I mean that, ideally, it should be nothing more than reading a single word in memory. If the test is positive, then we ask the runtime to trigger the resolution of the signal. The only thing we need is that, whatever the runtime performs at this point, if an exception is raised (and for sigint, it is), we have a way to intercept the exception to cleanup after ourselves before reraising the exception.

Whether we detect only pending signals or other pending events, we do not care, as long as the test is cheap and has no false positive. And if the runtime decides to perform a lot more work than just resolving the pending signal, that is also fine with us.

@silene
Copy link
Copy Markdown
Contributor

silene commented Jan 31, 2022

And just to be complete, handling signals is just one of the issues. The other is what to do with Alloc_small. Indeed, we pass as the last macro argument some C code that calls caml_process_pending_actions. Again, we do not care what the runtime performs at this point and how long it takes. We just need some way to catch any exception that might be raised.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

Right. One point is that caml_process_pending_signals_exn is internal (but Coq VM already uses internals) and we can avoid having one more preprocessor switch in the Coq VM. It's as you prefer.

Coq VM will indeed need to be changed to call Alloc_small with { caml_handle_gc_interrupt(); } as last parameter. It might be forward-compatible for OCaml 5. (caml_handle_gc_interrupt_no_async_exc is renamed into caml_handle_gc_interupt in this PR, although it currently calls finalisers in both versions; later it should not).

So there's maybe the option of not requiring more changes to the CoqVM later.

It's worth testing the performance impact of polling in the new version. caml_check_pending_actions is broken currently, do not use it. caml_process_pending_*_exn are already guarded by caml_check_for_pending_signals which is much more efficient than it used to be. caml_process_pending_actions_* is always meant to be fast when no actions are pending.

@silene
Copy link
Copy Markdown
Contributor

silene commented Jan 31, 2022

caml_check_for_pending_signals is much more efficient than it used to be

Perhaps, but it is still incredibly slower than the approach available in OCaml 4. In the old world, we could just check the value of caml_something_to_do. Now we have to execute the following function:

  atomic_thread_fence(memory_order_acquire);
  for (i = 0; i < NSIG_WORDS; i++) {
    if (atomic_load_explicit(&caml_pending_signals[i], memory_order_relaxed))
      return 1;
  }

By the way, was the performance of OCaml bytecode interpreter tested? The old one was checking caml_something_to_do several million times per second. Now that caml_check_for_pending_signals is called instead, I expect performance to be disastrous. I just tried on a simple loop computing Fibonacci. With OCaml trunk, the bytecode interpreter is 600% slower than OCaml 4.13. In fact, 75% of the time is actually wasted inside caml_check_for_pending_signals, according to perf.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

Believe it or not, this function amounts to a test of a single machine word and is compiled as such! See #10880. How recent is your benchmark?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

I am interested in feeback regarding the perf of caml_pending_actions*, later. (OCaml ≥ 4.10 has the same issue.) If needed we can declare it as inline in some way.

@xavierleroy
Copy link
Copy Markdown
Contributor

I just tried on a simple loop computing Fibonacci. With OCaml trunk, the bytecode interpreter is 600% slower than OCaml 4.13. In fact, 75% of the time is actually wasted inside caml_check_for_pending_signals, according to perf.

That was the case a while ago, before #10880 was merged. Performance should be much better now, but can probably be improved some more by just testing young_ptr against young_limit.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

just testing young_ptr against young_limit

This would not work currently.

@xavierleroy
Copy link
Copy Markdown
Contributor

This would not work currently.

Couldn't we make it work? It's good enough for native code, so it should be doable in bytecode too.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 31, 2022

I might have misread your comment. For @silene it is probably not enough to test young_limit because it is reset by allocations. Eventually for OCaml's bytecode interpreter the goal is to just have this test, yes. But if the issue mentioned at #10880 (comment) has the same origin then this is fixed on the way to solving #10915 (edit: looking at it again I do not see an indication that the failing test depends on #10915; I do not mean to discourage looking at the issue independently).

@silene
Copy link
Copy Markdown
Contributor

silene commented Jan 31, 2022

How recent is your benchmark?

Indeed, my mistake. After recompiling OCaml trunk, it is not as slow, but it is still 70% slower than OCaml 4.08 and 4.13. That said, it does not seem to be due to caml_check_for_pending_signals, which is now down to 4% of the total time.

For the record, in the loop version of Fibonacci, time is spent inside caml_modify, while in the recursive version, time is spent inside caml_ensure_stack_capacity. But, while the time spent there is significant, it does not entirely explain the slowdown.

Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @gadmm, thanks for going through these. Other than the review comment on the diff, I'm still uneasy about having a caml_process_pending_actions_exn that doesn't actually process pending actions, only signals. That said the only real user we can find of the API is probably aware of the problem and we intend to fix it eventually.

if (atomic_load_acq(&Thread_main_lock.waiters) == 0) return Val_unit;

caml_process_pending_signals();
caml_process_pending_actions();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this matches the behaviour in 4.x. If we look at https://github.com/ocaml/ocaml/blob/4.14/otherlibs/systhreads/st_stubs.c#L734-L739 we explicitly only deal with signals in caml_thread_yield.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think my reasoning was that in theory with new polling locations, any signal handler could execute other actions and so the distinction is no longer relevant. But the reasoning was probably flawed and in any case not for this PR which is meant to be side-effect-free.

Some documentation comment erroneously disappeared in multicore, which encouraged to keep consistency with caml_enter/leave_blocking_section, which was useful information in this case. I restored the comment.

In light of this comment, the line Current_thread = st_tls_get(Thread_key); was added in caml_thread_enter_blocking_section but not in caml_thread_yield, so there is a now a discrepancy, please have a look.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good point. @Engil do you happen to know how we ended up with this? trunk seems to update the current thread when entering a blocking section:

Current_thread = st_tls_get(Thread_key);
but this doesn't seem to make sense? I think you must be the current thread in order to call caml_thread_enter_blocking_section.

(4.14 doesn't do this: https://github.com/ocaml/ocaml/blob/4.14/otherlibs/systhreads/st_stubs.c#L230 )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an oversight and is not required.
Weird interactions with backup threads or else do not seem likely at all (since a thread going through caml_thread_enter_blocking_section is the thread actually holding the lock and the thread that has been executing code.).
I think it is safe to remove this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I'll make a PR doing so.

@gadmm gadmm force-pushed the multicore_review_no_changes branch from 3a78d18 to 10edc37 Compare February 8, 2022 23:52
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Feb 8, 2022

I have to leave your remaining remark for another time.

gadmm added 4 commits February 9, 2022 22:23
No change entry needed.
Remove caml_huge_fallback_count and FIXME comment. It never worked
(has always been equal to zero) so it has no potential users, and a
grep of opam sources returns no hit. There is little risk in removing
it, and 5.0 is a good time for this.

No change entry needed.
`caml/misc.h`: This header is very commonly included and I think it is
not intended to make these functions and macros public.

`caml/platform.h`: The whole header is made a CAML_INTERNALS. It
contains non-prefixed identifiers and no function of public interest

No change entry needed.
…c_interrupt

Eventually we will only need the one that does not run finalisers,
once it is implemented.

No change entry needed.
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Feb 9, 2022

  • I have removed the commit about documenting caml_modify. Stephen is preparing us a much better explanation.
  • I have somehow fixed the implementation of caml_check_pending_actions, which happens to fix a bug in io.c (edit: hasty conclusion).
  • I have removed caml_process_pending_actions_exn for now. I have left it commented-out in the API for documentation purposes. @silene, please use the internal caml_process_pending_signals_exn in the meanwhile if OCaml 5.0 is released without this function.

Please review the last three commits (the others have not changed) and then I'll clean up the history.

@gadmm gadmm force-pushed the multicore_review_no_changes branch from 10edc37 to 44dd90e Compare February 9, 2022 21:46
Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now, thanks @gadmm

Let me know when the history is cleaned up and I'll merge.

- Restore old public API and comments.
- Document current limitations, caml_process_pending_actions_exn is
  uninmplemented for now.
- Move pending_actions back to signals.c
- Reintroduce caml_process_pending_actions_with_root for subsequent commit
- Remove superfluous functions
- Try to restore discipline of _removing_ internal functions that
  directly raise when possible (caml_process_pending_signals).

This is very preliminary work to reintroduce control over asynchronous
exceptions. The first two points are desirable for 5.0.

No behaviour change expected.

No change entry needed.
Polling locations were tightened for memprof. They partially
disappeared during multicore's rebase on top of 4.12 since multicore
did not support memprof or the new polling machinery. In addition, it
was inconsistent whether these changes and/or the comments relating
them to memprof were kept or removed.

These changes will be easy to miss when working to restore memprof
unless someone goes through the diff of the multicore PR. Since I have
already done it, this will make sure that nothing is accidentally
lost.

This commit is meant to be behaviour-preserving.

No change entry needed.
Shorten the name caml_check_for_pending_signals for consistency
@gadmm gadmm force-pushed the multicore_review_no_changes branch from 44dd90e to 0a372b2 Compare February 10, 2022 11:51
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Feb 10, 2022

It is clean now, thanks.

@sadiqj sadiqj merged commit 2beaad2 into ocaml:trunk Feb 10, 2022
@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Feb 10, 2022

Brilliant, thanks for your work on this @gadmm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants