Skip to content

Reorder call to caml_reset_young_limit and tighten assertion#12875

Open
mshinwell wants to merge 1 commit intoocaml:trunkfrom
mshinwell:upstream-pr2204
Open

Reorder call to caml_reset_young_limit and tighten assertion#12875
mshinwell wants to merge 1 commit intoocaml:trunkfrom
mshinwell:upstream-pr2204

Conversation

@mshinwell
Copy link
Copy Markdown
Contributor

Here is a nice straightforward one for the New Year!

In this PR on the Flambda 2 repo, which is 5.1.1 plus a few backports, I describe a bug where this assertion fails:

void caml_reset_young_limit(caml_domain_state * dom_st)
{
  CAMLassert ((uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger);

Note that this branch does not contain #12742 which changed this assertion to be >=. Neither does it contain #12824 which made the same change after #12382 inadvertently reverted it to >.

I believe the bug happens because caml_domain_external_interrupt_hook, which can allocate e.g. if systhreads is used, is called before caml_reset_young_limit. Please see the first link above for more details. This PR aims to fix the bug on trunk in the same way.

Now, from what I understand from @jmid , #12742 was introduced to fix the above assertion failure which had also been seen (suspiciously!) only when running systhreads tests. Unfortunately I suspect this was the wrong diagnosis and it was actually the problem with the hook being seen. (The equalities between the various GC pointers aren't the same as in my first PR above at the point of failure, but I think it is very likely another form of the same problem.)

Let's suppose for a moment it isn't the hook at fault and examine the situation in #12742 again. For caml_poll_gc_work to set young_trigger = young_start itself and to have young_ptr = young_trigger at the time of the assertion, it seems like you need an allocation large enough such that it is performed when the trigger is still at the midpoint (otherwise caml_poll_gc_work would have forced a minor collection), but it fills the heap entirely (and exactly). (Call this Condition 1.) I don't believe that could happen with the parameters given on #12742: the maximum object size is 256 words + 1 header word, so that isn't capable of overflowing from the first half of the heap to exactly the very end, if there are 1024 words in the heap.

I discussed this with @jmid who reported today: "To further confirm, this morning I cloned a fresh 5.1.1, reverted #12742 and moved caml_reset_young_limit as in PR flambda-backend/2204 and reran the multicoretests test suite without any issues (well, technically there was an instance of a very long running test but that is an existing, known issue)".

Given this, and the fact that the strict > in the assertion has caught an obscure bug, I would prefer to see this assertion reverted to > on trunk. That is assuming my reasoning is correct, of course, which it may well not be.

One problem remains: what happens if the minor heap is so small that an allocation can produce Condition 1 above? I think this might mean that the assertion can be reached correctly with the equality holding. What I've done in this PR is to simply disallow this situation by increasing the minimum size of the minor heap. It is still very small (512 words) which seems fine, even for test cases that deliberately reduce the heap size -- but to my mind, it makes this extraordinarily subtle code rather easier to reason about.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jan 3, 2024

#12742 was reviewed by @gadmm who mentioned that the assertion could also be hit when using statmemprof (which changes the young limit). How does that factor into the current reasoning?

@mshinwell
Copy link
Copy Markdown
Contributor Author

Ah yes - I meant to mention statmemprof. I'm not sure what the situation is here - @NickBarnes should be able to advise.

@mshinwell
Copy link
Copy Markdown
Contributor Author

@damiendoligez your thoughts would be welcomed!

@mshinwell
Copy link
Copy Markdown
Contributor Author

#12879 also relates to the hook mentioned here, fwiw.

@NickBarnes
Copy link
Copy Markdown
Contributor

This looks right to me - certainly we should reset the young_limit before we call the hook, and plausibly either version of the assertion could fail if we don't.

I am less clear that we should rule out equality in the assertion, but I think that we only reset the limit at times when we know what young_trigger is (e.g. because we have just set it). Clearly we'd benefit overall from more documentation and checking of the invariants here. When in Caml, and no actions are pending, young_ptr >= young_limit (*). That is, you can allocate down to and including the word pointed to by the trigger, without triggering GC - equality is permitted. It's only when young_ptr becomes less than young_limit that we enter the runtime.

(*) More exactly, alloc_ptr >= trigger, where alloc_ptr is either young_ptr or the copy of it in a register in register-rich architectures, and trigger is young_trigger or memprof_young_trigger, copied into young_limit. When actions are pending, young_limit gets set to UINTNAT_MAX to force the next allocation test to enter the runtime.

gadmm added a commit to gadmm/ocaml that referenced this pull request Jan 5, 2024
- Fix two bugs reported by Mark Shinwell: ocaml#12879 and ocaml#12875
- Reasoning about multithreaded code: it is now safe to assume that no
  mutation happens on the same domain after allocating on the OCaml
  from C. (This reverts to OCaml 4 behaviour.)
@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jan 5, 2024

My interpretation of this issue and the one at #12879 is that one should definitely not call caml_thread_yield inside caml_poll_gc_work. I propose instead to move the code responsible for calling caml_thread_yield into caml_do_pending_actions_exn, in order to treat it as a delayable pending action (which is meant to run OCaml code). This should fix both bugs at once. I opened the PR #12882 to explore this approach.

Note that in OCaml 4, systhread preemption used to be implemented with signals, and thus delayed to a safe point, as in #12882. This also ensured some property for reasoning in FFI code about concurrent mutations.

Thanks for the digging up these issues.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jan 5, 2024

Clearly we'd benefit overall from more documentation and checking of the invariants here.

OK, the invariant for calling caml_reset_young_limit could be tightened, as currently it is meant to be callable anytime, while I am not sure we make use of this freedom. What is not clear to me is, how can we specify it in a way that it is clear that young_ptr != young_limit holds (with memprof in mind)?

Also, I am all for adding assertions that helped catch bugs in practice, but the bug is that caml_reset_young_limit was not called when it should have been, whereas checking young_ptr != young_limit seems to ensure the opposite, that caml_reset_young_limit is not called when it should not be. Is this the right assertion for this bug or could another assertion be more relevant?

Copy link
Copy Markdown
Contributor

@gadmm gadmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to remove what is related to the assertion tightening, to have the PR included quickly as a fix.

I am curious about continuing the discussion for clarifying invariants about when caml_reset_young_limit could/should be called.

@damiendoligez
Copy link
Copy Markdown
Member

I agree with everyone that moving the call to caml_reset_young_limit is a bug-fix. Instead of moving it up a few lines, I would even move it all the way up to right after the assignement to young_trigger because it's morally part of this assignment.

As for the assertion, I don't think you can tighten it safely. As hinted by @NickBarnes and @gadmm, I think that would interfere with statmemprof because an assignment to young_trigger is not the only possible reason for calling caml_reset_young_limit.

Consider the following scenario:

  • young_ptr is close to young_trigger and memprof_young_trigger is between the two. You have young_limit == memprof_young_trigger
  • an allocation of exactly the right size occurs and sets young_ptr == young_trigger. This allocation triggers caml_memprof_renew_minor_sample, which calls caml_reset_young_limit without changing young_trigger, so you still have young_ptr == young_trigger at the start of caml_reset_young_limit.

While I haven't looked at the new statmemprof code, the invariant here is really that young_ptr >= young_trigger and you shouldn't count too much on the calling context of caml_reset_young_limit for extra conditions.

gadmm added a commit to gadmm/ocaml that referenced this pull request Feb 9, 2024
- Fix two bugs reported by Mark Shinwell: ocaml#12879 and ocaml#12875
- Reasoning about multithreaded code: it is now safe to assume that no
  mutation happens on the same domain after allocating on the OCaml
  from C. (This reverts to OCaml 4 behaviour.)
@NickBarnes
Copy link
Copy Markdown
Contributor

I think @damiendoligez's analysis is right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants