Skip to content

Ensure that GC is not invoked from bounds check failures#11303

Merged
xavierleroy merged 5 commits intoocaml:trunkfrom
stedolan:fix-gc-in-bounds-check
Jul 27, 2022
Merged

Ensure that GC is not invoked from bounds check failures#11303
xavierleroy merged 5 commits intoocaml:trunkfrom
stedolan:fix-gc-in-bounds-check

Conversation

@stedolan
Copy link
Copy Markdown
Contributor

@stedolan stedolan commented Jun 7, 2022

Since 4.12, there's a subtle bug in how array/string/bigarrray bounds checks failures are handled, which can lead to corruption or segfaults in programs that catch the exceptions arising from bounds check failures.

Different invariants apply for exceptions from C versus OCaml. Exceptions raised from C are assumed to be at a safe point: all live values are registered on the stack, caml_local_roots is in a sensible state, and so on. In particular, the GC may run if asynchronous actions are pending. Exceptions raised from OCaml, by contrast, do not have these requirements. Instead, the stack is simply cut to the point of the exception handler (possibly after collecting a backtrace).

In order to make it possible to collect a backtrace, raise sites of OCaml exceptions have a frame table entry, but this frame table entry only contains debuginfo and has an invalid set of live registers. (This shouldn't matter when raising exceptions from OCaml, as this data is not used).

However, the current implementation of bounds check failures does not raise an exception from OCaml: instead, it switches to C and calls caml_raise. This function has stronger requirements, and in particular may invoke GC if actions are pending (see #9722). If this occurs, the GC can use the invalid frame table of the bounds check error. This can cause problems if there are in fact live values, particularly when an array bounds check is raised and caught in the same function. The attached testcase demonstrates this, going wrong on all versions of OCaml since 4.12.

The fix here is to skip the C-specific processing when raising a bounds check failure in caml_array_bound_error. (Other potential fixes are to update the assembly stub to avoid ever entering C in this case, or to generate correct frame table entries detailing the live registers for all bounds check points. This change seemed simplest)


include systhreads
* hassysthreads
** not-windows
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this test relies on triggering GC via signal handlers, and I didn't think that trick worked on Windows.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If testing non-systhreads and/or windows is desired, it looks like this bug can be triggered using multiple domains (see #11425).

Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying this and a test.

Only one question, in switching to caml_raise_exception I see we're no longer resetting Caml_state->local_roots. Is there's a scenario involving caml_callback where that corrupts the list?

@xavierleroy
Copy link
Copy Markdown
Contributor

Is there's a scenario involving caml_callback where that corrupts the list?

Right, local_roots must be cleaned up. Also, the test

  if (exception_pointer == NULL) caml_fatal_uncaught_exception(v);

must be performed! Otherwise an out-of-bound array access without any try...with will crash the program instead of being reported.

@xavierleroy
Copy link
Copy Markdown
Contributor

I took the liberty to push a more prudent version directly on this branch. It adds a local caml_raise_internal function that can be instructed to skip the check for pending actions. Feel free to discard.

@stedolan
Copy link
Copy Markdown
Contributor Author

Thanks for pointing out the issue with this patch, but I think a slightly different fix would be simpler. The issue is that the requirements for raising an exception are very different when the exception is raised from C vs. from OCaml. In OCaml:

  • there must not be a poll before the raise, since the compiler does not populate frametables around raises
  • there is always an exception handler, since one is installed by caml_start_program
  • there is no need to clean up local_roots, since there are no C frames before the nearest exception handler

while when an exception is raised from C:

  • there may be a poll before the raise, since there are always populated frametables around C calls that may raise
  • there may not be an active exception handler
  • local_roots must be cleaned up before the raise

The problem is that caml_array_bound_error is sometimes called from OCaml (failed bounds checks call it) and sometimes called from C (it is exposed in caml/fail.h). The trunk implementation uses the C rules, which are incorrect for OCaml code and cause memory corruption.

My initial patch used the OCaml rules instead, by directly calling caml_raise_exception and skipping the poll / active handler check / local roots logic. While this is correct for bounds check failures in OCaml, I hadn't realised that caml_array_bound_error may also be called from C code.

The current version of this patch (thanks to @xavierleroy) manages to, I think, implement the safe common subset of the OCaml and C requirements, by not polling, checking for a handler, and optionally cleaning up local_roots. However, I think a simpler approach would be to separate the two uses, leaving caml_array_bound_error as it is (callable by C code), and introducing a new caml_ml_array_bound_error which is only invoked from ocamlopt-compiled code, not exposed in a C header, and works the way my original patch did.

(I'm about to go on vacation for a week-and-a-bit. I'd hoped to write and push another version of this patch beforehand, but oh well)

stedolan and others added 4 commits July 11, 2022 15:15
It's no longer called from a signal handler.
Even for bounds check failures, we still need to check for missing
exception handler and to adjust the local roots chain.
Divide the caml_array_bound_error function into separate C and OCaml versions.
@stedolan stedolan force-pushed the fix-gc-in-bounds-check branch from e02a3e3 to 044b591 Compare July 11, 2022 14:17
@stedolan
Copy link
Copy Markdown
Contributor Author

I've just rebased and pushed a new version of the code. This version keeps the original behaviour of caml_array_bound_error when called by C, but directly jumps to the OCaml exception handler when a bounds check failure is raised in OCaml code.

@ctk21
Copy link
Copy Markdown
Contributor

ctk21 commented Jul 12, 2022

I've taken a look. I think the proposed solution handles the problem where a GC is triggered off the bounds exception in a straight forward way.

However I'm not convinced we can ignore the frame table construction in the case where backtraces are stashed on some platforms. On ARM64 MacOS, the native compiler gives a repeating (until the buffer is filled) backtrace for the following test:

(* TEST
   flags = "-g"
   ocamlrunparam += ",b=1"
*)
let xs = [| 0; 1; 2 |]

let bad_bound_fn x =
  !x + xs.(100)

let _ =
  try
    ignore (Sys.opaque_identity (bad_bound_fn (ref 0)));
  with exn ->
    Printf.printf "Uncaught exception %s\n" (Printexc.to_string exn);
    Printexc.print_backtrace stdout;

  print_endline "OK"

The issue is that the backtrace tries to walk the stack but on ARM64 (as it stands) we are not able to do this. x86 seems (at first glance) to not give garbage in this situation.

This backtrace problem might not be for this PR to fix. This PR doesn't make the backtrace situation worse but does fix segfaults and hangs as things stand.

@xavierleroy
Copy link
Copy Markdown
Contributor

However I'm not convinced we can ignore the frame table construction in the case where backtraces are stashed on some platforms

I don't think this PR does this. In the current code base, ocamlopt -g generates a frame descriptor for each call to caml_ml_array_bound_error, precisely so that backtraces can be produced. This PR does not change this, does it?

@stedolan
Copy link
Copy Markdown
Contributor Author

The issue is that the backtrace tries to walk the stack but on ARM64 (as it stands) we are not able to do this. x86 seems (at first glance) to not give garbage in this situation.

That sounds like perhaps an independent bug in the ARM64 backend? A frame table entry is created and it should have a correct stack frame length and debug info, which is all you need to construct the backtrace. (What's missing is that it does not have a valid set of live registers, so you can't do GC at this point if the handler is in the same function)

@ctk21
Copy link
Copy Markdown
Contributor

ctk21 commented Jul 13, 2022

I don't think this PR does this. In the current code base, ocamlopt -g generates a frame descriptor for each call to caml_ml_array_bound_error, precisely so that backtraces can be produced. This PR does not change this, does it?

You've convinced me that the problem is probably in the ARM64 backend with -g and orthogonal to this PR. I'll do a bit more digging to try and track it down.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 13, 2022

My understanding of the current status:

  • @sadiqj and @xavierleroy have taken a look at a previous version
  • @stedolan rewrote the PR after feedback from @xavierleroy, but the new version has not been reviewed yet (in particular, not approved)

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I agree with @dra27 that the test should also work under Windows, because it doesn't rely on OS signals, just on preemption between threads, right?

Comment on lines -1027 to -1028
ENTER_FUNCTION
LEA_VAR(caml_array_bound_error, %rax)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone wonders: ENTER_FUNCTION maintains %rbp as frame pointer in "with frame pointers" mode, and does nothing otherwise. I guess it would have to be matched by a LEAVE_FUNCTION just before tail-calling caml_c_call. @stedolan chose to omit the ENTER_FUNCTION / LEAVE_FUNCTION dance entirely, which is correct too and probably just as good in practice.

(If I understand correctly, the only case where it might make a difference is in frame-pointer mode, if a profiling interrupt hits while executing the LEA_VAR instruction, in which case the %rbp-based backtrace will not be accurate. Nothing to worry about.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, ENTER_FUNCTION seems required to get a correct backtrace in frame-pointer mode.

Co-authored-by: David Allsopp <david.allsopp@metastack.com>
@xavierleroy xavierleroy merged commit 9c60432 into ocaml:trunk Jul 27, 2022
xavierleroy added a commit that referenced this pull request Jul 27, 2022
* Divide the caml_array_bound_error function into separate C and OCaml versions.

The OCaml version, called from the assembly stubs, does not check for pending asynchronous actions.

* Added test testsuite/tests/lib-systhreads/boundscheck.ml

* caml_raise no longer needs a CAMLno_asan annotation

It's no longer called from a signal handler.

Co-authored-by: Xavier Leroy <xavier.leroy@college-de-france.fr>
(cherry picked from commit 9c60432)
@xavierleroy
Copy link
Copy Markdown
Contributor

Merged in trunk and cherry-picked to 5.0

@xavierleroy
Copy link
Copy Markdown
Contributor

FYI: this PR breaks frame-pointer support. The following tests produce no backtraces when the out-of-bound exception is raised.

    tests/backtrace/'backtrace_slots.ml' with 1 (native) 
    tests/backtrace/'backtrace_bounds_exn.ml' with 1 (native) 
    tests/backtrace/'raw_backtrace.ml' with 1 (native) 
    tests/backtrace/'backtrace.ml' with 1 (native) 
    tests/backtrace/'backtrace3.ml' with 1 (native) 
    tests/backtrace/'backtrace_deprecated.ml' with 1 (native) 
    tests/backtrace/'backtrace2.ml' with 1 (native) 

(CC: @fabbing)

@xavierleroy
Copy link
Copy Markdown
Contributor

Re-adding the ENTER_FUNCTION that was removed in this PR seems to fix the issue in frame-pointers mode and has no ill effects otherwise, so I'll push this fix straight to trunk.

@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented Jul 28, 2022

Yes, you do need this ENTER_FUNCTION because caml_raise_exn https://github.com/stedolan/ocaml/blob/14a1d20cd00e5ff5b0b2985be12fbd55b0ba9507/runtime/amd64.S#L777 expects to have to jump over rbp to get the saved rip.
Without this instruction, the pc given to caml_stash_backtrace is wrong and it prevents us from finding the function frame_descr.

If anyone wonders: ENTER_FUNCTION maintains %rbp as frame pointer in "with frame pointers" mode, and does nothing otherwise. I guess it would have to be matched by a LEAVE_FUNCTION just before tail-calling caml_c_call. @stedolan chose to omit the ENTER_FUNCTION / LEAVE_FUNCTION dance entirely, which is correct too and probably just as good in practice.

(If I understand correctly, the only case where it might make a difference is in frame-pointer mode, if a profiling interrupt hits while executing the LEA_VAR instruction, in which case the %rbp-based backtrace will not be accurate. Nothing to worry about.)

LEAVE_FUNCTION isn't required here because caml_ml_array_bound_error tail calls into caml_c_call (after its own ENTER_FUNCION).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants