Fix stack overflow detection with systhreads by stedolan · Pull Request #8670 · ocaml/ocaml

stedolan · 2019-05-14T15:18:35Z

Stack overflow detection doesn't work properly with systhreads, as shown by #7490. A fix was proposed a while ago in #1062, but stalled. This PR implements a simpler approach.

Instead of trying to work out the stack dimensions by querying OS interfaces, this patch defines a stack overflow as a SIGSEGV that happens on an address that's between the top of the stack and the current stack pointer (or slightly past the stack pointer, to handle some edge cases around function prologues).

There are a couple of changes:

calls to getrlimit are removed from the signal handler, inspecting the stack pointer instead.
all systhreads now reinitialise the sigaltstack (but not the signal handler) on startup. ~~POSIX says you should have to do the former but not the latter, on Linux it seems you need both~~ (this is nonsense, I was testing it badly. Reinitialising sigaltstack and leaving the handler alone suffices). All threads share the same sigaltstack - they won't collide unless there's a segfault in C code, in which case all bets are already off.
the stack probe limit on amd64 was lowered from 32k to 4k. Linux adds a 4k guard page at the end of thread stacks to catch overflow, but beyond that is free space. It is quite common (i.e. happened regularly in testing) for that space to be used by some other allocation. So, a stack probe of 32k from the current stack pointer has false positives: it crosses the guard page and touches some random bit of memory. A 4k probe cannot have false positives: it either lies in the stack or in the guard page.
caml_get_exception_raw_backtrace no longer allocates an 8K array on the stack. (This is an independent change, in a seperate commit. Including it here makes it safe to call Printexc.print_backtrace from a Stack_overflow handler).

This has so far only been tested on Linux amd64, but I've added CONTEXT_SP support for other platforms. Hopefully it will work everywhere?

stedolan · 2019-05-14T15:19:10Z

cc @mshinwell, @xavierleroy

runtime/signals_nat.c

xavierleroy

Nice to see this work restart. A couple of questions below.

runtime/signals_nat.c

runtime/backtrace.c

gasche · 2019-05-15T08:57:13Z

846ce83 (the change to runtime/backtrace.c to use caml_alloc_shr instead of caml_alloc, and get rid of a large on-stack buffer) looks like a very nice change on its own -- depending on how fast you think the present PR is going to get a decision, it may make sense to send it as a separate pre-PR.

let-def · 2019-05-15T12:12:29Z

@gasche @stedolan: Doesn't this risk to put a lot of short-lived values in the major heap? I can see this happening when get_raw_backtrace is used in conjunction with reraise_with_backtrace.

gasche · 2019-05-15T12:59:50Z

When I first thought about the change, my idea was that callstacks are typically (1) fairly large (often larger than Max_young_wosize?) and (2) likely to get stored later for auditing, and thus long-lived, or to come just before the program errors out, and thus we don't care.

(I had another paragraph here which I guess was confused about when backtraces are generated by the compiler when using exceptions for control-flow, and people are not using raise_notrace. I need to remind myself of how that actually works or let someone else comment on that part -- it could be indeed that this use-case generates a lot of backtraces and discards most of them immediately, but maybe not.)

stedolan · 2019-05-15T14:18:36Z

The situation with exceptions are used for control flow, where backtraces are unnecessarily generated and then discarded, is indeed annoyingly common. The importance of performance in this situation is why caml_stash_backtrace still has a static global buffer.

This change affects only Printexc.get_raw_backtrace (and a few other functions in Printexc), which use a seperate codepath from caml_stash_backtrace. I don't think this is used anywhere performance-sensitive.

gasche · 2019-05-15T14:26:57Z

Will the backtraces collected by statmemprof be affected? (Those are expected to follow the lifetime of the memory block they track, so we can expect that many of them will be short-lived.)

stedolan · 2019-05-15T14:47:48Z

No, statmemprof does not use this mechanism to collect callstacks. Statmemprof is interested in the current callstack, so it walks the stack to collect it. These functions concern the backtrace of the most recent exception, which is stored in a global buffer.

stedolan · 2019-05-15T15:51:12Z

I changed this patch to remove the sharing of sigaltstacks. The reasoning why this was OK was based on considering only SIGSEGV; there are other signals and they share a stack.

xavierleroy · 2019-05-15T17:13:53Z

This change affects only Printexc.get_raw_backtrace (and a few other functions in Printexc), which use a seperate codepath from caml_stash_backtrace.

This is true.

I don't think this is used anywhere performance-sensitive.

This is perhaps less true these days. For example, Fun.protect calls Printexc.get_raw_backtrace if an exception occurs. I would still hope that Fun.protect is not used in conjunction with functions that raise often, but who knows...

gasche · 2019-05-16T08:35:47Z

I looked for usage of Misc.try_finally (a parent of Fun.protect)
in the compiler codebase, I have the impression that all
use-cases concern blocks where the exception path is indeed
exceptional. It is also used at a fairly coarse
granularity (typically, the cleanup action is to remove
a temporary file), so I think additional costs would be
amortized.

I haven't seen a use of Fun.protect around exceptions used for
control-flow, but that seems entirely possible to me. (For example,
a SAT solver where the "restart" action, every N decision steps,
is implemented by raising an exception, may use Fun.protect to unwind
the decision stack.) I think it would make sense to also offer a
Fun.protect_notrace function, cousin of raise_notrace, that does
not capture the call stack and uses a reraise_notrace.

Let me know if you like the idea and I'll shoot a separate PR or issue report.

stedolan · 2019-05-21T13:12:34Z

I think it would make sense to also offer a
Fun.protect_notrace function, cousin of raise_notrace, that does
not capture the call stack and uses a reraise_notrace.

Currently (both before and after this patch), if raise_notrace is used, then Printexc.get_raw_backtrace is very fast (it does not allocate at all, since there is no backtrace). So I don't think this function is necessary. Effectively, Fun.protect does a reraise_notrace if the original exception came from raise_notrace, which I think is the right behaviour.

gasche · 2019-05-21T17:15:23Z

Ah, indeed. Sorry for the internal confusedness.

gadmm · 2019-05-23T21:42:42Z

I looked for usage of Misc.try_finally (a parent of Fun.protect)
in the compiler codebase, I have the impression that all
use-cases concern blocks where the exception path is indeed
exceptional. It is also used at a fairly coarse
granularity (typically, the cleanup action is to remove
a temporary file), so I think additional costs would be
amortized.

In the following review of uses of try_finally in the compiler: #2118 (comment), I have found several cases where it is used for expressing temporary mutation, where the finally just rolls back some state. This is an idiomatic use of unwind-protect, and is a use-case that can occur in algorithmically-intensive situations (e.g. if exceptions are used for control flow). It seems incorrect to me to assume in general that resources are few and disposal is expensive (amortizes the cost of recording the stack trace).

I haven't seen a use of Fun.protect around exceptions used for
control-flow, but that seems entirely possible to me. (For example,
a SAT solver where the "restart" action, every N decision steps,
is implemented by raising an exception, may use Fun.protect to unwind
the decision stack.)

Yes, this is a valid use-case, actually I have found 5 of them in the aforementioned audit. Control flow is the intended usage of Fun.protect, when distinguished from bracket-style resource management that would do masking for async exceptions. This use-case was in fact the reason put forward in favour of including Fun.protect with its current type, whatever the fate of asynchronous exceptions would be.

Out of curiosity, how expensive is it to call Printexc.get_raw_backtrace immediately after Printexc.raise_with_backtrace, for instance in the case where the user has nested Fun.protects?

stedolan · 2019-08-27T11:28:27Z

This discussion got side-tracked by some questions about whether allocating backtraces on the major heap would have a noticeable performance impact.

I've just rebased this patch to trunk. On trunk, since #8691, plain caml_alloc should not trigger callbacks. So, the new version of this patch just calls caml_alloc (which allocates on the minor or major heap as appropriate).

However, when rebasing this I noticed a bug in #8691, in that it does not always prevent caml_alloc from triggering callbacks. This bug is fixed in #8897, so we shouldn't merge this patch until after #8897.

caml_alloc is now used allocate backtraces, which is safe now that caml_alloc can't call async callbacks.

stedolan · 2019-09-16T12:41:15Z

#8897 is merged, and this branch has just passed precheck (300), so I think this is ready to merge.

gadmm · 2019-09-24T21:41:23Z

The second commit about removing the stack-allocated buffer looks good to me. (I would happily help with a review of the first commit but the technical details still fly over my head.)

runtime/backtrace.c

Changes

stedolan · 2019-10-11T07:31:07Z

I'd quite like to get this into 4.10. Are there any remaining issues?

xavierleroy

I had a second look at the code and at the discussion. I agree it's good to go.

stedolan · 2019-10-11T08:58:09Z

Thanks!

gadmm · 2019-10-11T17:37:24Z

I hope it's the right place to report it: I have a bunch of errors like the following one when I do make depend inside the runtime directory:

gcc -MM -D_FILE_OFFSET_BITS=64 -D_REENTRANT -DCAML_NAME_SPACE   *.c | \
  sed -e 's/\([^.]*\)\.o/\1_b.$(O)/' > .depend
signals_nat.c:192:2: error: #error "CONTEXT_SP is required if HAS_STACK_OVERFLOW_DETECTION is defined"
 #error "CONTEXT_SP is required if HAS_STACK_OVERFLOW_DETECTION is defined"
  ^~~~~

Since the new stack overflow detection system (ocaml#8670), ASAN was confused. We fix the issue by resetting the 'use_sigaltstack' flag and by disabling ASAN on functions used by the stack overflow handler.

Since the new stack overflow detection system (#8670), ASAN was confused. We fix the issue by telling ASAN to not use a sigaltstack, and by disabling ASAN on functions used by the stack overflow handler.

Since the new stack overflow detection system (#8670), ASAN was confused. We fix the issue by telling ASAN to not use a sigaltstack, and by disabling ASAN on functions used by the stack overflow handler. (cherry picked from commit 5a53560)

@sadiqj

To avoid allocating on the stack, keep the runtime lock so that OCaml objects are not moved around. This is needed because thread stack are very small on musl<1.1.21 (80k). Allocating a buffer of 64k bytes on the stack means that the stack overflow detection (`caml_c_call`) won't have enough space to allocate 32k and will segfault: see ocaml/ocaml#7490 for a similar bug report. `caml_c_call` has been properly fixed in behttps://github.com/ocaml/ocaml/pull/8670 and it also has been changed to only allocate 4k -- also solving our initial issue with too small thread stacks (as 64+8<80). Many thanks to @sadiqj for helping with debugging the issue. Co-authored-by: Frédéric Bour <fred@tarides.com>

@sadiqj

To avoid allocating on the stack, keep the runtime lock so that OCaml objects are not moved around. This is needed because thread stack are very small on musl<1.1.21 (80k). Allocating a buffer of 64k bytes on the stack means that the stack overflow detection (`caml_c_call`) won't have enough space to allocate 32k and will segfault: see ocaml/ocaml#7490 for a similar bug report. `caml_c_call` has been properly fixed in behttps://github.com/ocaml/ocaml/pull/8670 and it also has been changed to only allocate 4k -- also solving our initial issue with too small thread stacks (as 64+8<80). Many thanks to @sadiqj for helping with debugging the issue. Co-authored-by: Frédéric Bour <fred@tarides.com>

Stack overflow detection is fixed in 4.10 at ocaml#8670.

Stack overflow detection is fixed in 4.10 at ocaml#8670. No changes needed.

avsm reviewed May 14, 2019

View reviewed changes

runtime/signals_nat.c Show resolved Hide resolved

xavierleroy reviewed May 14, 2019

View reviewed changes

runtime/signals_nat.c Outdated Show resolved Hide resolved

runtime/backtrace.c Outdated Show resolved Hide resolved

stedolan force-pushed the stack-overflow-fix branch from 74efb3b to e1d6305 Compare May 15, 2019 15:37

stedolan changed the title ~~[WIP] Fix stack overflow detection with systhreads~~ Fix stack overflow detection with systhreads May 15, 2019

stedolan mentioned this pull request May 22, 2019

Unmarshalling code is not reentrant #8674

Closed

stedolan force-pushed the stack-overflow-fix branch from 2c8f4f5 to 5b05557 Compare August 27, 2019 11:23

stedolan force-pushed the stack-overflow-fix branch 2 times, most recently from 30da9b1 to 104ce5e Compare August 27, 2019 13:56

stedolan added 2 commits August 27, 2019 14:58

Fix stack overflow detection with systhreads

8b20b69

Remove excessive stack usage in caml_get_exception_raw_backtrace

f3c650b

caml_alloc is now used allocate backtraces, which is safe now that caml_alloc can't call async callbacks.

stedolan force-pushed the stack-overflow-fix branch from 104ce5e to f3c650b Compare August 27, 2019 13:58

gasche mentioned this pull request Sep 24, 2019

Do not raise in Printexc.get_raw_backtrace #8964

Closed

gadmm reviewed Sep 28, 2019

View reviewed changes

runtime/backtrace.c Show resolved Hide resolved

gadmm reviewed Sep 28, 2019

View reviewed changes

Changes Show resolved Hide resolved

xavierleroy approved these changes Oct 11, 2019

View reviewed changes

xavierleroy merged commit d29a01e into ocaml:trunk Oct 11, 2019

jhjourdan mentioned this pull request Oct 18, 2019

Fix 'extra-check' INRIA CI #9053

Merged

This was referenced Oct 20, 2019

Fix stack overflow detection (MPR#7490) #1062

Closed

Segfault upon stack overflow in child thread. #7490

Closed

samoht mentioned this pull request Nov 20, 2019

Do not allocate 64k bytes buffers on the thread stack mirage/index#132

Merged

gadmm added a commit to gadmm/ocaml that referenced this pull request Jan 28, 2020

Documentation: Stack_overflow is reliable, Out_of_memory is not.

79844ab

Stack overflow detection is fixed in 4.10 at ocaml#8670.

gadmm mentioned this pull request Jan 28, 2020

Documentation: Stack_overflow is reliable, Out_of_memory is not. #9270

Merged

gadmm added a commit to gadmm/ocaml that referenced this pull request Jan 28, 2020

Documentation: Stack_overflow is reliable, Out_of_memory is not.

432fd9a

Stack overflow detection is fixed in 4.10 at ocaml#8670. No changes needed.

xavierleroy mentioned this pull request May 8, 2020

wrong stack_size when on another thread #7527

Closed

This was referenced Oct 25, 2021

Backport glibc 2.34 fix to 3.07-4.12 ocaml/opam-repository#19855

Merged

Dynamically allocate the alternate signal stack on 4.09 and earlier #10725

Closed

Conversation

stedolan commented May 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stedolan commented May 14, 2019

Uh oh!

Uh oh!

xavierleroy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gasche commented May 15, 2019

Uh oh!

let-def commented May 15, 2019

Uh oh!

gasche commented May 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stedolan commented May 15, 2019

Uh oh!

gasche commented May 15, 2019

Uh oh!

stedolan commented May 15, 2019

Uh oh!

stedolan commented May 15, 2019

Uh oh!

xavierleroy commented May 15, 2019

Uh oh!

gasche commented May 16, 2019

Uh oh!

stedolan commented May 21, 2019

Uh oh!

gasche commented May 21, 2019

Uh oh!

gadmm commented May 23, 2019

Uh oh!

stedolan commented Aug 27, 2019

Uh oh!

stedolan commented Sep 16, 2019

Uh oh!

gadmm commented Sep 24, 2019

Uh oh!

Uh oh!

Uh oh!

stedolan commented Oct 11, 2019

Uh oh!

xavierleroy left a comment

Choose a reason for hiding this comment

Uh oh!

stedolan commented Oct 11, 2019

Uh oh!

gadmm commented Oct 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

stedolan commented May 14, 2019 •

edited

Loading

gasche commented May 15, 2019 •

edited

Loading