Backtrace on fatal signals.#855
Conversation
This commit adds signal handlers into main.cc, or any smaller scope as desired. For any fatal signals a backtrace is logged and then the signal is re-raised to continue with the default behavior. The backtrace logging and script are modified slightly to support all objects found and not make assumptions about the first stack frame being in the binary we care about. This is needed for reasonable output in the case of abort() or any fatal signal like SEGV that crops up inside a library function we call.
htuch
left a comment
There was a problem hiding this comment.
I really like the use of RAII to capture signal handler install/removal :)
source/exe/signal_action.h
Outdated
| * | ||
| * The signal handler must run on an alternative stack so that we can do the | ||
| * stack unwind on the original stack. Memory is allocated for this purpose when | ||
| * this object is constructed. When this object goes out of scope the memory is |
There was a problem hiding this comment.
Typo: "the memory belonging to the alternate stack"
source/exe/signal_action.h
Outdated
| * handler existing before InstallSigHandlers(). | ||
| */ | ||
| void RemoveSigHandlers() const; | ||
| std::unique_ptr<char[]> altstack_; |
There was a problem hiding this comment.
I'd suggest that this probably isn't the best way to allocate a stack for a few reasons:
- The stack won't be aligned. Many architectures (including ARMv8) expect the stack pointer to have an alignment, e.g. 64-bit.
- Ideally we would have a guard page after the stack to catch overflow.
I think (1) can be fixed fairly easily with an aligned attribute (just pick some sufficiently wide alignment).
I worry a bit about (2), since it might become a security hazard one day if an attacker figures out how to trigger a signal handler and get it to behave badly on the stack, walking over it into arbitrary heap. The typical way to add some defense here is to add a guard page, either via a linker script. Adding a linker script might take you into Bazel hell though...
You could allocate the memory with mmap and mprotect a guard page. That might be the simplest way.
So, what to do here might depend on how strongly we believe the signal handlers will behave well and are sufficiently hardened to overflow attacks.
There was a problem hiding this comment.
I think mmap of two pages, with the second page mprotect'd as a guard page is a great solution assuming there aren't any technical blockers to doing it that way.
There was a problem hiding this comment.
From the man page it is implementation defined which direction the stack grows within the memory block provided. To fit within this it may be necessary to have a guard page on either side of the memory block provided to sigaltstack.
source/exe/signal_action.cc
Outdated
| // x86_64 | ||
| error_pc = reinterpret_cast<void*>(ucontext->uc_mcontext.gregs[REG_RIP]); | ||
| #elif defined(REG_EIP) | ||
| // x86 Classic - not tested |
There was a problem hiding this comment.
Not a fan of untested code - can we just put these in comments and #warning?
There was a problem hiding this comment.
Agree, I will do that until/unless we have a way of testing them.
source/exe/signal_action.cc
Outdated
| stack.ss_size = ALTSTACK_SIZE; | ||
| stack.ss_flags = 0; | ||
|
|
||
| if (sigaltstack(&stack, nullptr) < 0) { |
There was a problem hiding this comment.
Prefer RELEASE_ASSERT per #854 for now. If sigaltstack fails, something terrible has happened.
source/exe/signal_action.cc
Outdated
| stack.ss_size = ALTSTACK_SIZE; | ||
| stack.ss_flags = 0; | ||
|
|
||
| if (sigaltstack(&stack, nullptr) < 0) { |
There was a problem hiding this comment.
You might want to zero the stack before setting it here. Best to not have random heap stuff sitting on a stack for security reasons.
There was a problem hiding this comment.
Looks like mmap with MAP_ANONYMOUS will map a zero'd page, so doing that will address this as well.
There was a problem hiding this comment.
Yeah. There's also MAP_STACK | MAP_GROWSDOWN that might be handy.
|
|
||
| void SignalAction::RemoveSigHandlers() const { | ||
| for (const auto& sig : FATAL_SIGS) { | ||
| signal(sig, SIG_DFL); |
There was a problem hiding this comment.
Since you're doing RAII, how about remembering the oldact returned by sigaction above and restoring it? Not that it matters much here, but there's something nice about the symmetry there and RAII.
There was a problem hiding this comment.
Sure, I wanted to avoid the unneeded complexity if this feature isn't needed but it isn't too difficult to implement.
There was a problem hiding this comment.
I decided to skip this unneeded complexity for now, it isn't needed by any use case in Envoy today. I can do it later if such a use case pops up.
There was a problem hiding this comment.
Sure. There's some value in it for tests, but I don't want to make this overkill.
source/exe/signal_action.cc
Outdated
| sigemptyset(&saction.sa_mask); | ||
| saction.sa_flags = (SA_SIGINFO | SA_ONSTACK | SA_RESETHAND | SA_NODEFER); | ||
| saction.sa_sigaction = SigHandler; | ||
| if (sigaction(sig, &saction, nullptr) < 0) { |
|
Quick drive by comment: can we put even the installation of the extra signals behind some build flag? I don't want to have them installed in our release builds. |
|
I think that's definitely possible, I'll check with @htuch about the proper way to do it. I was under the impression your production environment had an alternate main.cc which is why I used that as the integration point. |
|
We don't have an alternate main right now. We can make one, but IMO it would be easier to just support a define of some type that allows this feature to be turned off and on, and we can default to on for debug, maybe default to on for release, but have a way to disable if desired. |
source/exe/signal_action.cc
Outdated
|
|
||
| void SignalAction::InstallSigHandlers() { | ||
| stack_t stack; | ||
| stack.ss_sp = altstack_.get(); |
There was a problem hiding this comment.
Hang on a sec, how did this work?! :) Stacks grow down, so I think your stack was walking over the heap. Thanks for adding the guard page - when computing the initial SP, take the top address and subtract 8 bytes - that will give 64-bit alignment.
There was a problem hiding this comment.
No, I believe from the man page this is the correct way to interact with the implementation:
" On most hardware architectures supported by Linux, stacks grow
downward. sigaltstack() automatically takes account of the direction
of stack growth."
" When a signal handler is invoked on the alternate stack, the kernel
automatically aligns the address given in ss.ss_sp to a suitable
address boundary for the underlying hardware architecture."
|
(ASAN test fails because there is an invalid memory access - on purpose! - but the ASAN output when it dies is different from what my test expects. I'll investigate how to address this tomorrow) |
|
We could probably sneak the |
Switched to using mmap with mprotect()ed guard pages for the alternate signal stack. Added test code for guarded stack. Fail harder if we fail normal POSIX commands sigaction, mmap, etc. Default signal capturing behavior can be inhibited with "--define=signal_trace=disabled" build option Inhibited some tests under ASAN where the ASAN signal grabbing behavior render the test pointless and failing. tools/stack_decode.py now properly passes the return code back so test don't magically pass when --run_under Docs++
Now with fix_format actually run...
mattklein123
left a comment
There was a problem hiding this comment.
looks good from quick pass, just some style comments.
source/exe/signal_action.cc
Outdated
| #ifdef REG_RIP | ||
| // x86_64 | ||
| error_pc = reinterpret_cast<void*>(ucontext->uc_mcontext.gregs[REG_RIP]); | ||
| #elif defined(REG_EIP) |
There was a problem hiding this comment.
nit: Can we just drop all of the elif clauses and have a single warning without the commented out code?
There was a problem hiding this comment.
I'd like to retain the examples as documentation, I believe they are correct but I have no easy way to test them at this time. I'll lump it into one comment with an #else to keep it compact.
source/exe/signal_action.cc
Outdated
| stack.ss_size = ALTSTACK_SIZE; // ... guard page at the other | ||
| stack.ss_flags = 0; | ||
|
|
||
| if (sigaltstack(&stack, nullptr) < 0) { |
There was a problem hiding this comment.
nit: Just drop the if statement here and just RELEASE_ASSERT the call directly, same for all cases below. You don't need any of the error output. Line # is enough.
source/exe/signal_action.cc
Outdated
| } | ||
|
|
||
| void SignalAction::UnmapStackMemory() { | ||
| if (altstack_ != nullptr) { |
There was a problem hiding this comment.
Can this ever happen with crashing?
There was a problem hiding this comment.
Since we release assert if the mmap doesn't succeed, I believe the answer is no. I can remove the check. Inspecting the munmap man page it says "It is not an error if the indicated range does not contain any mapped pages" and although addr must be page aligned 0x0 does actually obey this restriction so it seems to be a safe no-op anyway.
source/exe/signal_action.h
Outdated
| MapAndProtectStackMemory(); | ||
| InstallSigHandlers(); | ||
| } | ||
| SignalAction(const SignalAction&) = delete; |
There was a problem hiding this comment.
nit: You can derive from NonCopyable if you want.
source/exe/signal_action.h
Outdated
| /** | ||
| * Helpers for testing guarded stack memory | ||
| */ | ||
| void DoGoodAccessForTest(); |
There was a problem hiding this comment.
nit: style guide says methods start with lower case. Same for all methods here, and also in the backtrace file. Can you fix that also?
There was a problem hiding this comment.
I will fix this, looking at Google-styled code clearly warped by brain because I was carefully capitalizing the first letters.
source/exe/signal_action.h
Outdated
| /** | ||
| * Use this many bytes for the alternate signal handling stack. | ||
| * | ||
| * This should be a multiple of 4096. |
There was a problem hiding this comment.
nit: 4096 is arch dependent. It doesn't really matter but might want to put a note that is the case. Might want to at least hide behind constant.
There was a problem hiding this comment.
Programmatically it is hidden behind GUARD_SIZE - I'll update the comment.
There was a problem hiding this comment.
mmap man page says you can use sysconf(_SC_PAGE_SIZE) to get this.
source/server/backtrace.h
Outdated
| * | ||
| * @param address The stack trace will begin from this address. | ||
| */ | ||
| void CaptureFrom(void* address) { |
There was a problem hiding this comment.
In practice, do we use any of the log in prod stuff anymore? Can we just kill it?
There was a problem hiding this comment.
Is it OK if backtraces are always logged at critical level? The point of the log_in_prod uglyness was to work around the fact that we don't actually emit logs below that level in optimized builds, and this feature should still be useful for investigating issues in optimized builds. In practice with the signal handler either something has gone very wrong, or, a developer has manually added a statement to assist in debugging. In both cases critical seems fine to me.
There was a problem hiding this comment.
Yup I would just use critical to simplify the code.
There was a problem hiding this comment.
Awesome, I will add that into this PR.
tools/stack_decode.py
Outdated
| piped_input = "" | ||
| obj_name = address_list.obj_file | ||
| # end of stack sentinel | ||
| if obj_name == "[0xffffffffffffffff]": |
There was a problem hiding this comment.
Is this from Backward? This seems somewhat architecture specific, can you generalize?
There was a problem hiding this comment.
If I make the other change below to stop before the last item then this can be removed too.
| #ifndef ASANITIZED | ||
| TEST(Signals, InvalidAddressDeathTest) { | ||
| SignalAction actions; | ||
| EXPECT_DEATH([]() -> void { |
There was a problem hiding this comment.
You could make these tests much stronger by implementing the behavior we discussed, where you call the original signal handler after dumping the stack trace and installing a signal handler before the definitions of actions. That way you could verify that the stack trace dump signal handler didn't crash itself, and possibly that it ran if you set some global state.
Anyway, it's probably overkill for this kind of feature (you already have done a lot of test stuff), but just an FYI.
There was a problem hiding this comment.
That's an interesting idea. It still isn't compatible with the ASAN issue, which seems to inhibit normal signal handling entirely, but it would make the test more robust. I might implement the test as a "fun project" in a separate PR when I have some hacking time but if you're OK with the current state of the test I'm good with merging it as it stands.
There was a problem hiding this comment.
Sure, whatever you reckon is best, I don't think we need this to merge.
source/server/backtrace.h
Outdated
| for (unsigned int i = 0; i < stack_trace_.size(); ++i) { | ||
| // Backtrace gets tagged by ASAN when we try the object name resolution for the last | ||
| // frame on stack, so skip the last one. It has no useful info anyway. | ||
| if (i + 1 != stack_trace_.size()) { |
There was a problem hiding this comment.
Could you write for (unsigned int i = 0; i < stack_trace.size() - 1; ++i) { above and skip this conditional?
There was a problem hiding this comment.
I'm a little worried that this is arch specific, and Backwards won't always return this sentinel stack frame as the last item. But, it should always be safe to skip the last item, the risk is that we don't include some relevant stack frame. I'll switch to stopping at size-1 which should guarantee no ASAN issues like I saw, at the possible risk of losing the bottom of stack which we probably don't care about anyway.
source/exe/signal_action.h
Outdated
| /** | ||
| * Use this many bytes for the alternate signal handling stack. | ||
| * | ||
| * This should be a multiple of 4096. |
There was a problem hiding this comment.
mmap man page says you can use sysconf(_SC_PAGE_SIZE) to get this.
source/exe/signal_action.cc
Outdated
|
|
||
| void SignalAction::DoGoodAccessForTest() { | ||
| for (size_t i = 0; i < ALTSTACK_SIZE; ++i) { | ||
| *(altstack_ + GUARD_SIZE + i) = 42; |
There was a problem hiding this comment.
These should be volatile pointers ideally and to prevent the compiler getting too smart.
There was a problem hiding this comment.
I think even the smartest compile cannot assume that there are no side effects from writing into mmap'd memory even if it is technically anonymously mapped... but volatilized it just to be sure.
There was a problem hiding this comment.
Thanks for fixing this. As an aside, it's my understanding that a compiler isn't required to keep track of the fact that a pointer is from mmapped memory, it's not possible in general (e.g. at module boundaries) since the C/C++ type system doesn't track this. As a result, it's allowed to treat any non-volatile pointer as a pointer to normal memory and subject to arbitrary optimization, without worrying about other observers and side-effects.
|
|
||
| void SignalAction::RemoveSigHandlers() const { | ||
| for (const auto& sig : FATAL_SIGS) { | ||
| signal(sig, SIG_DFL); |
There was a problem hiding this comment.
Sure. There's some value in it for tests, but I don't want to make this overkill.
Used sysconf to get the runtime system page size for guard page sizes and made altstack a multiple of that size. Backtrace changed to make all logging critical, skip sentinel frame completely. Error handling changes
htuch
left a comment
There was a problem hiding this comment.
LGTM modulo remaining comments. Looking forward to having stack traces on debug builds! :)
| location where a fatal signal occurred. The signal handler will re-raise the | ||
| fatal signal with the default handler so a core file will still be dumped after | ||
| the stack trace is logged. To inhibit this behavior use | ||
| `--define=signal_trace=disabled` on the Bazel command line. No signal handlers will |
There was a problem hiding this comment.
You might want to set this in the tools/bazel.rc for the ASAN runs. I'm not sure where ASAN installs its signal handlers (based on your experiences with the tests, it seems after this), but this might be safer in general to avoid conflict.
There was a problem hiding this comment.
This should be done now
source/exe/BUILD
Outdated
| deps = [ | ||
| ":envoy_common_lib", | ||
| ":hot_restart_lib", | ||
| ":sigaction_lib", |
There was a problem hiding this comment.
Can you use a select statement here and condition on //bazel:disable_signal_trace to avoid the link dependency on backward.hpp when we do the google import?
| * The memory will be protected from read and write. | ||
| */ | ||
| const size_t guard_size_; | ||
| /** |
There was a problem hiding this comment.
Minor nit: we don't need Doxygen documentation for internal implementation variables, you can use regular C++ comments.
There was a problem hiding this comment.
Is there any reason NOT to use Doxygen formatted comments here? Doxygen can still produce useful documentation for private member vars depending on how it is configured to run. I've found this information to be useful in past projects.
There was a problem hiding this comment.
None other than inconsistency - if we don't have this documented as a convention across the code base in our style guide, then any Doxygen documentation on internal implementation details is a bit special snowflake. For example, unless we do have widespread agreement to use Doxygen for all documents, I plan on continuing to use // comments for internal comments.
source/exe/signal_action.h
Outdated
| * Signal handlers will be reset to the default, NOT back to any signal | ||
| * handler existing before InstallSigHandlers(). | ||
| */ | ||
| void removeSigHandlers() const; |
There was a problem hiding this comment.
It seems technically correct to use const here, since no member variables are modified, but it's somewhat strange to the reader as an annotation, as this method does mutate related global state (the process signal handlers). What do you think of the arguments for logical vs. physical constness of a method at https://isocpp.org/wiki/faq/const-correctness#logical-vs-physical-state? I'm thinking the signal handler table is in some sense being treated as part of the logical state of the object here.
There was a problem hiding this comment.
In theory the compiler could benefit from awareness that the method doesn't modify object state but in this particular instance I see no way that there is any practical impact. And I agree that the method is not "logically" const, and there is benefit to not marking it as such. Since any performance benefit from informing the compiler of immutable state is negligible in this case then I'm OK with removing the const.
mattklein123
left a comment
There was a problem hiding this comment.
Looks great to me with small nit. Thanks agreed this will be very nice in debug builds.
source/exe/signal_action.h
Outdated
| mapAndProtectStackMemory(); | ||
| installSigHandlers(); | ||
| } | ||
| SignalAction(const SignalAction&) = delete; |
There was a problem hiding this comment.
Nit: If you derive from NonCopyable you can delete this. It's redundant.
There was a problem hiding this comment.
D'oh, meant to include that in the change. Thanks.
Make inclusion of sigaction_lib dependency conditional on command line arguments NonCopyable means no need to delete default cctor. Consider signal handler removal logically non-const.
**Description** Follow up on #852 Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
This commit adds signal handlers into main.cc, or any smaller scope as
desired. For any fatal signals a backtrace is logged and then the
signal is re-raised to continue with the default behavior.
The backtrace logging and script are modified slightly to support all
objects found and not make assumptions about the first stack frame being
in the binary we care about. This is needed for reasonable output in the
case of abort() or any fatal signal like SEGV that crops up inside a
library function we call.
Fixes #487