Skip to content

Backtrace on fatal signals.#855

Merged
mattklein123 merged 8 commits intoenvoyproxy:masterfrom
dnoe:sigaction
May 1, 2017
Merged

Backtrace on fatal signals.#855
mattklein123 merged 8 commits intoenvoyproxy:masterfrom
dnoe:sigaction

Conversation

@dnoe
Copy link
Copy Markdown
Contributor

@dnoe dnoe commented Apr 27, 2017

This commit adds signal handlers into main.cc, or any smaller scope as
desired. For any fatal signals a backtrace is logged and then the
signal is re-raised to continue with the default behavior.

The backtrace logging and script are modified slightly to support all
objects found and not make assumptions about the first stack frame being
in the binary we care about. This is needed for reasonable output in the
case of abort() or any fatal signal like SEGV that crops up inside a
library function we call.

Fixes #487

This commit adds signal handlers into main.cc, or any smaller scope as
desired.  For any fatal signals a backtrace is logged and then the
signal is re-raised to continue with the default behavior.

The backtrace logging and script are modified slightly to support all
objects found and not make assumptions about the first stack frame being
in the binary we care about. This is needed for reasonable output in the
case of abort() or any fatal signal like SEGV that crops up inside a
library function we call.
@dnoe
Copy link
Copy Markdown
Contributor Author

dnoe commented Apr 27, 2017

@htuch

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the use of RAII to capture signal handler install/removal :)

*
* The signal handler must run on an alternative stack so that we can do the
* stack unwind on the original stack. Memory is allocated for this purpose when
* this object is constructed. When this object goes out of scope the memory is
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "the memory belonging to the alternate stack"

* handler existing before InstallSigHandlers().
*/
void RemoveSigHandlers() const;
std::unique_ptr<char[]> altstack_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest that this probably isn't the best way to allocate a stack for a few reasons:

  1. The stack won't be aligned. Many architectures (including ARMv8) expect the stack pointer to have an alignment, e.g. 64-bit.
  2. Ideally we would have a guard page after the stack to catch overflow.

I think (1) can be fixed fairly easily with an aligned attribute (just pick some sufficiently wide alignment).

I worry a bit about (2), since it might become a security hazard one day if an attacker figures out how to trigger a signal handler and get it to behave badly on the stack, walking over it into arbitrary heap. The typical way to add some defense here is to add a guard page, either via a linker script. Adding a linker script might take you into Bazel hell though...

You could allocate the memory with mmap and mprotect a guard page. That might be the simplest way.

So, what to do here might depend on how strongly we believe the signal handlers will behave well and are sufficiently hardened to overflow attacks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mmap of two pages, with the second page mprotect'd as a guard page is a great solution assuming there aren't any technical blockers to doing it that way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the man page it is implementation defined which direction the stack grows within the memory block provided. To fit within this it may be necessary to have a guard page on either side of the memory block provided to sigaltstack.

// x86_64
error_pc = reinterpret_cast<void*>(ucontext->uc_mcontext.gregs[REG_RIP]);
#elif defined(REG_EIP)
// x86 Classic - not tested
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of untested code - can we just put these in comments and #warning?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I will do that until/unless we have a way of testing them.

stack.ss_size = ALTSTACK_SIZE;
stack.ss_flags = 0;

if (sigaltstack(&stack, nullptr) < 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer RELEASE_ASSERT per #854 for now. If sigaltstack fails, something terrible has happened.

stack.ss_size = ALTSTACK_SIZE;
stack.ss_flags = 0;

if (sigaltstack(&stack, nullptr) < 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to zero the stack before setting it here. Best to not have random heap stuff sitting on a stack for security reasons.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like mmap with MAP_ANONYMOUS will map a zero'd page, so doing that will address this as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. There's also MAP_STACK | MAP_GROWSDOWN that might be handy.


void SignalAction::RemoveSigHandlers() const {
for (const auto& sig : FATAL_SIGS) {
signal(sig, SIG_DFL);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're doing RAII, how about remembering the oldact returned by sigaction above and restoring it? Not that it matters much here, but there's something nice about the symmetry there and RAII.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I wanted to avoid the unneeded complexity if this feature isn't needed but it isn't too difficult to implement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to skip this unneeded complexity for now, it isn't needed by any use case in Envoy today. I can do it later if such a use case pops up.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. There's some value in it for tests, but I don't want to make this overkill.

sigemptyset(&saction.sa_mask);
saction.sa_flags = (SA_SIGINFO | SA_ONSTACK | SA_RESETHAND | SA_NODEFER);
saction.sa_sigaction = SigHandler;
if (sigaction(sig, &saction, nullptr) < 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with RELEASE_ASSERT

@mattklein123
Copy link
Copy Markdown
Member

Quick drive by comment: can we put even the installation of the extra signals behind some build flag? I don't want to have them installed in our release builds.

@dnoe
Copy link
Copy Markdown
Contributor Author

dnoe commented Apr 27, 2017

I think that's definitely possible, I'll check with @htuch about the proper way to do it. I was under the impression your production environment had an alternate main.cc which is why I used that as the integration point.

@mattklein123
Copy link
Copy Markdown
Member

We don't have an alternate main right now. We can make one, but IMO it would be easier to just support a define of some type that allows this feature to be turned off and on, and we can default to on for debug, maybe default to on for release, but have a way to disable if desired.


void SignalAction::InstallSigHandlers() {
stack_t stack;
stack.ss_sp = altstack_.get();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on a sec, how did this work?! :) Stacks grow down, so I think your stack was walking over the heap. Thanks for adding the guard page - when computing the initial SP, take the top address and subtract 8 bytes - that will give 64-bit alignment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I believe from the man page this is the correct way to interact with the implementation:

" On most hardware architectures supported by Linux, stacks grow
downward. sigaltstack() automatically takes account of the direction
of stack growth."

" When a signal handler is invoked on the alternate stack, the kernel
automatically aligns the address given in ss.ss_sp to a suitable
address boundary for the underlying hardware architecture."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, great, that's reassuring :)

@dnoe
Copy link
Copy Markdown
Contributor Author

dnoe commented Apr 27, 2017

(ASAN test fails because there is an invalid memory access - on purpose! - but the ASAN output when it dies is different from what my test expects. I'll investigate how to address this tomorrow)

@htuch
Copy link
Copy Markdown
Member

htuch commented Apr 27, 2017

We could probably sneak the main.cc shim into this PR, it's just a simple file reorg and on point.

dnoe added 2 commits April 28, 2017 14:10
Switched to using mmap with mprotect()ed guard pages for the alternate
signal stack.  Added test code for guarded stack.

Fail harder if we fail normal POSIX commands sigaction, mmap, etc.

Default signal capturing behavior can be inhibited with
"--define=signal_trace=disabled" build option

Inhibited some tests under ASAN where the ASAN signal grabbing behavior
render the test pointless and failing.

tools/stack_decode.py now properly passes the return code back so test
don't magically pass when --run_under

Docs++
Now with fix_format actually run...
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good from quick pass, just some style comments.

#ifdef REG_RIP
// x86_64
error_pc = reinterpret_cast<void*>(ucontext->uc_mcontext.gregs[REG_RIP]);
#elif defined(REG_EIP)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we just drop all of the elif clauses and have a single warning without the commented out code?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to retain the examples as documentation, I believe they are correct but I have no easy way to test them at this time. I'll lump it into one comment with an #else to keep it compact.

stack.ss_size = ALTSTACK_SIZE; // ... guard page at the other
stack.ss_flags = 0;

if (sigaltstack(&stack, nullptr) < 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Just drop the if statement here and just RELEASE_ASSERT the call directly, same for all cases below. You don't need any of the error output. Line # is enough.

}

void SignalAction::UnmapStackMemory() {
if (altstack_ != nullptr) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever happen with crashing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we release assert if the mmap doesn't succeed, I believe the answer is no. I can remove the check. Inspecting the munmap man page it says "It is not an error if the indicated range does not contain any mapped pages" and although addr must be page aligned 0x0 does actually obey this restriction so it seems to be a safe no-op anyway.

MapAndProtectStackMemory();
InstallSigHandlers();
}
SignalAction(const SignalAction&) = delete;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can derive from NonCopyable if you want.

/**
* Helpers for testing guarded stack memory
*/
void DoGoodAccessForTest();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: style guide says methods start with lower case. Same for all methods here, and also in the backtrace file. Can you fix that also?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this, looking at Google-styled code clearly warped by brain because I was carefully capitalizing the first letters.

/**
* Use this many bytes for the alternate signal handling stack.
*
* This should be a multiple of 4096.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 4096 is arch dependent. It doesn't really matter but might want to put a note that is the case. Might want to at least hide behind constant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Programmatically it is hidden behind GUARD_SIZE - I'll update the comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmap man page says you can use sysconf(_SC_PAGE_SIZE) to get this.

*
* @param address The stack trace will begin from this address.
*/
void CaptureFrom(void* address) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, do we use any of the log in prod stuff anymore? Can we just kill it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK if backtraces are always logged at critical level? The point of the log_in_prod uglyness was to work around the fact that we don't actually emit logs below that level in optimized builds, and this feature should still be useful for investigating issues in optimized builds. In practice with the signal handler either something has gone very wrong, or, a developer has manually added a statement to assist in debugging. In both cases critical seems fine to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup I would just use critical to simplify the code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, I will add that into this PR.

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the comments below.

piped_input = ""
obj_name = address_list.obj_file
# end of stack sentinel
if obj_name == "[0xffffffffffffffff]":
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this from Backward? This seems somewhat architecture specific, can you generalize?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I make the other change below to stop before the last item then this can be removed too.

#ifndef ASANITIZED
TEST(Signals, InvalidAddressDeathTest) {
SignalAction actions;
EXPECT_DEATH([]() -> void {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make these tests much stronger by implementing the behavior we discussed, where you call the original signal handler after dumping the stack trace and installing a signal handler before the definitions of actions. That way you could verify that the stack trace dump signal handler didn't crash itself, and possibly that it ran if you set some global state.

Anyway, it's probably overkill for this kind of feature (you already have done a lot of test stuff), but just an FYI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. It still isn't compatible with the ASAN issue, which seems to inhibit normal signal handling entirely, but it would make the test more robust. I might implement the test as a "fun project" in a separate PR when I have some hacking time but if you're OK with the current state of the test I'm good with merging it as it stands.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, whatever you reckon is best, I don't think we need this to merge.

for (unsigned int i = 0; i < stack_trace_.size(); ++i) {
// Backtrace gets tagged by ASAN when we try the object name resolution for the last
// frame on stack, so skip the last one. It has no useful info anyway.
if (i + 1 != stack_trace_.size()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write for (unsigned int i = 0; i < stack_trace.size() - 1; ++i) { above and skip this conditional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried that this is arch specific, and Backwards won't always return this sentinel stack frame as the last item. But, it should always be safe to skip the last item, the risk is that we don't include some relevant stack frame. I'll switch to stopping at size-1 which should guarantee no ASAN issues like I saw, at the possible risk of losing the bottom of stack which we probably don't care about anyway.

/**
* Use this many bytes for the alternate signal handling stack.
*
* This should be a multiple of 4096.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmap man page says you can use sysconf(_SC_PAGE_SIZE) to get this.


void SignalAction::DoGoodAccessForTest() {
for (size_t i = 0; i < ALTSTACK_SIZE; ++i) {
*(altstack_ + GUARD_SIZE + i) = 42;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be volatile pointers ideally and to prevent the compiler getting too smart.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even the smartest compile cannot assume that there are no side effects from writing into mmap'd memory even if it is technically anonymously mapped... but volatilized it just to be sure.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. As an aside, it's my understanding that a compiler isn't required to keep track of the fact that a pointer is from mmapped memory, it's not possible in general (e.g. at module boundaries) since the C/C++ type system doesn't track this. As a result, it's allowed to treat any non-volatile pointer as a pointer to normal memory and subject to arbitrary optimization, without worrying about other observers and side-effects.


void SignalAction::RemoveSigHandlers() const {
for (const auto& sig : FATAL_SIGS) {
signal(sig, SIG_DFL);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. There's some value in it for tests, but I don't want to make this overkill.

dnoe added 2 commits April 28, 2017 19:28
Used sysconf to get the runtime system page size for guard page sizes
and made altstack a multiple of that size.

Backtrace changed to make all logging critical, skip sentinel frame
completely.

Error handling changes
Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo remaining comments. Looking forward to having stack traces on debug builds! :)

location where a fatal signal occurred. The signal handler will re-raise the
fatal signal with the default handler so a core file will still be dumped after
the stack trace is logged. To inhibit this behavior use
`--define=signal_trace=disabled` on the Bazel command line. No signal handlers will
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to set this in the tools/bazel.rc for the ASAN runs. I'm not sure where ASAN installs its signal handlers (based on your experiences with the tests, it seems after this), but this might be safer in general to avoid conflict.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done now

source/exe/BUILD Outdated
deps = [
":envoy_common_lib",
":hot_restart_lib",
":sigaction_lib",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use a select statement here and condition on //bazel:disable_signal_trace to avoid the link dependency on backward.hpp when we do the google import?

* The memory will be protected from read and write.
*/
const size_t guard_size_;
/**
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: we don't need Doxygen documentation for internal implementation variables, you can use regular C++ comments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason NOT to use Doxygen formatted comments here? Doxygen can still produce useful documentation for private member vars depending on how it is configured to run. I've found this information to be useful in past projects.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None other than inconsistency - if we don't have this documented as a convention across the code base in our style guide, then any Doxygen documentation on internal implementation details is a bit special snowflake. For example, unless we do have widespread agreement to use Doxygen for all documents, I plan on continuing to use // comments for internal comments.

* Signal handlers will be reset to the default, NOT back to any signal
* handler existing before InstallSigHandlers().
*/
void removeSigHandlers() const;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems technically correct to use const here, since no member variables are modified, but it's somewhat strange to the reader as an annotation, as this method does mutate related global state (the process signal handlers). What do you think of the arguments for logical vs. physical constness of a method at https://isocpp.org/wiki/faq/const-correctness#logical-vs-physical-state? I'm thinking the signal handler table is in some sense being treated as part of the logical state of the object here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory the compiler could benefit from awareness that the method doesn't modify object state but in this particular instance I see no way that there is any practical impact. And I agree that the method is not "logically" const, and there is benefit to not marking it as such. Since any performance benefit from informing the compiler of immutable state is negligible in this case then I'm OK with removing the const.

mattklein123
mattklein123 previously approved these changes Apr 30, 2017
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me with small nit. Thanks agreed this will be very nice in debug builds.

mapAndProtectStackMemory();
installSigHandlers();
}
SignalAction(const SignalAction&) = delete;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If you derive from NonCopyable you can delete this. It's redundant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, meant to include that in the change. Thanks.

@mattklein123
Copy link
Copy Markdown
Member

@dnoe FYI I put "Fixes #487" in the description of this PR. That will make it auto close that issue when this is merged.

dnoe added 2 commits May 1, 2017 10:40
Make inclusion of sigaction_lib dependency conditional on command line
arguments

NonCopyable means no need to delete default cctor.

Consider signal handler removal logically non-const.
@mattklein123 mattklein123 merged commit a951c3b into envoyproxy:master May 1, 2017
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

Follow up on #852

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants