Improving Shadow's support for signals #1851
sporksmith started this conversation in Ideas
Thought I'd share a design I sketched out for improving Shadow's support for signals. This is probably a bit opaque for anyone not already familiar with Shadow's internals, but I'm happy to answer questions.
Overview
Today basic signal sending and delivery is implemented mostly by passing
everything through to native syscalls and signals in the kernel. This largely
works in ptrace mode, but not in preload mode (which is the default due to better performance). There are a few problems:
- There's an IPC state machine between Shadow and the shim. Naively sending the
signal to the managed thread disrupts that state machine. This is one of the
main problems we've observed in practice, and e.g. why our current signal
tests are disabled in preload mode
(Handle or avoid nested syscalls in preload mode #1455). This may also cause breakage
in golang's signal-based preemption mechanism
(snowflake simulation broken #1549 (comment)).
- The shim itself installs signal handlers for SIGSEGV (to trap and emulate the
RDTSC instruction) and SIGSYS (to handle traps of the seccomp
filter for syscalls to be intercepted). These currently run on the
thread's regular stack, which results in stack corruption in golang
(snowflake simulation broken #1549 (comment)).
We also currently silently ignore attempts by the managed process to mask or install
its own handlers for these signals, though this hasn't yet caused problems
in practice.
- Since the syscall being executed by the shim when the signal is received is
part of the shim's syscall machinery, and not the original syscall that we're
emulating (as it is in ptrace), the `SA_RESTART` flag won't cause the kernel
to retry the original syscall for us. (We haven't seen this problem in
practice, but that may only be because things already go wrong earlier.)
To address the state-machine issue, my plan is to finish the current syscall/signal from Shadow's perspective,
while pushing a frame on the stack on the shim side. This is analogous to how
the Linux kernel handles signals, and should allow us to cleanly handle syscalls and signals from within a signal handler.
When a managed thread or process is sent a signal, Shadow will set the siginfo for the signal in shared memory. Whenever a thread receives a
`SHD_SHIM_EVENT_SYSCALL_COMPLETE` message, it will check whether there's a pending unblocked signal that it needs to handle before returning from the syscall. When there is one, it will store everything it needs to finish resolving the syscall
(either restarting it or returning an error code) as locals on the current
stack frame, call the handler, and then finish resolving the current syscall
(by returning the originally specified value, or retrying the syscall).
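The push-a-frame flow above could look roughly like the sketch below. All names here (`struct pending_signal`, `shim_finish_syscall`, `emulated_handler_t`) are illustrative stand-ins, not Shadow's actual API:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Illustrative sketch (not Shadow's actual API): on a
 * SHD_SHIM_EVENT_SYSCALL_COMPLETE with a pending unblocked signal,
 * keep what's needed to finish the syscall as locals on the current
 * frame, run the handler, then resolve the syscall. */

typedef void (*emulated_handler_t)(int signo);

struct pending_signal {
    bool present;
    int signo;
    emulated_handler_t handler;
};

/* Finish a completed syscall, delivering any pending signal first.
 * `rv` is the value the syscall is supposed to return. */
long shim_finish_syscall(long rv, struct pending_signal *pending) {
    if (pending->present) {
        /* Save what we need as locals, and clear the pending slot so
         * the handler can itself make syscalls and take signals. */
        int signo = pending->signo;
        emulated_handler_t handler = pending->handler;
        pending->present = false;
        /* Run the managed program's handler on this frame, analogous
         * to the kernel pushing a signal frame on the user stack. */
        handler(signo);
    }
    /* Finish resolving the syscall by returning the originally
     * specified value (a real implementation might instead restart
     * it when SA_RESTART applies). */
    return rv;
}

/* Demo handler used in the usage example. */
static volatile int demo_flag = 0;
static void demo_handler(int signo) { demo_flag = signo; }
```

Because the handler runs on the same frame that holds the saved syscall result, nested syscalls (and even nested signals) inside the handler naturally unwind back to the original syscall's resolution.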
We might be able to save a little bit of emulation and shadowed data
structures by still sending a signal to the thread. In that case we would still
need some way of telling the thread to prepare to receive a signal, but
instead of calling (or even knowing about) the handler itself, it could just
call something like `sigsuspend` to unmask and wait for the signal. I don't
think this would actually save that much complexity though: e.g. we would still
need to respect the simulated signal masks, either by fully tracking them
ourselves, or by saving and restoring the real signal mask in the syscall handling
code (which would require a native syscall for each). We would also still need
to handle retrying the original syscall when appropriate, saving and restoring
interposition state, etc. In the end I'm pretty sure this would be a wash at
best.
Most of this proposal focuses on getting signals working well in preload mode.
Since ptrace mode is likely to be dropped unless we find a compelling reason
not to, we won't invest effort in further improving the accuracy of its signal
handling. We will preserve the current functionality by continuing to delegate
to the native syscalls for signal handling in ptrace mode.
Milestones
These don't necessarily map to milestones in shadow's project; it's just a
rough sequence of how to prioritize functionality.
- `sigaltstack`. It appears this may be enough for at least some small golang simulations,
such as the one in snowflake simulation broken #1549.
  - Use `sigaltstack` to run the shim's signal handlers on a dedicated stack.
This prevents stack corruption in golang, which can migrate goroutine
stacks between native threads. (I still don't 100% understand exactly how
this results in stack corruption, but golang itself uses `sigaltstack` for
presumably a similar reason, and I've confirmed that doing so for our own
handlers fixes the current issue.)
  - Prevent the managed program's calls to `sigaltstack` from actually making the native
call. In particular on debian11-slim, glibc seems to do this sometimes.
Running our own handlers on the specified stack might work, but they also
drop the `SS_AUTODISARM` flag when doing this, which definitely breaks
our code. For this milestone it should be sufficient to mostly fake this
syscall without actually doing much.
- Signal-related calls and corresponding state will need to be emulated instead of being
passed through to the kernel as they are now, to prevent interfering with
preload mode's own usage of signals and its communication state machine
between shadow and the shim.
- For signal-related syscalls we don't implement yet (e.g. `signalfd`),
we should return an error rather than passing through as a native syscall.
The former should make it relatively easy to identify the failure and see
that we can implement the syscall. Continuing to do the latter may break
things in more subtle ways.
- doesn't make sense to prioritize them without a concrete use case.
- concrete use cases yet.
- any concrete use cases yet.
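The `sigaltstack` milestones above can be illustrated with a minimal, self-contained sketch of installing a handler that runs on a dedicated stack via `SA_ONSTACK`. The names here are ours, not the shim's actual code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of the first milestone's idea: run our handlers
 * on a stack owned by us, so they never touch the managed thread's
 * (possibly migrated) regular stack. */

static char *g_alt_stack;
static size_t g_alt_stack_size;
static volatile int g_ran_on_altstack;

static void on_signal(int signo) {
    /* Record whether a local variable lives inside the alternate
     * stack, i.e. whether we actually switched stacks. */
    char probe;
    uintptr_t lo = (uintptr_t)g_alt_stack;
    uintptr_t sp = (uintptr_t)&probe;
    g_ran_on_altstack = (sp >= lo && sp < lo + g_alt_stack_size);
    (void)signo;
}

/* Install an alternate signal stack and a handler that runs on it. */
void install_altstack_handler(int signo) {
    g_alt_stack_size = SIGSTKSZ;
    g_alt_stack = malloc(g_alt_stack_size);

    stack_t ss = {0};
    ss.ss_sp = g_alt_stack;
    ss.ss_size = g_alt_stack_size;
    sigaltstack(&ss, NULL);

    struct sigaction sa = {0};
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_ONSTACK; /* run on the sigaltstack, not the current stack */
    sigaction(signo, &sa, NULL);
}
```

(`SS_AUTODISARM`, mentioned above, additionally clears the alternate-stack configuration for the duration of the handler; it's a Linux extension set in `ss_flags`.)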
Overview of code changes
New event type
We will need a new event type in Shadow's internal scheduler:
- "Deliver signal" event: scheduled by syscalls such as `kill` and `tgkill`. Making
this an event lets us "unwind the stack" of running those syscalls before
beginning to handle the signal. When the event runs:
  - (Not sure whether this can actually happen).
  - without this signal blocked, reschedule this event for that thread.
  - Call `thread_continue` for the given thread.

New data structures
We'll tentatively keep per-thread signal state in the per-thread shared memory
block. I'm not sure yet whether this is necessary, but it shouldn't hurt, and may
let us avoid some context switches between Shadow and the shim.

- The thread's signal mask. (`sigprocmask(2)`: "Each of the threads in a process has its own signal
mask.")
- The `siginfo_t` (see `sigaction(2)`) of pending thread-directed signals. (Alternatively, to save
memory this could probably be a bit-field in memory, and then a dynamically
allocated map from signal number to siginfo in Shadow's memory, which could
then be sent over the IPC pipe in a "handle signal" message.)
- A `stack_t` struct that specifies an alternate stack on which to run
signal handlers. See `sigaltstack(2)`.

Likewise the following state is per-process, and is probably best kept in a
per-process shared memory block:

- The `siginfo_t` (see `sigaction(2)`) of pending process-directed signals. (Or bitfield + map as
for thread-signal pending siginfos.)
- A table of `struct sigaction`, which includes disposition (Term|Ign|Core|Stop|Cont), pointer
to handler, and flags. See `signal(7)`. As a flat array this will be
large-ish (64 signals * ~36 bytes -> ~2300 bytes). Might be worth having this
shim-side only instead of in shared memory, so that it can start small and
grow dynamically as needed, but some things will be easier and more efficient
if it's available shadow-side as well. Hopefully the shared region won't
spill to more than a page per process anyway.
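A hypothetical C layout for this shared state; all struct and field names below are illustrative, not Shadow's:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>

/* Hypothetical layout for the shared-memory signal state described
 * above. Signals are numbered 1..64, so leave room to index the
 * arrays directly by signal number. */

#define SHD_NSIG 65

/* Per-thread state, kept in the per-thread shared memory block. */
struct shd_thread_signal_state {
    sigset_t blocked;                  /* the thread's simulated signal mask */
    sigset_t pending;                  /* pending thread-directed signals */
    siginfo_t pending_info[SHD_NSIG];  /* siginfo per pending signal */
    stack_t altstack;                  /* sigaltstack(2) configuration */
};

/* Per-process state, kept in a per-process shared memory block. */
struct shd_process_signal_state {
    sigset_t pending;                   /* pending process-directed signals */
    siginfo_t pending_info[SHD_NSIG];
    struct sigaction actions[SHD_NSIG]; /* per-signal disposition table */
};
```

Note that storing full `siginfo_t` arrays inline trades memory for simplicity; the bit-field-plus-map alternative above shrinks the shared block at the cost of an extra IPC message.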
In shim syscall handling loop
In the syscall handling loop, when a syscall result returns, check whether
there's also a deliverable signal pending by checking the process and
thread-level siginfos and masks. If so, look up the disposition in the sigaction table:

- Use native `sigaction` to change the native disposition of that signal to Term, and then use `tgkill` to send the current
thread that signal. The OS will kill the process with that signal, which
Shadow already detects and handles.
- For Term, use the native signal disposition and `tgkill` to force real
death by the signal.
- For Stop, on the shadow side, mark that the process is now in a 'stopped' state. There's more to
work out here about the details of what that means and how a process gets out
of that state.

To run a handler:

- Ensure we're running on the shim's stack (it should already be).
- Save the `sa_flags` for this signal locally on the stack. Need to
refer to these during cleanup.
- Block the signal itself unless `SA_NODEFER` is set in the sigaction, and block the signals in `sa_mask` from the sigaction.
- Switch stacks if `SA_ONSTACK` is set in the sigaction and one has
been set via `sigaltstack`. If the alternate stack has `SS_AUTODISARM`, save a copy of it locally and clear it.
- Save `_shim_disable_interposition` and set it to 0,
i.e. re-enable syscall interposition without unwinding the stack.
- Copy the `siginfo` to a local, and clear the one being serviced from
the thread or process-level siginfos.
- Call the handler. (Check `SA_SIGINFO` to determine which field of the
sigaction has the handler and what its function signature is.)
- Restore `_shim_disable_interposition`, disabling interposition, and return from the wrapper
function, which will return to the old stack if applicable. Also restore
the `altstack` if it was cleared via `SS_AUTODISARM`.
- If `SA_RESETHAND` was set in the saved `sa_flags`, then reset the sigaction for
this signal to default.
- If `SA_RESTART` was set in the saved `sa_flags` (or current flags? `sigaction(2)` doesn't specify which), restart the syscall; for
relative timeouts, may need to adjust for time passed. See `restart_syscall(2)`.
- Otherwise, finish resolving the syscall (typically returning `-EINTR`, but e.g. 0
in the case of `kill` sending a signal to the current thread).

In shim initialization
- Use `sigaltstack` to run the shim's handlers on a stack owned by
the shim.
- Install handlers for signals that can be raised
natively while executing code. These include `SIGBUS`, `SIGFPE`, and
`SIGSEGV`. These handlers should call into the same core signal emulation
code as used while accepting signals during a syscall, including switching
the stack if necessary, etc. MS3, since the common case is that these signals
are fatal anyway - either immediately or after a handler writes out some
debugging information.
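The deliverable-signal check and handler-mask bookkeeping described in the syscall handling loop above can be sketched with small helpers over the simulated masks. These are illustrative helpers, not Shadow's API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>

/* Illustrative helpers over the *simulated* signal state. */

/* Return the lowest-numbered signal that is pending and not blocked,
 * or 0 if there is none. Linux signals are numbered 1..64. */
int first_deliverable(const sigset_t *pending, const sigset_t *blocked) {
    for (int signo = 1; signo < 65; ++signo) {
        if (sigismember(pending, signo) == 1 &&
            sigismember(blocked, signo) != 1) {
            return signo;
        }
    }
    return 0;
}

/* Compute the mask in effect while a handler runs: the old mask, plus
 * sa_mask from the sigaction, plus the signal itself unless SA_NODEFER
 * is set -- mirroring the kernel's sigaction(2) semantics. */
sigset_t handler_mask(sigset_t old_mask, const struct sigaction *sa, int signo) {
    sigset_t m = old_mask;
    for (int s = 1; s < 65; ++s) {
        if (sigismember(&sa->sa_mask, s) == 1) {
            sigaddset(&m, s);
        }
    }
    if (!(sa->sa_flags & SA_NODEFER)) {
        sigaddset(&m, signo);
    }
    return m;
}
```

Tracking the mask this way keeps everything in the simulated state, avoiding the native `sigprocmask` round-trip per handler invocation discussed earlier.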
In Thread
I don't think there's much code change in `Thread` or `ThreadPreload`, other than
initializing the per-thread data structures.

When a thread is in a blocking syscall and interrupted by a signal, ultimately
its `thread_continue` gets called, and it does the same as before: call the
syscall handler for the blocked syscall. It's up to
`syscallhandler_make_syscall` and/or the specific handlers to recognize that
there's a deliverable signal pending, clean up as needed, and return an
appropriate value for the current syscall (`-EINTR`).

Signal-related syscall handlers
- `tgkill(2)`: MS2. Send a signal to a specific thread. If the specified thread
already has a pending signal with the same signal number, do nothing.
Otherwise write the pending signal info. If the target thread is not the one
currently executing, and that thread doesn't currently have this signal
blocked, schedule a "deliver signal" event for the target thread, for "now".
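The `tgkill` logic just described, sketched against a hypothetical thread type. The struct, the `deliver_event_scheduled` flag, and `emu_tgkill` are all stand-ins for Shadow's actual scheduler API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Stand-ins for Shadow's thread/scheduler types. */
struct emu_thread {
    int tid;
    sigset_t blocked;              /* simulated per-thread signal mask */
    sigset_t pending;              /* pending thread-directed signals */
    siginfo_t pending_info[65];    /* indexed by signal number, 1..64 */
    bool deliver_event_scheduled;  /* stands in for scheduling a
                                    * "deliver signal" event for "now" */
};

/* Sketch of the emulated tgkill handler; returns 0 like tgkill(2). */
int emu_tgkill(struct emu_thread *target, struct emu_thread *caller,
               int signo, const siginfo_t *info) {
    /* If the thread already has a pending signal with this number,
     * do nothing. */
    if (sigismember(&target->pending, signo) == 1) {
        return 0;
    }
    /* Otherwise write the pending signal info. */
    sigaddset(&target->pending, signo);
    target->pending_info[signo] = *info;
    /* If the target isn't the calling thread and doesn't have the
     * signal blocked, schedule a "deliver signal" event. */
    if (target != caller && sigismember(&target->blocked, signo) != 1) {
        target->deliver_event_scheduled = true;
    }
    return 0;
}
```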
- `kill(2)`: MS2. Similar to `tgkill`, but sends a signal to a process or
process group. For each targeted process (which will currently only be 1
since we don't track process groups):
  - If the signal is already pending in the current thread, continue to the next process.
  - If the current thread doesn't have the signal blocked, schedule a "deliver signal" event for that thread.
  - Otherwise record it in the "process-signal pending siginfos", then iterate through threads until we find
one that hasn't masked that signal (if any) and schedule a "deliver signal"
event for that thread.
- `pause(2)`: MS3. Suspends execution until any signal is caught. Basically want to
return "blocked" from the signal handler and maybe set a flag on the thread.
- `pidfd_send_signal(2)`: MS3.
- `restart_syscall(2)`: MS3. Used by the signal trampoline to restart a syscall,
updating time-related parameters if needed (for syscalls that specify a
relative timeout). We shouldn't need to implement a handler for this syscall
per se, but the man page is a good reference for what our equivalent
shim-side code will need to take into account when restarting an interrupted
system call.
- `rt_sigqueueinfo(2)`: MS3.
- `setitimer(2)`: MS3.
- `sigaction(2)`: Configure how a signal will be handled. This will manipulate
the signal disposition table.
- `sgetmask(2)`: MS3. Deprecated version of `sigprocmask`.
- `sigaltstack(2)`: MS1 and MS2. Configures a stack for signal handlers to
execute on (instead of just pushing a frame onto the current thread's regular
stack), and returns information about the current configuration.
  - MS1: Don't let these calls interfere with the shim's `sigaltstack` configuration. We can probably get
away with just faking the returned information about the current
configuration for now.
  - MS2: Track the stack configured by these calls. Actually switch to this stack when running the managed thread's
signal handlers. We might be able to get away without this, but failures
due to this functionality missing may be difficult to debug.
- `signal(2)`: MS3. Deprecated alternative to `sigaction`.
- `signalfd(2)`: MS3. Creates a file descriptor that can be used to listen for
signals.
- `sigpending(2)`: MS3. Returns the set of signals pending for the thread
(union of process-directed and thread-directed pending signals).
- `sigprocmask(2)`: MS2. Fetch and/or change the calling thread's signal mask.
- `sigreturn(2)`: MS3. This is meant to be called by the kernel's/libc's signal
handling code, which won't be invoked for emulated syscalls.
- `sigsuspend(2)`: MS3. Change the signal mask, `pause(2)`, then restore the signal mask.
- `sigtimedwait(2)`: MS3. Wait for a signal.
- `sigwaitinfo(2)`: MS3. Wait for a signal.

Blocking-syscall handlers
When we enter `syscallhandler_make_syscall` with an unblocked signal already
pending, and a blocked syscall was in progress, we may need to do something to
cancel it to avoid losing data, and then we should ultimately return `-EINTR`.

If we don't do anything to cancel, will anything go wrong? e.g. if we don't
cancel a blocked read, and the thread is later continued when the file
descriptor becomes readable, hopefully we don't lose data? I'd think this
would be the case assuming e.g. no data is copied nor the position in the
stream advanced until the "second half" of the syscall handler runs, which it
never would in this case. If so, then life is simple:
`syscallhandler_make_syscall` just returns `-EINTR` if there's an unblocked
signal pending on entry.

If not, then the next best option is if the base `SysCallHandler` has enough
information that we can generically cancel the current syscall from
`syscallhandler_make_syscall`. (This could also be desirable to avoid a
spurious wakeup later.)
I could imagine a case where we need to still invoke the syscall-specific
handler, which would be responsible for recognizing that there's a pending
signal and cleaning up any pending state. ~Every blocking syscall handler would
need to be updated.

TODO: Figure out whether a signal whose disposition is `Ign` still interrupts
syscalls. If not, then the syscall handler needs to check this and ignore the
signal in that case.
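If the simple answer holds, the entry check could look like the sketch below. The names are hypothetical; `make_syscall_entry_check` stands in for the check at the top of `syscallhandler_make_syscall`:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical per-thread state; names are illustrative. */
struct emu_thread_state {
    sigset_t pending;
    sigset_t blocked;
    bool was_blocked_in_syscall; /* a blocking syscall was in progress */
};

/* True if any signal is pending and not blocked (signals 1..64). */
static bool has_unblocked_pending(const struct emu_thread_state *t) {
    for (int s = 1; s < 65; ++s) {
        if (sigismember(&t->pending, s) == 1 &&
            sigismember(&t->blocked, s) != 1) {
            return true;
        }
    }
    return false;
}

/* Stand-in for syscallhandler_make_syscall's entry check: return
 * -EINTR when a deliverable signal should interrupt the blocked
 * syscall, otherwise a positive "keep going" sentinel. */
long make_syscall_entry_check(const struct emu_thread_state *t) {
    if (t->was_blocked_in_syscall && has_unblocked_pending(t)) {
        return -EINTR;
    }
    return 1; /* proceed to the specific syscall handler */
}
```

This sketch intentionally omits the `Ign`-disposition question from the TODO above; if ignored signals turn out not to interrupt syscalls, the check would also need to consult the disposition table.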
New tests
Incomplete, but noting new tests we need as I think of them:
- Install a handler that mutates a global, then `raise(3)` that signal. Check
immediately afterwards that the global was mutated by the handler.
(`raise(3)` guarantees that the handler has already run when `raise` returns.)
- Block the signal with `sigprocmask`. `raise(3)` that signal. Check that the global wasn't mutated.
Unblock the signal with `sigprocmask`. I think what we want to happen is
that the signal handler runs, and then `sigprocmask` returns without error.
Will need to verify that's what happens (or at least can happen) on Linux.
- Same as above, but send the signal with `kill` or `tgkill` from another thread.
- Install a handler that records the current thread id. Start several threads that sleep in a loop. Use `kill` to
send the signal to the process. Verify that the handler ran exactly once (it
doesn't matter which thread ran it).
- Same, but target a specific thread with `tgkill` and verify that the targeted thread ran it.
- Arrange for a `read` on a socket and/or pipe to be blocked. From another
thread, unblock the operation (by writing to the other end), and then send a
signal to interrupt the operation. No data should be lost. Either the
operation is interrupted but subsequent reads correctly pick up where the
last successful read left off, or the operation completes (and the signal is
still handled). I'm not sure to what extent the latter is legal - e.g. when
the target thread runs again, is it legal to run the handler for the pending
signal and then return successfully from the syscall?
- Interrupt a blocked syscall with a signal whose disposition is `Ign`. Not sure whether the correct behavior is to
remain blocked on the syscall or have the syscall return `-EINTR`.

Questions/TODO
- What happens when the same signal is pending both for a thread and
for the process? I'm guessing both could be delivered, potentially to the
same thread, one via the process-level siginfo and one via the thread-level
siginfo?
Other notes
- process-directed vs. thread-directed signals in `signal(7)`.