Debugging: add a debugger callback mechanism to handle debug events. #11895
cfallin merged 10 commits into bytecodealliance:main
Conversation
This PR adds a notion of "debug events", and a mechanism in Wasmtime to associate a "debug handler" with a store such that the handler is invoked as if it were an async hostcall on each event. The async handler owns the store while its future exists, so the whole "world" (within the store) is frozen and the handler can examine any state it likes with a `StoreContextMut`.

Note that this callback-based scheme is a compromise: eventually, we would like to have a native async API that produces a stream of events, as sketched in bytecodealliance#11826 and in [this branch]. However, the async approach implemented naively (that is, with manual fiber suspends and with state passed on the store) suffers from unsoundness in the presence of dropped futures. Alex, Nick and I discussed this extensively and agreed that the `Accessor` mechanism is the right way to allow a debugger to have "timesliced"/"shared" access to a store (only when polled/when an event is delivered), but we will defer that for now, because it requires additional work (mainly, converting existing async yield points in the runtime to "give up" the store with the `run_concurrent` mechanism). I'll file a followup issue to track that. The idea is that we can eventually build that when ready, and the API we provide to a debugger component can remain unchanged; only this plumbing and the glue to the debugger component will be reworked.

With this callback-based scheme, we expect that one should be able to implement a debugger using async channels to communicate with the callback. The idea is that there would be a protocol where the callback sends a debug event to the debugger main loop elsewhere in the executor (e.g., over a Tokio channel or other async channel mechanism), and when the debugger wants to allow execution to continue, it sends a "continue" message back.

In the meantime, while the world is paused, the debugger can send messages to the callback to query the `StoreContextMut` it has and read out state. This indirection/proxying of Store access is necessary for soundness: again, teleporting the Store out may look like it almost works ("it is like a mutable reborrow on a hostcall") except in the presence of dropped futures with sandwiched Wasm->host->Wasm situations.

This PR implements debug events for a few cases that can be caught directly in the runtime, e.g., exceptions and traps raised just before re-entry to Wasm. Other kinds of traps, such as those normally implemented by host signals, require additional work (as in bytecodealliance#11826) to implement "hostcall injection" on signal reception; breakpoints will be built on top of that. The point of this PR is only to get the initial plumbing in place for events.

[this branch]: https://github.com/cfallin/wasmtime/tree/wasmtime-debug-async
fitzgen
left a comment
LGTM! Will hold off on adding to the merge queue until @alexcrichton has a chance to look at this.
alexcrichton
left a comment
Would it make sense to also start something like a mask of events for a debug handler? For example if a HostcallError isn't desired then that could get masked out while other events would still be processed.
…citly list UnwindState cases.
Maybe eventually, yeah -- that's a good idea. I might defer that to the point where we're building out the top half, to see how we want to use it.
OK, updated based on feedback -- final look before I merge? Thanks!
With respect to clones, object-safety, lifetimes, and …
Ah, neat, an internal adapter that gets monomorphized in the top layer where we have `T` -- I'll incorporate that, thanks!
…ait object-safe. Co-authored-by: Alex Crichton <alex@alexcrichton.com>
All of that worked fine, thanks! I pulled in your patch and then added some more comments to the trait to indicate that it should be cheap to clone, i.e., is recommended to be an …
I pushed a change to avoid this failure, where execution in Pulley did not emit a debug event for a divide-by-zero; this is because the trap-handling path is ever so slightly different in Pulley. I'll address it in a followup (we're very close, and honestly it'd be easier to keep Pulley working throughout this process than to exclude it in config); the …
…roperly in Pulley.

This is a followup to bytecodealliance#11895 where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass, and brings Pulley back to parity with native Cranelift in current debug support.

This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps in a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap", so I had to start to build out the actual trap-handling mechanisms. In any case, this will all be needed for followup work soon that will handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.

This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:

- A raw `NonNull<dyn VMStore>` in the `CallThreadState`, with a long comment about provenance and mut-borrow exclusivity. This is needed right now to allow the interpreter to invoke the debug event handler, but will soon be needed when injecting hostcalls on signals, because a signal context also has no state available from the Wasm code other than what is in TLS. Hence, we need a way to get the store back from the Wasm when we do something that is "morally a hostcall" at a trapping instruction. I do believe this is sound, or at least close to it if not (please scrutinize carefully!); the basic idea is that the Wasm acts as an opaque blob in the middle, and the pointer comes out of it one way or another (the normal way, as the first arg to a hostcall, or the weird way, via TLS and the `CallThreadState` during a trap). Exclusive ownership is still clear at any given point, and only one `&mut` ever exists in the current frame at a time. That said, I haven't tested with miri yet.

  This does require careful thought about the Wasm compilation, too; we need the moral equivalent of a `&mut self` reborrow as if we were making a hostcall on each trapping instruction. It turns out that we already treat trapping instructions as memory fences, so nothing loaded from the store can be moved or cached across them, and I've added a comment now about how this is load-bearing.

- Updates to `CallThreadState`'s "exit state", normally set by the exit trampoline, that we now also set when we invoke a debug event handler during a trap context[^1] so that `Store::debug_frames` properly sees the current activation. This is a little more awkward than it could be because we store the *trampoline* FP, not the last Wasm FP, and there is no trampoline frame in this case, so I've added a flag and some conditionals. I'm happy to refactor instead to go (back) to storing the last Wasm FP, with the extra load in the exit trampoline to compute it.

- A whole bunch of plumbing, creating a large but mechanical diff, in the code translator to actually add debug tags on all traps and calls to `raise`. It turns out that once I got all of the above working in Pulley, the test disagreed about the current Wasm PC between native and Pulley, and Pulley was right; native was getting it wrong because the `raise` libcall was sunk to the bottom in a cold block and, without tags, we scanned backward to pick up the last Wasm PC in the function. This new plumbing and addition of tags in all the appropriate places fixes that.

[^1]: I keep saying "during a trap context" here, but to avoid any signal-safety scares, note that when this is done for native signals in a followup PR, we will inject a hostcall by modifying stack/register state and returning from the actual signal context, so it really is as if we did a hostcall from a trapping instruction.
Thanks to Alex for help on working out how to keep this from requiring `T: Send` throughout Wasmtime!