feat(runtimed): add kernel death detection and protocol fixes #592
Merged
Three critical protocol fixes:

1. send_frame: Add send-side frame size check to prevent silent u32 truncation when payload exceeds MAX_FRAME_SIZE. The receive side already enforced this limit but the send side could silently wrap.

2. kernel_manager: Add KernelDied signal when the iopub read loop exits (kernel process crash). Previously, the execution queue would get permanently stuck with executing=Some(cell_id) and no way to unblock. Now the iopub loop sends QueueCommand::KernelDied which resets the queue and broadcasts an error status to all peers.

3. notebook_sync_server: Handle broadcast::RecvError::Lagged on the kernel broadcast channel. Previously, if a peer fell behind on receiving broadcasts (rapid kernel output), the Lagged error was not matched by the Ok() pattern and broadcasts were silently dropped. Now triggers a full Automerge doc sync to catch the peer up on missed output data.
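The send-side check in item 1 could look roughly like the sketch below. It is not the PR's actual code: the value of MAX_FRAME_SIZE, the 4-byte big-endian length prefix, and the `send_frame` signature are all assumptions made for illustration.

```rust
use std::convert::TryFrom;
use std::io::{self, Write};

/// Assumed limit for illustration; the PR does not state the real value.
pub const MAX_FRAME_SIZE: usize = 16 * 1024 * 1024;

/// Write one length-prefixed frame, refusing payloads that are too large.
/// The big-endian u32 prefix is an assumption about the wire format.
pub fn send_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    // Reject oversized payloads before anything is written, mirroring the
    // limit that the receive side already enforces.
    if payload.len() > MAX_FRAME_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!(
                "frame of {} bytes exceeds MAX_FRAME_SIZE ({} bytes)",
                payload.len(),
                MAX_FRAME_SIZE
            ),
        ));
    }

    // A checked conversion instead of `as u32`, so the length can never wrap.
    let len = u32::try_from(payload.len()).map_err(|_| {
        io::Error::new(io::ErrorKind::InvalidInput, "frame length does not fit in u32")
    })?;

    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)?;
    Ok(())
}
```

Checking before the first write means a rejected frame leaves the stream untouched, so the peer never sees a length prefix without its payload.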
…death detection

The previous KernelDied fix relied on the iopub read loop exiting when the kernel dies. However, zeromq auto-reconnects instead of returning an error, so the iopub loop never exits. This adds two complementary detection mechanisms:

1. Process watcher task - spawns alongside the kernel and calls process.wait(). When the process exits (e.g., os._exit(1)), it immediately sends KernelDied. The process is now owned by this task rather than stored in self.process.

2. Heartbeat monitor task - periodically pings the kernel's heartbeat channel. If the kernel becomes unresponsive (alive but hung), it detects this within ~8 seconds and sends KernelDied.

Also makes kernel_died() idempotent since both tasks may fire.
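A minimal sketch of why kernel_died() has to be idempotent, using only the standard library. Two plain threads stand in for the process watcher and heartbeat monitor tasks. The `ExecutionQueue` fields, the `run_detectors` helper, and the choice to clear pending cells on reset are hypothetical and not taken from the PR.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum QueueCommand {
    KernelDied,
}

/// Hypothetical queue state; field names are illustrative only.
#[derive(Debug, Default)]
pub struct ExecutionQueue {
    pub executing: Option<String>,
    pub pending: Vec<String>,
    pub kernel_dead: bool,
    /// Status strings that would be broadcast to peers.
    pub broadcasts: Vec<String>,
}

impl ExecutionQueue {
    /// Reset the queue on kernel death. Returns true only for the call that
    /// actually performed the reset, so repeated signals are harmless.
    pub fn kernel_died(&mut self) -> bool {
        if self.kernel_dead {
            return false; // already handled by the other detector
        }
        self.kernel_dead = true;
        self.executing = None;
        self.pending.clear(); // assumption: pending cells are dropped too
        self.broadcasts.push("error".to_string());
        true
    }
}

/// Two detector threads both report the same death; count effective resets.
pub fn run_detectors(queue: &mut ExecutionQueue) -> usize {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                let _ = tx.send(QueueCommand::KernelDied);
            })
        })
        .collect();
    drop(tx); // the receive loop ends once every sender is gone

    let mut resets = 0;
    for cmd in rx {
        match cmd {
            QueueCommand::KernelDied => {
                if queue.kernel_died() {
                    resets += 1;
                }
            }
        }
    }
    for handle in handles {
        handle.join().unwrap();
    }
    resets
}
```

With the guard in place, peers receive a single error status even when both detectors fire for the same death.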
The frontend only recognizes exact status values (idle, busy, error, shutdown, etc). The previous "error: Kernel process died unexpectedly" was being ignored because it didn't match. Changed to just "error".
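To illustrate why the longer string was ignored, here is a sketch of exact-match status parsing. It models the frontend's behavior in Rust purely for illustration. The `KernelStatus` enum and `parse_status` function are hypothetical, and only the four status values named above are covered.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Idle,
    Busy,
    Error,
    Shutdown,
}

/// Exact-match parsing: any string that is not a known status is rejected.
pub fn parse_status(raw: &str) -> Option<KernelStatus> {
    match raw {
        "idle" => Some(KernelStatus::Idle),
        "busy" => Some(KernelStatus::Busy),
        "error" => Some(KernelStatus::Error),
        "shutdown" => Some(KernelStatus::Shutdown),
        // "error: Kernel process died unexpectedly" lands here and is dropped.
        _ => None,
    }
}
```

Under this model, `parse_status("error")` yields `Some(KernelStatus::Error)`, while the previous descriptive string yields `None`.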
Store process_watcher_task handle immediately after spawn so early return error paths (kernel_info timeout/error) can abort it. This prevents the kernel process from running indefinitely when launch validation fails.
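The ordering described here can be sketched without an async runtime. In the sketch below, the `Abort` trait, `FlagHandle`, and the `launch` signature are hypothetical stand-ins. The real code presumably holds an async task handle, while the flag-based handle only records that abort was called.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a task handle that can be aborted.
pub trait Abort {
    fn abort(&self);
}

/// Test-friendly handle that only records whether abort() was called.
pub struct FlagHandle(pub Arc<AtomicBool>);

impl Abort for FlagHandle {
    fn abort(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

pub struct Kernel<H: Abort> {
    pub process_watcher_task: Option<H>,
}

impl<H: Abort> Kernel<H> {
    /// Spawn the watcher, store its handle immediately, then validate.
    pub fn launch<S, V>(&mut self, spawn_watcher: S, validate: V) -> Result<(), String>
    where
        S: FnOnce() -> H,
        V: FnOnce() -> Result<(), String>,
    {
        // Store the handle before any fallible step runs.
        self.process_watcher_task = Some(spawn_watcher());

        if let Err(e) = validate() {
            // Early-return path: the stored handle makes cleanup possible.
            if let Some(task) = self.process_watcher_task.take() {
                task.abort();
            }
            return Err(e);
        }
        Ok(())
    }
}
```

If the handle were stored only after validation succeeded, the error path would have nothing to abort, which is the leak this commit closes.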
Summary
This PR brings in critical protocol correctness fixes for kernel lifecycle management and adds robust kernel death detection:
Key Changes
process.wait()
Verification
os._exit(1)) clear the execution queue within ~100ms
PR submitted by @rgbkrk's agent, Quill