feat(runtimed): add kernel death detection and protocol fixes #592

Merged
rgbkrk merged 4 commits into main from quill/cherry-pick-bf9322e
Mar 8, 2026

rgbkrk commented Mar 8, 2026

Summary

This PR brings in critical protocol correctness fixes for kernel lifecycle management and adds robust kernel death detection:

  1. Send-side frame size check (connection.rs): Prevents silent u32 truncation when payload exceeds MAX_FRAME_SIZE
  2. Kernel death detection (kernel_manager.rs): Dual mechanism catches both immediate process exit and unresponsive kernels
  3. Broadcast error handling (notebook_sync_server.rs): Properly handles lagged broadcast subscribers
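
As a rough illustration of the send-side check in item 1, here is a minimal sketch, not the code in connection.rs. It assumes a frame is a 4-byte big-endian length followed by the payload, and the `MAX_FRAME_SIZE` value shown is a placeholder, since the PR does not state the real limit or wire layout.

```rust
use std::convert::TryFrom;
use std::io::{self, Write};

/// Placeholder limit for illustration; the real value is not given in the PR.
const MAX_FRAME_SIZE: usize = 16 * 1024 * 1024;

/// Writes one frame as a 4-byte big-endian length followed by the payload.
/// The length-prefix layout is an assumption, not taken from connection.rs.
fn send_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    // Reject oversized payloads before anything is written, mirroring the
    // limit the receive side already enforces.
    if payload.len() > MAX_FRAME_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!(
                "frame of {} bytes exceeds limit of {} bytes",
                payload.len(),
                MAX_FRAME_SIZE
            ),
        ));
    }
    // Checked conversion instead of `as u32`, which would wrap silently.
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "frame length exceeds u32"))?;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)
}
```

Checking before the first write means a rejected payload leaves the stream untouched, so the peer never sees a length header without its body.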

Key Changes

  • Process watcher: Detects kernel process exit immediately via process.wait()
  • Heartbeat monitor: Periodic pings catch hung kernels within ~8 seconds
  • Idempotent handling: Both mechanisms safely coexist
  • Frontend sync: Status broadcast now sends valid "error" value
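
The process watcher can be sketched roughly as follows. This is an illustration under assumptions, not the code in kernel_manager.rs: a std thread and an mpsc channel stand in for whatever task and channel types the daemon uses, and `spawn_watcher` and `watch_child` are hypothetical names. Only `QueueCommand::KernelDied` comes from the PR, and its payload-free shape here is a guess.

```rust
use std::io;
use std::process::Child;
use std::sync::mpsc::Sender;
use std::thread::{self, JoinHandle};

/// Command name taken from the PR; the payload-free shape is an assumption.
#[derive(Debug, PartialEq)]
enum QueueCommand {
    KernelDied,
}

/// Runs a blocking wait on a background thread and reports KernelDied when
/// it returns, whether the wait succeeded or failed.
fn spawn_watcher<F>(wait_for_exit: F, tx: Sender<QueueCommand>) -> JoinHandle<()>
where
    F: FnOnce() -> io::Result<Option<i32>> + Send + 'static,
{
    thread::spawn(move || {
        let _exit_code = wait_for_exit();
        // The receiver may already be gone during shutdown; ignore that.
        let _ = tx.send(QueueCommand::KernelDied);
    })
}

/// Watches a real child process. The watcher thread takes ownership of the
/// handle, so nothing else can call wait() on it.
fn watch_child(mut child: Child, tx: Sender<QueueCommand>) -> JoinHandle<()> {
    spawn_watcher(move || child.wait().map(|status| status.code()), tx)
}
```

Separating the generic `spawn_watcher` from `watch_child` keeps the reporting logic testable without launching a process.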

Verification

  • Kernel crashes (e.g., os._exit(1)) clear the execution queue within ~100ms
  • UI shows error status instead of idle
  • Can restart kernel after crash
  • Hung kernels detected by heartbeat within ~8 seconds

PR submitted by @rgbkrk's agent, Quill

rgbkrk added 4 commits March 7, 2026 14:24
Three critical protocol fixes:

1. send_frame: Add send-side frame size check to prevent silent u32
   truncation when payload exceeds MAX_FRAME_SIZE. The receive side
   already enforced this limit but the send side could silently wrap.

2. kernel_manager: Add KernelDied signal when the iopub read loop exits
   (kernel process crash). Previously, the execution queue would get
   permanently stuck with executing=Some(cell_id) and no way to
   unblock. Now the iopub loop sends QueueCommand::KernelDied which
   resets the queue and broadcasts an error status to all peers.

3. notebook_sync_server: Handle broadcast::RecvError::Lagged on the
   kernel broadcast channel. Previously, if a peer fell behind on
   receiving broadcasts (rapid kernel output), the Lagged error was
   not matched by the Ok() pattern and broadcasts were silently
   dropped. Now triggers a full Automerge doc sync to catch the peer
   up on missed output data.
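
The Lagged handling in item 3 can be pictured with a small decision function. This is a sketch, not the code in notebook_sync_server.rs: the local `RecvError` enum is a stand-in that mirrors the two variants of tokio's `broadcast::error::RecvError` so the example needs no external crates, and `PeerAction` and `on_broadcast` are hypothetical names.

```rust
/// Stand-in mirroring the variants of tokio's `broadcast::error::RecvError`.
#[derive(Debug, PartialEq)]
enum RecvError {
    Closed,
    Lagged(u64),
}

/// What the per-peer loop should do next (hypothetical type).
#[derive(Debug, PartialEq)]
enum PeerAction {
    Forward(String),
    FullDocSync { skipped: u64 },
    Stop,
}

/// Decides how to react to one receive result from the broadcast channel.
fn on_broadcast(result: Result<String, RecvError>) -> PeerAction {
    match result {
        Ok(update) => PeerAction::Forward(update),
        // Matching only Ok(..) would let this case pass unnoticed, and the
        // peer would silently miss the skipped messages.
        Err(RecvError::Lagged(skipped)) => PeerAction::FullDocSync { skipped },
        Err(RecvError::Closed) => PeerAction::Stop,
    }
}
```

An exhaustive `match` forces every error variant to be handled explicitly, which is what the fix relies on.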
…death detection

The previous KernelDied fix relied on the iopub read loop exiting when
the kernel dies. However, zeromq auto-reconnects instead of returning
an error, so the iopub loop never exits.

This adds two complementary detection mechanisms:

1. Process watcher task - spawns alongside the kernel and calls
   process.wait(). When the process exits (e.g., os._exit(1)), it
   immediately sends KernelDied. The process is now owned by this
   task rather than stored in self.process.

2. Heartbeat monitor task - periodically pings the kernel's heartbeat
   channel. If the kernel becomes unresponsive (alive but hung), it
   detects this within ~8 seconds and sends KernelDied.
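
One way to picture the heartbeat monitor is as a counter of consecutive missed replies. The sketch below is illustrative only: `HeartbeatMonitor`, `run_heartbeat`, the ping interval and the miss threshold are all assumptions, and the PR does not say which interval and timeout values add up to the ~8 second figure.

```rust
use std::thread;
use std::time::Duration;

/// Counts consecutive missed heartbeat replies (hypothetical type).
struct HeartbeatMonitor {
    missed: u32,
    max_missed: u32,
}

impl HeartbeatMonitor {
    fn new(max_missed: u32) -> Self {
        // Require at least one miss so a healthy kernel is never declared dead.
        HeartbeatMonitor { missed: 0, max_missed: max_missed.max(1) }
    }

    /// Records one ping result; returns true once the kernel should be
    /// declared dead.
    fn record(&mut self, replied: bool) -> bool {
        if replied {
            self.missed = 0;
        } else {
            self.missed += 1;
        }
        self.missed >= self.max_missed
    }
}

/// Pings until the monitor gives up, then returns the number of pings sent.
/// `ping` should return false when no reply arrives within its own timeout.
fn run_heartbeat<P: FnMut() -> bool>(mut ping: P, interval: Duration, max_missed: u32) -> u32 {
    let mut monitor = HeartbeatMonitor::new(max_missed);
    let mut pings = 0;
    loop {
        pings += 1;
        if monitor.record(ping()) {
            return pings;
        }
        thread::sleep(interval);
    }
}
```

In this model the worst-case detection time is roughly the miss threshold multiplied by the interval plus the per-ping timeout. A single successful reply resets the counter, so brief stalls do not trigger a false death report.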

Also makes kernel_died() idempotent since both tasks may fire.
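
A minimal sketch of an idempotent `kernel_died()`, assuming a simplified queue model: only `executing` and the "error" status value come from the PR, while `ExecutionQueue`, `pending`, `kernel_alive` and `broadcasts` are hypothetical names used for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical model of the execution queue; field names other than
/// `executing` are assumptions.
struct ExecutionQueue {
    executing: Option<String>,
    pending: VecDeque<String>,
    kernel_alive: bool,
    /// Status values that would be broadcast to peers.
    broadcasts: Vec<&'static str>,
}

impl ExecutionQueue {
    fn new() -> Self {
        ExecutionQueue {
            executing: None,
            pending: VecDeque::new(),
            kernel_alive: true,
            broadcasts: Vec::new(),
        }
    }

    /// Resets the queue and records an "error" status broadcast.
    /// Returns true only for the call that actually performed the reset.
    fn kernel_died(&mut self) -> bool {
        if !self.kernel_alive {
            // A second reporter (watcher or heartbeat) has nothing left to do.
            return false;
        }
        self.kernel_alive = false;
        self.executing = None;
        self.pending.clear();
        // Exact value only: the frontend ignores unrecognized statuses.
        self.broadcasts.push("error");
        true
    }
}
```

With a guard like this, peers see a single "error" broadcast even when the process watcher and the heartbeat monitor both report the same death. A restart path would need to set the flag back to alive.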

The frontend only recognizes exact status values (idle, busy, error,
shutdown, etc). The previous "error: Kernel process died unexpectedly"
was being ignored because it didn't match. Changed to just "error".

Store process_watcher_task handle immediately after spawn so early
return error paths (kernel_info timeout/error) can abort it. This
prevents the kernel process from running indefinitely when launch
validation fails.
rgbkrk enabled auto-merge (squash) March 8, 2026 00:25
rgbkrk merged commit 858b62a into main Mar 8, 2026
14 checks passed
rgbkrk deleted the quill/cherry-pick-bf9322e branch March 8, 2026 00:42