refactor(runtimed): reduce relay to pure byte pipe — two-peer Automerge sync (#600) #608

Merged: rgbkrk merged 3 commits into main from pure-relay on Mar 8, 2026
@rgbkrk rgbkrk commented Mar 8, 2026

Closes #600.

The Tauri relay (NotebookSyncClient) was a full Automerge peer — it maintained its own AutoCommit doc, merged sync messages from both frontend and daemon, and generated new sync messages for each peer. This created a three-peer topology (frontend ↔ relay ↔ daemon) that was the root cause of 6+ correctness findings from the protocol audit.

What changed

Step 1: Daemon requests for doc bytes and metadata

New NotebookRequest variants — GetDocBytes, GetRawMetadata { key }, SetRawMetadata { key, value } — let the Tauri app read/write the daemon's canonical Automerge doc directly via the request/response protocol, without going through the relay's local replica.
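
As a rough illustration of the request/response shape described above, here is a minimal sketch. The variant names come from this PR; the payload types (`String`, `Vec<u8>`, `Option<String>`) and the `FakeDaemon` struct are assumptions made for the example, with a `HashMap` standing in for the daemon's canonical Automerge doc.

```rust
use std::collections::HashMap;

/// Request variants named in this PR. Field types are assumed for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum NotebookRequest {
    GetDocBytes,
    GetRawMetadata { key: String },
    SetRawMetadata { key: String, value: String },
}

/// Matching responses. Field types are assumed for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum NotebookResponse {
    DocBytes { bytes: Vec<u8> },
    RawMetadata { value: Option<String> },
    MetadataSet,
}

/// Hypothetical stand-in for the daemon: a byte buffer and a map replace
/// the real Automerge document.
#[derive(Default)]
pub struct FakeDaemon {
    pub doc_bytes: Vec<u8>,
    pub metadata: HashMap<String, String>,
}

impl FakeDaemon {
    /// Answers each request from the daemon-side state only, so the caller
    /// never needs a local replica.
    pub fn handle(&mut self, req: NotebookRequest) -> NotebookResponse {
        match req {
            NotebookRequest::GetDocBytes => NotebookResponse::DocBytes {
                bytes: self.doc_bytes.clone(),
            },
            NotebookRequest::GetRawMetadata { key } => NotebookResponse::RawMetadata {
                value: self.metadata.get(&key).cloned(),
            },
            NotebookRequest::SetRawMetadata { key, value } => {
                self.metadata.insert(key, value);
                NotebookResponse::MetadataSet
            }
        }
    }
}
```

In this sketch a `SetRawMetadata` followed by a `GetRawMetadata` for the same key returns the value just written, because both requests hit the same daemon-side state.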

Step 2: Reroute Tauri callers

get_automerge_doc_bytes, get_metadata_snapshot, set_metadata_snapshot, get_raw_metadata_additional, set_raw_trust_in_metadata — all now use send_request() instead of relay-local get_metadata()/set_metadata()/get_doc_bytes().

Step 3: Byte pipe in run_sync_task

When raw_sync_tx is present (Tauri mode), the relay is now a transparent byte forwarder:

| Operation | Before (3-peer) | After (2-peer pipe) |
|---|---|---|
| Frontend → daemon sync | decode → merge into relay doc → sync_to_daemon() → re-encode | forward raw bytes as AutomergeSync frame |
| Daemon → frontend sync | merge into relay doc → generate_sync_message(fe_state) | forward raw AutomergeSync frame payload |
| GetDocBytes | serialize relay's AutoCommit + virtual sync handshake | skipped in pipe mode (frontend uses daemon request) |
| Metadata diffing | read from relay doc, diff, emit SyncUpdate | skipped (frontend WASM drives metadata via useSyncExternalStore) |

runtimed-py retains the full peer mode (no raw_sync_tx) — unchanged behavior.
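
The mode switch can be pictured with the following sketch. It is not the code in run_sync_task: only the `raw_sync_tx` name is taken from this PR, while the `Relay` struct, the `Routed` enum, and the use of a std `mpsc` channel are assumptions chosen to keep the example self-contained.

```rust
use std::sync::mpsc::Sender;

/// Outcome of handling one daemon-to-frontend sync payload (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Routed {
    /// Pipe mode: the bytes were handed to the frontend untouched.
    Forwarded,
    /// Pipe mode, but the frontend receiver has gone away.
    FrontendGone,
    /// Full peer mode: the caller still has to merge into its local doc
    /// (the merge itself is not modelled here).
    NeedsMerge(Vec<u8>),
}

pub struct Relay {
    /// Present only in Tauri (pipe) mode, mirroring `raw_sync_tx` in the PR.
    pub raw_sync_tx: Option<Sender<Vec<u8>>>,
}

impl Relay {
    /// Routes a raw AutomergeSync payload according to the relay mode.
    pub fn on_daemon_sync(&self, payload: Vec<u8>) -> Routed {
        match &self.raw_sync_tx {
            // Pipe mode: no decode, no merge, no re-encode.
            Some(tx) => match tx.send(payload) {
                Ok(()) => Routed::Forwarded,
                Err(_) => Routed::FrontendGone,
            },
            // Full peer mode (the runtimed-py path): keep the old behaviour.
            None => Routed::NeedsMerge(payload),
        }
    }
}
```

Keying the behaviour on whether `raw_sync_tx` is present means the runtimed-py path needs no changes, since it never supplies that sender.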

Audit findings resolved

For the Tauri path, these protocol correctness findings are eliminated:

  • sync_to_daemon() drops broadcast frames during ack wait — relay no longer calls sync_to_daemon()
  • Triple-merge divergence risk — two peers (frontend ↔ daemon) instead of three
  • Virtual sync handshake 10-iteration limit — no virtual handshake in pipe mode
  • try_send drops sync updates silently — no SyncUpdate/changes_tx in pipe mode
  • receive_and_relay_sync_message uses wrong peer state — no relay peer state in pipe mode
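
To make the try_send finding concrete, here is a small sketch of how a bounded channel loses messages when the sender ignores a full queue. It uses the standard library's `sync_channel` as a stand-in; the channel type actually used for changes_tx is not shown in this PR, and `count_dropped` is a hypothetical helper.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes `updates` messages into a bounded channel of size `capacity`
/// while nothing drains it, and returns how many were rejected as Full.
/// A caller that ignores the error loses exactly those messages.
pub fn count_dropped(capacity: usize, updates: usize) -> usize {
    // `_rx` is kept alive so the failure mode is Full, not Disconnected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut dropped = 0;
    for i in 0..updates {
        if let Err(TrySendError::Full(_)) = tx.try_send(i) {
            dropped += 1;
        }
    }
    dropped
}
```

With a capacity of 2 and 5 updates, 3 are rejected. In pipe mode there is no SyncUpdate channel on this path, so this class of loss cannot occur there.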

Handle methods still used from Tauri

| Method | Purpose |
|---|---|
| send_request() | Daemon RPC (save, execute, kernel lifecycle, etc.) |
| receive_frontend_sync_message() | Raw byte forward to daemon |
| notebook_id() | ID accessor |

No relay-local doc methods (get_metadata, set_metadata, get_doc_bytes, get_cells, etc.) are called from lib.rs.

Test plan

  • cargo test -p runtimed --lib — 234 passed
  • cargo test -p runtimed --test '*' — 15 integration tests passed
  • cargo test -p notebook --lib — 137 passed
  • Open existing .ipynb — cells load, kernel auto-launches
  • Cmd-N new notebook — kernel launches, execute works
  • Save-as — reconnects to new room
  • Session restore — windows restored
  • Trust approval — prompt appears for untrusted notebooks
  • Metadata writes (add dependency) — syncs to daemon and persists

rgbkrk added 3 commits March 7, 2026 22:19
…n requests

New NotebookRequest/NotebookResponse variants for reading the daemon's
canonical Automerge doc without going through the relay's local replica:

- GetDocBytes → DocBytes { bytes } — full doc for WASM bootstrap
- GetRawMetadata { key } → RawMetadata { value } — metadata JSON read
- SetRawMetadata { key, value } → MetadataSet — metadata JSON write + sync

SetRawMetadata notifies peers via changed_tx and persists via persist_tx.

These enable the Tauri relay to stop maintaining its own AutoCommit doc
for metadata and bootstrap operations (#600).

All metadata operations (get_metadata_snapshot, set_metadata_snapshot,
get_raw_metadata_additional, set_raw_trust_in_metadata) now use
send_request(GetRawMetadata/SetRawMetadata) instead of the relay's
local handle.get_metadata/set_metadata.

get_automerge_doc_bytes now uses send_request(GetDocBytes) to fetch
from the daemon's canonical doc instead of the relay's local replica.

The relay's local AutoCommit is no longer read or written for metadata
or doc bootstrap from the Tauri app. Remaining relay doc usage:
- receive_frontend_sync_message (merge + forward)
- SyncUpdate metadata diffing (changes_tx receiver)

These are addressed in subsequent commits.

When raw_sync_tx is present (Tauri app), the relay now acts as a
transparent byte pipe between frontend WASM and daemon:

- ReceiveFrontendSyncMessage: forwards raw sync bytes to daemon
  socket instead of decode → merge → re-encode → sync_to_daemon.
  The daemon processes the sync message and sends back a response
  frame, which arrives in the socket read branch.

- Daemon→frontend AutomergeSync frames: forwarded raw via
  raw_sync_tx instead of merge → generate_sync_message(fe_state).
  No local doc merge, no metadata diffing, no SyncUpdate.

- GetDocBytes: skips virtual sync handshake in pipe mode. The
  frontend gets doc bytes from the daemon via send_request(GetDocBytes),
  not from this command. Handshake preserved for runtimed-py.

This changes the sync topology from three Automerge peers
(frontend ↔ relay ↔ daemon) to two (frontend ↔ daemon) with
the relay as a transparent forwarder.

Eliminates these audit findings for the Tauri path:
- sync_to_daemon() broadcast dropping (relay doesn't call it)
- Triple-merge divergence risk (two peers now)
- Virtual sync handshake 10-iteration limit
- try_send drops on changes_tx (no SyncUpdate in pipe mode)
- biased select starvation (no merge delays)
- receive_and_relay_sync_message wrong peer state

runtimed-py retains the full peer mode unchanged.
@rgbkrk rgbkrk marked this pull request as ready for review March 8, 2026 06:38
@rgbkrk rgbkrk enabled auto-merge (squash) March 8, 2026 06:44
@rgbkrk rgbkrk merged commit 1973188 into main Mar 8, 2026
14 checks passed
@rgbkrk rgbkrk deleted the pure-relay branch March 8, 2026 06:46
rgbkrk added a commit that referenced this pull request Mar 8, 2026
In pipe mode (#608), the Automerge sync path doesn't deliver output
changes — the daemon's sync state tracks the relay peer, not the WASM
peer, so all sync frames arrive with changed=false. materializeCells
never runs after execution, and outputs never render.

Re-enable the broadcast-driven output path (appendOutput via onOutput
callback). The broadcast pipeline works correctly — outputs arrive,
blob manifests resolve, and the external store updates.

No duplicate risk: since sync frames have changed=false,
materializeCells doesn't run after execution, so there's only one
source of output updates (broadcasts).

The proper fix is to align the sync states so the daemon talks
directly to the WASM through the pipe (skip do_initial_sync in
pipe mode). Tracked as a follow-up.
rgbkrk added a commit that referenced this pull request Mar 8, 2026
…ing frames (#613) (#616)

* fix(runtimed): prevent stream corruption in pipe mode (#613)

In pipe mode, ReceiveFrontendSyncMessage was writing sync frames
directly to client.stream inside the select! command handler. If the
daemon was sending data at the same time, the select! would drop the
pending socket read future, then the command handler would write to the
stream, corrupting the framing. The daemon would then read payload bytes
as a length prefix, producing bogus frame sizes (observed: 1.15 GB).

Fix: buffer outgoing pipe frames in a VecDeque and flush them at the top
of the loop BEFORE entering select!. This ensures writes only happen when
no read is pending on the socket. The queue is drained synchronously
before the next select! iteration.

Full peer mode (runtimed-py) is unaffected — its writes go through
sync_to_daemon() which owns the read/write sequence.
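
The buffering approach can be sketched as follows. This is an illustration under assumptions, not the runtimed code: the 4-byte big-endian length prefix, the `Outbox` type, and the helper names are hypothetical, and a plain `Write` target stands in for the daemon socket.

```rust
use std::collections::VecDeque;
use std::io::{self, Read, Write};

/// Writes one frame as a length prefix followed by the payload.
/// The 4-byte big-endian prefix is an assumption for this sketch.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

/// Reads one frame. If the stream position is off by even a few bytes,
/// payload bytes get interpreted as the length here, which is how bogus
/// frame sizes arise.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut payload = vec![0u8; u32::from_be_bytes(len) as usize];
    r.read_exact(&mut payload)?;
    Ok(payload)
}

/// Outgoing frames queued by the command handler instead of being written
/// in the middle of a select! iteration.
#[derive(Default)]
pub struct Outbox {
    queue: VecDeque<Vec<u8>>,
}

impl Outbox {
    /// Called from the command handler: records the frame, writes nothing.
    pub fn enqueue(&mut self, payload: Vec<u8>) {
        self.queue.push_back(payload);
    }

    /// Called at the top of the loop, before waiting on the socket again.
    /// Returns the number of frames written.
    pub fn flush<W: Write>(&mut self, w: &mut W) -> io::Result<usize> {
        let mut written = 0;
        while let Some(payload) = self.queue.pop_front() {
            write_frame(w, &payload)?;
            written += 1;
        }
        Ok(written)
    }
}
```

Separating `enqueue` from `flush` keeps every write at one well-defined point in the loop, which is the property the fix above relies on.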

* fix(notebook): re-enable broadcast-driven output rendering for pipe mode

In pipe mode (#608), the Automerge sync path doesn't deliver output
changes — the daemon's sync state tracks the relay peer, not the WASM
peer, so all sync frames arrive with changed=false. materializeCells
never runs after execution, and outputs never render.

Re-enable the broadcast-driven output path (appendOutput via onOutput
callback). The broadcast pipeline works correctly — outputs arrive,
blob manifests resolve, and the external store updates.

No duplicate risk: since sync frames have changed=false,
materializeCells doesn't run after execution, so there's only one
source of output updates (broadcasts).

The proper fix is to align the sync states so the daemon talks
directly to the WASM through the pipe (skip do_initial_sync in
pipe mode). Tracked as a follow-up.
Development

Successfully merging this pull request may close these issues.

refactor: reduce NotebookSyncClient to pure relay — drop AutoCommit
