Summary
Sibling to #939. Proposes a structurally different fix for the same symptom — silently-dropped approval responses — that targets the proximate cause observed in production: cascading passivation of the channel-adapter actor tree, where the per-thread binding correctly defers its own passivation but the channel-level parent passivates anyway and takes the child with it.
The proposed change makes inbound routing symmetric with outbound routing: both addressed by SessionId, both tolerant of any subset of the gateway tree being cold.
Observed incident
Real session reproduced from production daemon logs. The session is parked on a pending approval (shell_execute for a git commit) and is still alive 8h later.
Trigger sequence (IDs redacted):
T+0 slack-gateway/<channelId>/<threadTs>
"Slack thread idle but 1 approval(s) are pending; deferring passivation"
← per-thread child correctly defers ITS OWN timer
T+1.1s slack-gateway/<channelId>
"Slack conversation idle for 2 hours, passivating"
← CHANNEL-LEVEL parent passivates, no awareness of child's pending state
T+1.1s session-manager/.../<threadTs>
"[...slack-gateway/.../<threadTs>/StreamSupervisor-N/
slack-thread-0-0-actorRefSource] left"
T+1.1s Dead letter to slack-gateway/<channelId>/<threadTs>
AbruptTerminationException
For the next ~5h40m the session ticks session_observer_distill_skipped every 90s, oblivious that its slack output binding is gone. When the user finally clicks Approve in Slack:
T+5h40m slack-gateway: "Routing Slack approval response for ... <callId>"
T+5h40m slack-gateway/<channelId>: "Ignoring Slack approval response for missing thread <threadTs>"
Slack delivered the click correctly. Our re-spawned channel-level gateway had no in-memory record of the pending approval and dropped it.
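The failure sequence above can be reproduced in a few lines. This is a minimal Python sketch, not the project's Akka.NET code: the class and method names (`ThreadBinding`, `ChannelGateway`, `passivate`, `on_approval_click`) are hypothetical, and passivation plus re-spawn is modeled simply as clearing in-memory state.

```python
class ThreadBinding:
    """Per-thread child: tracks pending approvals in memory only."""

    def __init__(self):
        self.pending = {}  # call_id -> approval request

    def should_passivate(self):
        # The child defers its own idle timer while approvals are pending.
        return not self.pending


class ChannelGateway:
    """Channel-level parent: owns the children and a call_id -> thread map."""

    def __init__(self):
        self.children = {}       # thread_ts -> ThreadBinding
        self.route_by_call = {}  # call_id -> thread_ts (in-memory only)

    def request_approval(self, thread_ts, call_id):
        child = self.children.setdefault(thread_ts, ThreadBinding())
        child.pending[call_id] = "pending"
        self.route_by_call[call_id] = thread_ts

    def passivate(self):
        # The parent's idle timer never consults the children. Stopping the
        # parent stops every child, and the re-spawned parent starts empty.
        self.children.clear()
        self.route_by_call.clear()

    def on_approval_click(self, call_id):
        thread_ts = self.route_by_call.get(call_id)
        if thread_ts is None:
            return "ignored: missing thread"
        del self.children[thread_ts].pending[call_id]
        return "delivered"


gateway = ChannelGateway()
gateway.request_approval("thread-1", "call-1")
child_defers = not gateway.children["thread-1"].should_passivate()
gateway.passivate()  # the parent's idle timer fires regardless
outcome = gateway.on_approval_click("call-1")
```

In this model the child does the right thing (`child_defers` is true) and the click is still dropped, because the only record of the pending call lived in state the parent threw away.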
Why this isn't fully covered by #939
The proximate cause here is purely a cold-actor-tree problem within a single still-running daemon. The session actor is alive and addressable at a deterministic path the entire time. No persistence is required to deliver the response; only routing is broken.
#939 (Approval prompts become permanently stuck after daemon restart) also doesn't name the cascading-passivation bug: the per-thread "Slack thread idle but N approval(s) are pending; deferring passivation" path only blocks the child's own timer, not the channel-level parent's idle timer.
Architectural asymmetry being fixed
| Direction | Today |
| --- | --- |
| Outbound (LLM → user) | Session actor holds the destination address; Akka.NET lazily re-spawns whatever's needed at that path. Survives passivation. |
| Inbound (user → LLM) | Channel-level gateway consults an in-memory Dictionary<callId, perThreadChild>. Dictionary lost on passivation → response dropped. |
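The Slack block_actions payload + our existing button-value codec already carry everything needed to route inbound by SessionId:
channel.id → channel
message.thread_ts → thread
ApprovalButtonValueCodec decode → callId, optionKey, requesterSenderId
user.id → approver senderId (for CanApprove)
message.ts and response_url → for the redraw
That tuple deterministically resolves to session-manager/{persistenceId} + the specific pending call. No gateway-internal lookup is required.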
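As a rough illustration of how that tuple could be assembled from a block_actions payload, here is a Python sketch. Two parts are assumptions rather than facts about the codebase: the `|`-separated button-value format stands in for whatever ApprovalButtonValueCodec actually emits, and the `slack/{channel}/{thread_ts}` persistenceId layout is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundApproval:
    channel: str
    thread_ts: str
    call_id: str
    option_key: str
    requester_sender_id: str
    approving_sender_id: str
    message_ts: str
    response_url: str


def decode_button_value(value):
    # Hypothetical wire format; the real ApprovalButtonValueCodec may differ.
    call_id, option_key, requester_sender_id = value.split("|")
    return call_id, option_key, requester_sender_id


def route_block_actions(payload):
    """Build the routing tuple purely from the payload, with no gateway lookup."""
    call_id, option_key, requester = decode_button_value(
        payload["actions"][0]["value"]
    )
    approval = InboundApproval(
        channel=payload["channel"]["id"],
        thread_ts=payload["message"]["thread_ts"],
        call_id=call_id,
        option_key=option_key,
        requester_sender_id=requester,
        approving_sender_id=payload["user"]["id"],
        message_ts=payload["message"]["ts"],
        response_url=payload["response_url"],
    )
    # Hypothetical persistenceId layout, used only to show determinism.
    persistence_id = f"slack/{approval.channel}/{approval.thread_ts}"
    return f"session-manager/{persistence_id}", approval


sample_payload = {
    "channel": {"id": "C123"},
    "user": {"id": "U_APPROVER"},
    "message": {"ts": "1700000100.000200", "thread_ts": "1700000000.000100"},
    "response_url": "https://example.invalid/response",
    "actions": [{"action_id": "approval", "value": "call-42|approve|U_REQUESTER"}],
}
address, approval = route_block_actions(sample_payload)
```

Nothing in `route_block_actions` reads gateway state, so the same address comes out whether the gateway tree is warm or freshly re-spawned.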
(Note: ApprovalButtonValueCodec.MaxEncodedLength = 100 is correct as-is — Discord's custom_id cap is 100 — so the decode side must continue tolerating prefix-match against the pending-call set in scope. Under this proposal the match is scoped to a single session, so prefix-match is trivially safe.)
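A sketch of what session-scoped prefix matching might look like, assuming the decoded callId can be a truncated prefix of the full one. The function name `resolve_call_id` is hypothetical, and returning `None` on an ambiguous or missing match is a design choice of this example, not something the source specifies.

```python
def resolve_call_id(decoded_prefix, pending_call_ids):
    """Match a possibly truncated call id against one session's pending calls.

    Returns the full call id when exactly one pending call starts with the
    decoded prefix, otherwise None (no match, or an ambiguous match).
    """
    matches = [c for c in pending_call_ids if c.startswith(decoded_prefix)]
    return matches[0] if len(matches) == 1 else None


session_pending = {"call-aaaa-1111", "call-bbbb-2222"}
unique = resolve_call_id("call-aaaa", session_pending)
ambiguous = resolve_call_id("call-", session_pending)
missing = resolve_call_id("call-cccc", session_pending)
```

Because the candidate set is limited to one session's pending calls, a collision requires two simultaneously pending calls in the same session that share the truncated prefix.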
Proposed change
Move the routing boundary from the channel-level gateway to the session actor. Symmetric for Slack and Discord:
Slack ingress (SlackConversationActor): on a block_actions payload with an approval action_id, decode the button value, build a self-contained ApprovalResponseReceived(sessionId, callId, optionKey, approvingSenderId, channel, messageTs, responseUrl), and Tell session-manager/{persistenceId} directly. Stop consulting the per-thread child for routing.
Discord ingress (DiscordConversationActor): same shape, decode custom_id, build the same self-contained command, Tell the session.
Session actor: on ApprovalResponseReceived, run CanApprove(requesterPrincipal, requesterSenderId, approvingSenderId) (it already owns this state), resolve the pending tool call, then send a self-contained RenderResolvedApproval(channel, messageTs, request, selectedKey, senderId) to the slack/discord output binding. Lazy spawn is fine — the binding doesn't need any prior in-memory state for the redraw because BuildResolvedApprovalBlocks(...) is pure.
Per-thread bindings: drop _pendingApprovalRequests as a routing dependency. It can stay as a hint for "should I defer my own passivation?" but is no longer load-bearing for delivery.
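The session-side handling described in the steps above can be sketched as follows. This is illustrative Python, not the project's C#: the message field names mirror the proposal, but the `SessionActor` class, the `send_to_binding` callback, and the "only the requester may approve" policy inside `can_approve` are placeholders for the real CanApprove logic and output-binding plumbing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalResponseReceived:
    session_id: str
    call_id: str
    option_key: str
    approving_sender_id: str
    channel: str
    message_ts: str
    response_url: str


@dataclass(frozen=True)
class RenderResolvedApproval:
    channel: str
    message_ts: str
    request: str
    selected_key: str
    sender_id: str


class SessionActor:
    def __init__(self, requester_sender_id, send_to_binding):
        self.requester_sender_id = requester_sender_id
        self.send_to_binding = send_to_binding  # lazily spawns the binding
        self.pending = {}  # call_id -> request description

    def can_approve(self, approving_sender_id):
        # Placeholder policy: only the original requester may approve.
        return approving_sender_id == self.requester_sender_id

    def on_approval_response(self, msg):
        request = self.pending.get(msg.call_id)
        if request is None or not self.can_approve(msg.approving_sender_id):
            return False
        del self.pending[msg.call_id]  # resolve the pending tool call
        # The redraw command carries everything the binding needs.
        self.send_to_binding(RenderResolvedApproval(
            channel=msg.channel,
            message_ts=msg.message_ts,
            request=request,
            selected_key=msg.option_key,
            sender_id=msg.approving_sender_id,
        ))
        return True


outbox = []
session = SessionActor("U1", outbox.append)
session.pending["call-42"] = "shell_execute: git commit"
handled = session.on_approval_response(ApprovalResponseReceived(
    "session-1", "call-42", "approve", "U1",
    "C123", "1700000100.000200", "https://example.invalid/response",
))
```

The point of the shape is that both messages are self-contained: the session needs no gateway state to resolve the call, and the binding needs no prior state to redraw the message.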
Near-term mitigation (cheap, narrow)
If we want to reduce blast radius before the architectural change lands: have the channel-level slack-gateway/{channelId} (and Discord equivalent) consult its children's pending-approval state before passivating itself. Closes the specific cascading bug observed here without touching routing. Doesn't help cold-restart cases.
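A minimal sketch of that mitigation, again in illustrative Python with hypothetical names (`has_pending_approvals`, `on_idle_timeout`). In a real actor system the parent could not call its children synchronously; the check would more likely rely on children reporting their pending state to the parent by message.

```python
class ThreadBinding:
    def __init__(self):
        self.pending_approvals = set()

    def has_pending_approvals(self):
        return bool(self.pending_approvals)


class ChannelGateway:
    def __init__(self):
        self.children = {}  # thread_ts -> ThreadBinding
        self.passivated = False

    def on_idle_timeout(self):
        # Defer the parent's own passivation while any child is waiting
        # on an approval, instead of stopping the whole subtree.
        if any(c.has_pending_approvals() for c in self.children.values()):
            return "deferred"
        self.children.clear()
        self.passivated = True
        return "passivated"


gateway = ChannelGateway()
binding = ThreadBinding()
binding.pending_approvals.add("call-1")
gateway.children["thread-1"] = binding

first = gateway.on_idle_timeout()   # approval still pending
binding.pending_approvals.clear()   # user clicks Approve
second = gateway.on_idle_timeout()  # nothing pending any more
```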
Relationship to #939
Complementary, not redundant.
Land this first → #939's Phase 2 ("resurrect binding on missing child") becomes unnecessary, and #939 narrows to its actual contribution: persistence of the TaskCompletionSource-equivalent state across process restart. The two together produce the full correctness guarantee.
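Affected files (initial scan)
| File | Change |
| --- | --- |
| src/Netclaw.Channels.Slack/SlackConversationActor.cs | |
| src/Netclaw.Channels.Slack/SlackThreadBindingActor.cs | Drop _pendingApprovalRequests as routing dependency; keep as passivation-deferral hint |
| src/Netclaw.Channels.Discord/DiscordConversationActor.cs | |
| src/Netclaw.Channels.Discord/DiscordSessionBindingActor.cs | |
| src/Netclaw.Actors/Sessions/LlmSessionActor.cs | Handle ApprovalResponseReceived; emit RenderResolvedApproval to output binding |
| src/Netclaw.Actors/Protocol/* | New self-contained messages for inbound approval response and resolved-render command |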
Severity
Same as #939: production correctness. Sessions get permanently wedged with no UI signal that the click failed. Passivation happens routinely (channel-level 2h idle), so this fires far more often than full daemon restart.