Route inbound approvals by SessionId so passivated/cold gateway trees don't drop the response #979

@Aaronontheweb

Summary

Sibling to #939. Proposes a structurally different fix for the same symptom — silently-dropped approval responses — that targets the proximate cause observed in production: cascading passivation of the channel-adapter actor tree, where the per-thread binding correctly defers its own passivation but the channel-level parent passivates anyway and takes the child with it.

The proposed change makes inbound routing symmetric with outbound routing: both addressed by SessionId, both tolerant of any subset of the gateway tree being cold.

Observed incident

Reconstructed from production daemon logs of a real session. The session is parked on a pending approval (shell_execute for a git commit) and is still alive 8h later.

Trigger sequence (IDs redacted):

T+0      slack-gateway/<channelId>/<threadTs>
         "Slack thread idle but 1 approval(s) are pending; deferring passivation"
         ← per-thread child correctly defers ITS OWN timer

T+1.1s   slack-gateway/<channelId>
         "Slack conversation idle for 2 hours, passivating"
         ← CHANNEL-LEVEL parent passivates, no awareness of child's pending state

T+1.1s   session-manager/.../<threadTs>
         "[...slack-gateway/.../<threadTs>/StreamSupervisor-N/
            slack-thread-0-0-actorRefSource] left"

T+1.1s   Dead letter to slack-gateway/<channelId>/<threadTs>
         AbruptTerminationException

For the next ~5h40m the session logs session_observer_distill_skipped every 90s, unaware that its Slack output binding is gone. When the user finally clicks Approve in Slack:

T+5h40m  slack-gateway: "Routing Slack approval response for ... <callId>"
T+5h40m  slack-gateway/<channelId>: "Ignoring Slack approval response for missing thread <threadTs>"

Slack delivered the click correctly. Our re-spawned channel-level gateway had no in-memory record of the pending approval and dropped it.
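The cascade can be reproduced in a few lines. The following is a minimal Python model of the trigger sequence above, not the actual C# actors: the class names, the thread timestamp, and the call ID are placeholders, and "passivation" is modelled as simply discarding the children.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadBinding:
    """Per-thread child; pending approvals live only in memory."""
    pending_call_ids: set = field(default_factory=set)

    def should_passivate(self) -> bool:
        # The child defers its OWN passivation while approvals are pending.
        return not self.pending_call_ids


@dataclass
class ChannelGateway:
    """Channel-level parent; stopping it stops every child."""
    children: dict = field(default_factory=dict)  # thread_ts -> ThreadBinding

    def on_idle_timeout(self) -> None:
        # The bug: the parent never looks at its children's pending state.
        self.children.clear()

    def route_approval(self, thread_ts: str, call_id: str) -> str:
        child = self.children.get(thread_ts)
        if child is None:
            return f"ignored: missing thread {thread_ts}"
        child.pending_call_ids.discard(call_id)
        return "delivered"


gateway = ChannelGateway()
gateway.children["111.222"] = ThreadBinding({"call-1"})

child_passivates = gateway.children["111.222"].should_passivate()  # child defers
gateway.on_idle_timeout()                                          # parent does not
outcome = gateway.route_approval("111.222", "call-1")
```

The child's deferral is correct in isolation; the response is lost only because the lookup that routes it lives in state the parent is free to throw away.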

Why this isn't fully covered by #939

  • The fix path in #939 ("Approval prompts become permanently stuck after daemon restart") is "persist pending approvals, resurrect binding actors on demand, re-post UI on recovery." That works but bundles the routing fix with persistence work.
  • The proximate cause here is purely a cold-actor-tree problem within a single still-running daemon. The session actor is alive and addressable at a deterministic path the entire time. No persistence is required to deliver the response; only routing is broken.
  • #939 also doesn't name the cascading-passivation bug: the per-thread "Slack thread idle but N approval(s) are pending; deferring passivation" check only blocks the child's own timer, not the channel-level parent's idle timer.

Architectural asymmetry being fixed

| Direction | Today |
|---|---|
| Outbound (LLM → user) | Session actor holds the destination address; Akka.NET lazily re-spawns whatever's needed at that path. Survives passivation. |
| Inbound (user → LLM) | Channel-level gateway consults an in-memory Dictionary<callId, perThreadChild>. Dictionary lost on passivation → response dropped. |

The Slack block_actions payload + our existing button-value codec already carry everything needed to route inbound by SessionId:

  • channel.id → channel
  • message.thread_ts → thread
  • ApprovalButtonValueCodec decode → callId, optionKey, requesterSenderId
  • user.id → approver senderId (for CanApprove)
  • message.ts and response_url → for the redraw

That tuple deterministically resolves to session-manager/{persistenceId} + the specific pending call. No gateway-internal lookup is required.
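As an illustration of that mapping, here is a minimal Python sketch (the project itself is C#). The payload keys are the ones listed above plus actions[0].value for the button value. Everything else is an assumption made for the example: the InboundApproval type, the "|"-separated toy codec standing in for ApprovalButtonValueCodec, and the persistence_id_for function, which is injected precisely because the real persistenceId format is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class InboundApproval:  # hypothetical name, for illustration only
    session_address: str
    call_id: str
    option_key: str
    requester_sender_id: str
    approving_sender_id: str
    channel: str
    message_ts: str
    response_url: str


def to_inbound_approval(
    payload: dict,
    decode_value: Callable[[str], Tuple[str, str, str]],
    persistence_id_for: Callable[[str, str], str],
) -> InboundApproval:
    """Derive the full routing tuple from the payload alone (no gateway state)."""
    channel_id = payload["channel"]["id"]
    message = payload["message"]
    call_id, option_key, requester = decode_value(payload["actions"][0]["value"])
    persistence_id = persistence_id_for(channel_id, message["thread_ts"])
    return InboundApproval(
        session_address=f"session-manager/{persistence_id}",
        call_id=call_id,
        option_key=option_key,
        requester_sender_id=requester,
        approving_sender_id=payload["user"]["id"],
        channel=channel_id,
        message_ts=message["ts"],
        response_url=payload["response_url"],
    )


payload = {
    "channel": {"id": "C123"},
    "user": {"id": "U2"},
    "message": {"ts": "111.333", "thread_ts": "111.222"},
    "response_url": "https://example.invalid/response",
    "actions": [{"value": "call-1|approve|U1"}],
}
approval = to_inbound_approval(
    payload,
    decode_value=lambda v: tuple(v.split("|")),      # toy codec, NOT the real one
    persistence_id_for=lambda c, t: f"{c}/{t}",      # placeholder format
)
```

Nothing in the function reads gateway-held state, which is the property the proposal depends on.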

(Note: ApprovalButtonValueCodec.MaxEncodedLength = 100 is correct as-is — Discord's custom_id cap is 100 — so the decode side must continue tolerating prefix-match against the pending-call set in scope. Under this proposal the match is scoped to a single session, so prefix-match is trivially safe.)
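A sketch of what session-scoped prefix matching could look like, in Python for brevity. The function name and the choice to reject ambiguous prefixes are assumptions for illustration; the existing decode logic may behave differently.

```python
from typing import Iterable, Optional


def resolve_call_id(decoded_prefix: str, pending_call_ids: Iterable[str]) -> Optional[str]:
    """Match a possibly truncated callId against ONE session's pending calls."""
    pending = list(pending_call_ids)
    if decoded_prefix in pending:
        return decoded_prefix  # exact match wins
    matches = [cid for cid in pending if cid.startswith(decoded_prefix)]
    # Refuse to guess if the prefix is ambiguous within the session.
    return matches[0] if len(matches) == 1 else None


session_pending = {"call-abcdef123456", "call-zzz999"}
hit = resolve_call_id("call-abc", session_pending)
miss = resolve_call_id("call-qqq", session_pending)
ambiguous = resolve_call_id("call-", session_pending)
```

With the candidate set limited to a single session's pending calls, an ambiguous prefix requires two concurrently pending calls in that session whose IDs share the entire truncated prefix.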

Proposed change

Move the routing boundary from the channel-level gateway to the session actor. Symmetric for Slack and Discord:

  1. Slack ingress (SlackConversationActor): on a block_actions payload with an approval action_id, decode the button value, build a self-contained ApprovalResponseReceived(sessionId, callId, optionKey, approvingSenderId, channel, messageTs, responseUrl) and Tell session-manager/{persistenceId} directly. Stop consulting the per-thread child for routing.
  2. Discord ingress (DiscordConversationActor): same shape, decode custom_id, build the same self-contained command, Tell the session.
  3. Session actor: on ApprovalResponseReceived, run CanApprove(requesterPrincipal, requesterSenderId, approvingSenderId) (it already owns this state), resolve the pending tool call, then send a self-contained RenderResolvedApproval(channel, messageTs, request, selectedKey, senderId) to the slack/discord output binding. Lazy spawn is fine — the binding doesn't need any prior in-memory state for the redraw because BuildResolvedApprovalBlocks(...) is pure.
  4. Per-thread bindings: drop _pendingApprovalRequests as a routing dependency. It can stay as a hint for "should I defer my own passivation?" but is no longer load-bearing for delivery.
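The session-side half of steps 1 to 4 can be sketched as follows. This is a Python model rather than the C# actor: the message field lists follow the signatures given above, while SessionModel, PendingCall, the outbox list (standing in for a Tell to the output binding), and the toy can_approve policy are all assumptions made for the example. Which sender ID RenderResolvedApproval should carry is not specified above; the sketch assumes the approver's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ApprovalResponseReceived:
    session_id: str
    call_id: str
    option_key: str
    approving_sender_id: str
    channel: str
    message_ts: str
    response_url: str


@dataclass(frozen=True)
class RenderResolvedApproval:
    channel: str
    message_ts: str
    request: str
    selected_key: str
    sender_id: str


@dataclass(frozen=True)
class PendingCall:  # hypothetical shape of the session's pending-call state
    request: str
    requester_principal: str
    requester_sender_id: str


class SessionModel:
    def __init__(self, can_approve: Callable[[str, str, str], bool]) -> None:
        self.can_approve = can_approve
        self.pending: Dict[str, PendingCall] = {}
        self.outbox: List[RenderResolvedApproval] = []  # stands in for Tell

    def on_approval_response(self, msg: ApprovalResponseReceived) -> bool:
        call = self.pending.get(msg.call_id)
        if call is None:
            return False  # unknown or already resolved
        if not self.can_approve(call.requester_principal,
                                call.requester_sender_id,
                                msg.approving_sender_id):
            return False
        del self.pending[msg.call_id]
        # Self-contained render command: the binding needs no prior state.
        self.outbox.append(RenderResolvedApproval(
            msg.channel, msg.message_ts, call.request,
            msg.option_key, msg.approving_sender_id))  # ASSUMPTION: approver's ID
        return True


# Toy policy: only the requester may approve. Not the real CanApprove rules.
session = SessionModel(can_approve=lambda principal, requester, approver: requester == approver)
session.pending["call-1"] = PendingCall("shell_execute: git commit", "user", "U1")
msg = ApprovalResponseReceived("s-1", "call-1", "approve", "U1",
                               "C1", "111.333", "https://example.invalid/response")
handled = session.on_approval_response(msg)
replayed = session.on_approval_response(msg)
```

Because the pending call is removed on first resolution, a duplicate click is ignored rather than resolved twice.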

Near-term mitigation (cheap, narrow)

If we want to reduce blast radius before the architectural change lands: have the channel-level slack-gateway/{channelId} (and Discord equivalent) consult its children's pending-approval state before passivating itself. Closes the specific cascading bug observed here without touching routing. Doesn't help cold-restart cases.
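A minimal model of that guard, in Python rather than C#. The names are placeholders, and the parent reads its children's state synchronously here; in an actor system the children would presumably have to report their pending counts to the parent by message, which this sketch does not model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ThreadBinding:
    pending_call_ids: Set[str] = field(default_factory=set)


@dataclass
class ChannelGateway:
    children: Dict[str, ThreadBinding] = field(default_factory=dict)

    def on_idle_timeout(self) -> bool:
        """Return True if the gateway passivated, False if it deferred."""
        if any(child.pending_call_ids for child in self.children.values()):
            return False  # at least one child is holding a pending approval
        self.children.clear()
        return True


gateway = ChannelGateway({"111.222": ThreadBinding({"call-1"})})
deferred = not gateway.on_idle_timeout()

# Once the approval resolves, the next idle timeout can proceed.
gateway.children["111.222"].pending_call_ids.clear()
passivated = gateway.on_idle_timeout()
```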

Relationship to #939

Complementary, not redundant.

Land this first → #939's Phase 2 ("resurrect binding on missing child") becomes unnecessary, and #939 narrows to its actual contribution: persistence of the TaskCompletionSource-equivalent state across process restart. The two together produce the full correctness guarantee.

Affected files (initial scan)

| File | Change |
|---|---|
| src/Netclaw.Channels.Slack/SlackConversationActor.cs | Decode + route by SessionId; remove per-thread-child routing dependency for approvals |
| src/Netclaw.Channels.Slack/SlackThreadBindingActor.cs | Drop _pendingApprovalRequests as routing dependency; keep as passivation-deferral hint |
| src/Netclaw.Channels.Discord/DiscordConversationActor.cs | Symmetric to Slack |
| src/Netclaw.Channels.Discord/DiscordSessionBindingActor.cs | Symmetric to Slack |
| src/Netclaw.Actors/Sessions/LlmSessionActor.cs | Handle ApprovalResponseReceived; emit RenderResolvedApproval to output binding |
| src/Netclaw.Actors/Protocol/* | New self-contained messages for inbound approval response and resolved-render command |

Severity

Same as #939: production correctness. Sessions get permanently wedged with no UI signal that the click failed. Passivation happens routinely (channel-level 2h idle), so this fires far more often than full daemon restart.


Labels: bug (Something isn't working); channels (Discord, Slack, and other channels.); reliability (Retries, resilience, graceful degradation); sessions (LLM session actor, turn lifecycle, pipelines)
