Skip to content

Standardize channel infrastructure, add destination listing, fix lifecycle reliability (SPEC-015)#1375

Merged
Aaronontheweb merged 14 commits into
netclaw-dev:devfrom
Aaronontheweb:worktree-channel-consolidation
Jun 10, 2026
Merged

Standardize channel infrastructure, add destination listing, fix lifecycle reliability (SPEC-015)#1375
Aaronontheweb merged 14 commits into
netclaw-dev:devfrom
Aaronontheweb:worktree-channel-consolidation

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

Summary

Implements SPEC-015 (channel infrastructure standardization), adds blank-query destination discovery to lookup_channel_destination, and fixes every confirmed finding from a max-effort review of the branch — including the two remaining siblings of the gateway zombie class that #1374 hotfixed.

SPEC-015 consolidation (Phases 2–7)

Phase Change
2 Contract test bases: channel health, gateway lifecycle (virtual-time), routing policy — every channel held to the same behavioral bar; −1,200 LOC duplicated tests
3 IGatewaySnapshot + shared GatewayChannelHealth.Evaluate
4 Generic ChannelLifecycleActor<TSnapshot, TConnectCommand> — Discord 1,163→~700, Mattermost 691→280 LOC
5 RemoteChatChannelBuilder fluent registration (~80 LOC per channel, one WithServices escape hatch each)
6 Generic ChannelGatewayActor/ChannelConversationActor — security pipeline order can no longer drift between channels; Slack conversation stays concrete per the spec's risk gate
7 IChannelOutboundClient send consolidation — per-channel send tools were unreachable dead surface and are deleted; send_channel_message schema byte-identical

A new remote chat channel now costs ~5 files plus a builder chain (runbook: docs/runbooks/adding-a-channel.md).

Destination discovery

lookup_channel_destination with query omitted lists every destination the channel can deliver to: Slack derives from config (resolved default channel + allowlist, display names via conversations.info, archived excluded), Discord from the ACL-filtered guild cache, Mattermost from its allowlist. query left the required array for this tool only; lookup_channel_user still requires one (user directories are unbounded). netclaw-operations skill → 2.11.1.

Reliability fixes (review-driven, all mutation-verified)

  • Zombie class eliminated: the ready-signal timeout is now owned by the lifecycle base; the startup-path zombie (slow READY with a pending connect ask) and the dropped-retry zombie (retry scheduled into a state that couldn't receive it) collapse into one canonical fail path, and the CleanReconnectRequired parking state is deleted outright. Follow-up to Fix Discord gateway zombie state after failed auto-retry #1374, which fixed only the auto-retry leg.
  • Healthy-but-deaf window closed: ConnectionRestored publishes on every transition to Ready; channel setup is idempotent. A >35s connect can no longer orphan MessageReceived wiring forever.
  • Backoff stability window: retry backoff only resets after ≥60s of stable Ready, so a flapping server walks 5s→5min instead of hot-looping.
  • Multi-channel fetcher cross-wiring: IThreadHistoryFetcher is keyed per channel — previously last-registration-won and thread rehydration silently backfilled empty for all but one channel.
  • Mattermost proactive DMs fixed: DM session wiring validates the user ACL (Discord parity) instead of demanding the ephemeral DM channel id be allowlisted — proactive DMs previously always reported "session pipeline failed to initialize".
  • Registry-validated outbound clients: duplicate registrations now throw at boot, not mid-conversation; generic tools register via the builder so a 4th channel can't silently boot toolless.
  • Session-id grammar and proactive-send strings centralized; new proactive-outbound contract suite pins LLM-visible strings byte-for-byte across channels.

Behavioral notes for reviewers

  • Archived Slack channels no longer resolve via name search (excludeArchived: true) — they were never deliverable.
  • OpenSpec deltas added to the active standardize-channel-delivery-contracts change for the three capability specs that still mandated the deleted per-channel send tools (openspec validate passes); SPEC-015 doc updated to the as-built API.
  • Eval suite not yet run for the lookup_channel_destination schema change (needs a provider endpoint) — flagged per CLAUDE.md; recommend running before the next release tag.

Validation

Full solution suite green: Actors 2,326 / Daemon 733 / Cli 792 / Security 566 / Configuration 409 / +4 more projects. Lifecycle contract suite stable across repeated runs; every behavioral fix has a regression test that fails with the fix reverted. dotnet slopwatch analyze clean; copyright headers verified.

Consolidate per-channel test suites into shared contract bases under
Channels/Contracts/, so every channel is held to the same behavioral bar:

- ChannelHealthContractTests: IChannel.GetHealthAsync contract (3 shared
  tests) + SnapshotChannelHealthContractTests (2 more for channels with
  transport snapshots). Discord/Mattermost implement both; Slack joins
  the base only — its health model has no snapshot/detail (candidate for
  Phase 3 IGatewaySnapshot unification). Closes the Slack and Mattermost
  health coverage gaps.
- GatewayLifecycleContractTests: all 7 union lifecycle behaviors now run
  for BOTH Discord and Mattermost (previously 2 and 4 partial tests
  respectively) using virtual time via TestScheduler — no opt-outs.
  Retry-timer cleanup is mutation-tested via dead-letter detection.
- RoutingPolicyContractTests: 7 shared routing behaviors (mention gates,
  thread continuation, rehydration, DM matrix, empty content) normalized
  across all three channels. Platform-specific tests stay standalone;
  Mattermost's standalone file collapsed entirely into the contract.

Deletes the superseded standalone test files; promotes FakeSlackApiClient
and FakeHttpClientFactory to TestHelpers for reuse. Net -1,200 LOC of
duplicated test code while increasing covered behaviors.

Known issue surfaced for Phase 4: PublishConnectionRestoredAsync is
unreachable on the auto-retry path in both lifecycle actors (retry passes
ActorRefs.Nobody, so the isRetry branch never fires).
… Phase 3)

DiscordGatewaySnapshot and MattermostGatewaySnapshot both carried
(IsConnected, IsReady, HealthDetail) plus platform bot identity, and the
two channels' GetHealthAsync bodies were character-for-character
duplicates apart from operator-facing strings.

- IGatewaySnapshot in Netclaw.Channels exposes the common health surface
- GatewayChannelHealth.Evaluate centralizes ready/degraded/disconnected
  mapping with caller-supplied fallback strings, so each channel keeps
  its exact wording (zero behavioral drift)
- Both snapshot records now implement the interface, unblocking the
  generic ChannelLifecycleActor (Phase 4)
Discord (1,163 LOC) and Mattermost (691 LOC) gateway lifecycle actors
implemented the same state machine independently: Disconnected →
Connecting → Ready with CleanReconnectRequired/Disconnecting, exponential
backoff retry, snapshot queries, and PostStop timer cleanup.

ChannelLifecycleActor<TSnapshot, TConnectCommand> in Netclaw.Channels now
owns the state machine; subclasses provide transport start/stop, snapshot
factories, event subscription, failure classification, and channel
display-name strings. TConnectCommand carries each channel's connect
credentials whole, so the retry path stores the original command and the
shared GetSnapshot/Disconnect records resolve through inheritance — zero
churn at gateway client call sites or in test fixtures.

Genuine divergences stay in the subclasses: Discord's explicit READY
event wait (30s IWithTimers timeout), fatal-close classification, and
clean-reconnect-on-Connected-while-Ready; Mattermost's implicit
ready-on-start and benign Connected-while-Ready handling.

Net -1,015 LOC across the two actors (+718 documented, reusable base).
Behavior preserved exactly, including the known dormant
ConnectionRestored-never-fires-on-auto-retry issue, now marked with a
comment for a separate fix. Lifecycle/health contract tests (29) pass
unmodified across 3 consecutive runs; full suite 2,303 green.
…-015 Phase 5)

The three AddXxxChannelIntegration extensions (Slack 134 / Mattermost 137 /
Discord 113 LOC) each hand-rolled the same wiring: options binding,
descriptor creation, keyed IChannel + IHostedService registration, files
HTTP client with Netclaw headers, and enabled/disabled gating.

AddRemoteChatChannel<TChannel, TOptions> + RemoteChatChannelBuilder now
own that pattern; each channel declares its registrations as a fluent
chain (~80 LOC) with exactly one WithServices escape hatch for SDK quirks
(Mattermost's issue-netclaw-dev#1033 client fallback, Discord's socket client
intents, Slack's SlackNet/event-handler block). Descriptors always
register — even disabled — so the registry can list every channel;
misconfigured chains throw at composition time.

IRemoteChatChannelOptions gives channel-agnostic registration code the
Enabled/AllowDirectMessages surface that drives CreateRemoteChat without
knowing the concrete options type.

Registration parity was verified by diffing ServiceDescriptor lists
(legacy vs builder) per (ServiceType, ServiceKey) across enabled/
disabled/unconfigured/callback configs: Slack and Discord identical;
Mattermost's reply/outbound clients moved from factory lambdas to
equivalent type registrations. Total registration code grows slightly
(384 → 508 LOC) because the 236-LOC builder is now reusable — marginal
cost for channel netclaw-dev#4 drops to the ~80 LOC chain.

Daemon tests 721 green, Actors tests 2,303 green, slopwatch clean.
ChannelGatewayActor<TChannelId> owns the shared gateway shape: bounded
4096-entry event-id dedup, get-or-create conversation children with
Uri.EscapeDataString naming, and DeliverTrustedSessionTurn routing.
Subclasses keep their channel-specific inbound receives so telemetry
labels and log argument shapes stay byte-identical; Slack overrides
OnMissingEventId to preserve its lenient empty-event-id acceptance.

ChannelConversationActor<TMessage> owns the security pipeline in fixed
order (ACL → bot filter → ingress gate → routing policy → empty filter →
bind → forward), 2-hour passivation, stop supervision, and death-watch.
The 12 abstract members are all projections (EvaluateAcl, EventIdOf,
CreateThreadInbound, ...) — no logic hooks, so the gate ordering can no
longer drift between channels. ChannelRoutingVerdict normalizes routing
decisions while keeping log reasons and telemetry labels as separate axes.

Slack's conversation actor stays concrete per the spec's Phase 6 risk
gate: its pipeline order differs observably (routing policy before
ingress gate; thread creation before the empty-text filter), it has no
truncation or death-watch, and forcing ordering hooks into the base
would make base+subclass more complex than the 283-line concrete actor.
Slack's gateway migrated cleanly.

Migrated actors: Mattermost gateway 164→117, Discord gateway 165→116,
Slack gateway 165→119, Mattermost conversation 321→232, Discord
conversation 361→270. Contract test bases and fixtures untouched; full
suites green (Actors 2,303 / Daemon 721); routing contracts 3× stable.
…C-015 Phase 7)

The spec's kitchen-sink-schema risk turned out to be moot: since the
registration builder landed, SendSlackMessageTool / SendDiscordMessageTool /
SendMattermostMessageTool were registered as concrete types consumed only
by SendChannelMessageTool's internal switch — never IChannelTool-marked,
never visible to the LLM. Their channel-specific parameters were dead
surface, so full consolidation is free.

- IChannelOutboundClient + ChannelSendRequest in Netclaw.Channels (spec
  sketch); SendChannelMessageTool dispatches by channel_key over the
  multi-bound clients — the hardcoded ChannelType switch and dictionary
  re-marshalling are gone
- Per-channel logic moves verbatim into Slack/Discord/Mattermost
  ProactiveOutboundClient classes (423 → 371 LOC): every ACL check,
  AllowDirectMessages gate, default-channel comparison, and LLM-visible
  result/error string preserved exactly
- The new clients sit above the ISlackOutboundClient-style SDK transports
  rather than merging into them — those interfaces are the fake seam in
  tests, and ACL must live above the mocked layer
- Builder: WithSendTool → WithProactiveSendClient; duplicate channel-key
  client registrations now throw
- send_channel_message name/description/schema byte-identical → no skill
  or eval changes required

Tests: proactive-thread suites migrated to the clients (9 dead
param-parsing/alias tests dropped), dispatch-routing and missing-adapter
coverage added. Actors 2,294 / Daemon 723 green; slopwatch clean.
Answers "where can I post?" — previously the enabled-channel keys were
discoverable via the schema enum, but there was no way to enumerate the
destinations within a channel; agents had to guess names and recover from
errors.

- ChannelAddressResolutionStatus.Listed + Listed() factory: a listing is
  not an Ambiguous match — an empty candidate list is a valid answer
  ("reachable, but nothing currently deliverable")
- IChannelAddressResolver.ListDestinationsAsync default interface method
  returns a loud Unsupported; resolvers opt in. Listing is destination-only
  by design: deliverable destinations are bounded (memberships, guild
  channels, configured allowlists) while user directories are unbounded
  and only make sense as server-side searches — lookup_channel_user keeps
  requiring a query
- Slack: paginates the existing conversations listing through the same
  SlackAclPolicy.IsAllowedChannel gate the search path uses
- Discord: enumerates cached guild text channels, ACL-filtered in the
  resolver (the lookup client stays an unfiltered cache reflection)
- Mattermost: derived purely from configuration — its channel ACL is a
  strict allowlist, so default channel + AllowedChannelIds IS the
  deliverable set; no API surface needed
- lookup_channel_destination: blank/omitted query triggers the listing
  (capped at 50 with an "…and N more" remainder); 'query' leaves the
  schema's required array for this tool only
- netclaw-operations skill: discovery note added, version 2.10.4 → 2.11.0

Tests: resolver listing + ACL filtering per channel, tool-level listing /
empty / cap / unsupported / user-still-required / schema-required cases.
Actors 2,299 and Daemon 729 green; slopwatch clean.

Note: lookup_channel_destination's LLM-facing schema changed (query now
optional) — eval suite run recommended before release per CLAUDE.md.
…ntract

Ports the dev hotfix (PR netclaw-dev#1374) into ChannelLifecycleActor: normalize
ActorRefs.Nobody to null at the single reply-to storage point in
StartConnecting. The auto-retry path connects with Nobody, and storing it
verbatim made every pending-caller null check misfire — Discord's
ready-timeout handler parked the actor in a permanent, silent
CleanReconnectRequired zombie (the 0.24.0-beta.2 incident), and
ConnectionRestored never published after auto-retry recoveries on either
channel. The known-issue comment preserved during the Phase 4 extraction
is replaced by the fix it predicted.

Coverage:
- GatewayLifecycleContractTests.Auto_reconnect_restores_ready now also
  asserts ConnectionRestored publishes exactly once on recovery, via a
  new ConnectionRestoredCount fixture hook (both channels)
- DiscordGatewayLifecycleRetryTimeoutTests (same file as the dev hotfix,
  so the upcoming rebase converges): virtual-time replay of the incident
  — ready → clean reconnect → auto-retry → READY timeout → must emit a
  second clean-reconnect and recover when READY arrives. Settle barriers
  poll the post-stop health detail (written by the same handler that
  arms the retry timer) so virtual time never advances past an unarmed
  timer.

All three assertions mutation-verified against the unfixed base.
Lifecycle suite stable across 5 consecutive runs; full Actors suite
2,300 green; slopwatch clean.
Review finding: three openspec capability specs still normatively
mandated the per-channel send tools Phase 7 deleted, and SPEC-015 still
documented the pre-implementation builder sketch. Per the constitution,
planning artifacts are fixed when they conflict with implementation.

- Delta specs added to the active standardize-channel-delivery-contracts
  change (validated with `openspec validate`):
  - netclaw-discord-socket: proactive posting via send_channel_message
    (explicit resolved destination, no tool-layer default-channel
    fallback, fixed "Conversation" thread name); the channel-only
    restriction is renamed/extended to cover DM destinations gated by
    AllowDirectMessages + the user ACL
  - netclaw-mattermost-socket: proactive sends via send_channel_message;
    DM session wiring validates the USER ACL — ephemeral DM channel ids
    are not required in the channel allowlist
  - netclaw-scheduling: reminder Channel delivery references
    send_channel_message instead of send_slack_message
- SPEC-015 §1.4 updated to the as-built RemoteChatChannelBuilder API
  (ChannelType parameter, factory-based With* methods,
  WithProactiveSendClient, WithServices escape hatch); §1.5 records the
  Phase 7 outcome (per-channel tools were unreachable dead surface);
  status updated
Review finding (pre-existing): every Mattermost proactive DM posted the
message but failed session wiring — HandleProactiveThread gated the
ephemeral DM channel id from OpenDmChannelAsync through the strict
channel allowlist, which by construction never contains it, so the
StartMattermostProactiveThread ask was always nacked and the tool always
reported "session pipeline failed to initialize". The inbound ACL
already exempted DMs from the channel check; the proactive path did not.

Fix is Discord parity: StartMattermostProactiveThread carries
DirectMessageUserId, and the conversation actor validates DMs against
AllowDirectMessages + the user ACL (MattermostAclPolicy.IsAllowedUser)
instead of the channel allowlist — ephemeral DM channel ids are
transport plumbing, the user ACL is the authority. Channel-destination
wiring is unchanged. Both proactive records also gain
INoSerializationVerificationNeeded (Discord's already had it).

The old test masked the bug with a FakeGatewayActor that acked
unconditionally; new TestKit coverage drives the REAL conversation
actor: DM+allowed user acks (fails pre-fix, mutation-verified),
DM+disallowed user nacks, DMs-disabled nacks, non-allowlisted channel
still nacks. Matches the netclaw-mattermost-socket delta spec added in
the previous commit.
Review found two surviving siblings of the 0.24.0-beta.2 zombie plus two
reliability defects, all in lifecycle code. Structural fix approved:

- ReadySignalTimeout (TimeSpan?, null = start-implies-ready) moves the
  ready-timeout concept into ChannelLifecycleActor. The base arms and
  cancels the timer on every Connecting entry/exit; Discord overrides
  with 30s and drops IWithTimers. The four Discord-only hooks
  (OnConnectingEntered, CancelReadySignalTimer virtual,
  OnTransitionedToReady, OnIgnoredConnectWork) are deleted —
  ReadySignalTimedOut implements IGatewayConnectFailure so stale timers
  ride the existing wrong-behavior fail-the-ask path.
- Both zombie branches collapse into one canonical transient-fail path:
  fail any pending ask, then RequestCleanReconnect → publish → clean
  stop (auto-reconnect preserved) → scheduled retry → Disconnected. The
  HasPendingConnect branching that produced the startup zombie and the
  dropped-retry zombie no longer exists; CleanReconnectRequired became
  unreachable and is deleted along with its handlers in both subclasses.
- ConnectionRestored publishes on EVERY transition to Ready (the dead-ask
  window left channels Healthy-but-deaf: a >35s connect orphaned
  CompleteConnectionSetup forever). Discord/Mattermost channel setup is
  now guarded + idempotent since the ask reply and the event can race.
- Retry backoff resets only after a stable Ready period (≥60s via
  TimeProvider): a flapping Mattermost server now walks the 5s→5min
  backoff instead of hot-looping with zero delay forever.

All six behaviors mutation-verified (each new test fails with its fix
reverted). Lifecycle contract suite 3× stable; deliberate assertion
updates: initial connects now count one ConnectionRestored publish.
Two review findings on consolidation completeness:

- SessionIdFormat (in ChannelGatewayActor.cs) is now the single owner of
  the '{channelId}/{threadKey}' session-id grammar: Build + TrySplit
  with the exact original guards. The three token-identical gateway
  parsers become 4-line typed wrappers; ChannelConversationActor.
  BuildSessionId routes through it; 8 inline interpolated build sites
  across conversation actors and proactive clients converted (webhook/
  signalr/reminder ids are differently-shaped and deliberately left).
  A future grammar change now has one production owner instead of 1
  builder + 3 parsers + 8 stragglers, where a missed copy silently
  nacked reminder/webhook delivery for one channel.

- ProactiveSendFormatting centralizes the LLM-visible result/error
  strings and the 30s thread-ack timeout shared by the three proactive
  outbound clients (Slack/Mattermost were 79-lines-identical twins);
  Discord keeps its genuinely unique strings local. New
  ProactiveOutboundClientContractTests base (7 behaviors × 3 fixtures)
  pins the strings LITERALLY in the base — byte-parity fails CI if any
  channel drifts. 12 per-channel tests removed as exact contract
  duplicates; channel-specific coverage kept.

Bonus finding from writing the contract: the old per-channel
"successful send" tests asked a MinimalActorRef fake whose Ask always
faults, so they only ever exercised the pipeline-failed fallback. The
contract base runs a real responder actor under TestKit — the success
path is genuinely covered for the first time.
Five review findings on the registration/DI surface and the new listing:

- Thread-history fetchers are now KEYED by channel: with 2+ channels
  enabled, MEDI's last-registration-wins handed Slack/Discord another
  channel's IThreadHistoryFetcher and thread rehydration silently
  backfilled empty. Channels resolve their own keyed fetcher via the
  builder's channel factory; regression test enables two channels and is
  mutation-verified against the unkeyed registration.
- Generic send/lookup tools register from inside AddRemoteChatChannel
  (once, guarded) — the two hardcoded three-channel extensions with
  divergent config parsing are deleted, so a 4th channel can no longer
  silently boot without its tools.
- Outbound clients become the registry's fifth keyed map with duplicate
  detection at construction (was: per-send IServiceProvider scan that
  let duplicate registrations boot clean and explode mid-conversation);
  SendChannelMessageTool dispatches via IChannelRegistry.
- SlackTargetResolver now sees the runtime-resolved DefaultChannelId
  (DefaultChannelName-only configs previously made lookups/listing deny
  everything while sends to the same channel worked). Accessors stay
  lazy: the registry constructs clients eagerly, so eager channel
  capture would recurse channel→registry→client→channel.
- Slack listing derives from config (resolved default + AllowedChannelIds,
  zero conversations.list calls — was O(workspace) paginated against a
  strict allowlist), resolves display names via conversations.info with
  raw-ID fallback, and skips archived channels; the search path now
  passes excludeArchived (archived channels are never deliverable, so
  they no longer resolve by name either).

MattermostClient registration uses the same "unconfigured" placeholder
as the SlackNet wiring above it — eager registry construction must not
throw for enabled-but-tokenless channels; StartAsync still degrades
loudly per issue netclaw-dev#1033. netclaw-operations skill discovery note updated
to the config-derived semantics (2.11.0 → 2.11.1).

Full solution suite green (Actors 2,326 / Daemon 733 / Cli 792 / +4
more projects); slopwatch clean.
@Aaronontheweb Aaronontheweb added channels Discord, Slack, and other channels. cleanup Code quality improvements and tech debt reduction labels Jun 10, 2026
@Aaronontheweb

Copy link
Copy Markdown
Collaborator Author

This is a continuation of #1326

The runbook still taught the pre-consolidation world: hand-written
lifecycle/gateway/conversation actors, per-channel send tools, and
AddXxxChannelIntegration DI extensions — all deleted by this branch.

Rewritten as the operational walkthrough of the as-built design, with
Mattermost as the canonical clone-and-adapt reference: the generic bases
and what each subclass actually provides (ReadySignalTimeout decision,
the canonical fail path, the 13 conversation projection hooks,
SessionIdFormat grammar ownership, ProactiveSendFormatting byte-stable
strings, the ACL-above-SDK-seam invariant, keyed thread-history
fetchers, the lazy-accessor rule), the single builder chain including an
explicit "the builder does this for you" list, the seven contract test
suites a new channel inherits, and the definition-of-done gates.

Every cited identifier (138) grep-verified against src/; numeric claims
(dedup bound, passivation, truncation, stability window, timeouts)
verified against source. CONTRIBUTING.md structure list gains the
missing Mattermost project entry.
@Aaronontheweb Aaronontheweb merged commit b47dff3 into netclaw-dev:dev Jun 10, 2026
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channels Discord, Slack, and other channels. cleanup Code quality improvements and tech debt reduction

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant