Standardize channel infrastructure, add destination listing, fix lifecycle reliability (SPEC-015)#1375
Merged
Aaronontheweb merged 14 commits intoJun 10, 2026
Conversation
Consolidate per-channel test suites into shared contract bases under Channels/Contracts/, so every channel is held to the same behavioral bar: - ChannelHealthContractTests: IChannel.GetHealthAsync contract (3 shared tests) + SnapshotChannelHealthContractTests (2 more for channels with transport snapshots). Discord/Mattermost implement both; Slack joins the base only — its health model has no snapshot/detail (candidate for Phase 3 IGatewaySnapshot unification). Closes the Slack and Mattermost health coverage gaps. - GatewayLifecycleContractTests: all 7 union lifecycle behaviors now run for BOTH Discord and Mattermost (previously 2 and 4 partial tests respectively) using virtual time via TestScheduler — no opt-outs. Retry-timer cleanup is mutation-tested via dead-letter detection. - RoutingPolicyContractTests: 7 shared routing behaviors (mention gates, thread continuation, rehydration, DM matrix, empty content) normalized across all three channels. Platform-specific tests stay standalone; Mattermost's standalone file collapsed entirely into the contract. Deletes the superseded standalone test files; promotes FakeSlackApiClient and FakeHttpClientFactory to TestHelpers for reuse. Net -1,200 LOC of duplicated test code while increasing covered behaviors. Known issue surfaced for Phase 4: PublishConnectionRestoredAsync is unreachable on the auto-retry path in both lifecycle actors (retry passes ActorRefs.Nobody, so the isRetry branch never fires).
… Phase 3) DiscordGatewaySnapshot and MattermostGatewaySnapshot both carried (IsConnected, IsReady, HealthDetail) plus platform bot identity, and the two channels' GetHealthAsync bodies were character-for-character duplicates apart from operator-facing strings. - IGatewaySnapshot in Netclaw.Channels exposes the common health surface - GatewayChannelHealth.Evaluate centralizes ready/degraded/disconnected mapping with caller-supplied fallback strings, so each channel keeps its exact wording (zero behavioral drift) - Both snapshot records now implement the interface, unblocking the generic ChannelLifecycleActor (Phase 4)
Discord (1,163 LOC) and Mattermost (691 LOC) gateway lifecycle actors implemented the same state machine independently: Disconnected → Connecting → Ready with CleanReconnectRequired/Disconnecting, exponential backoff retry, snapshot queries, and PostStop timer cleanup. ChannelLifecycleActor<TSnapshot, TConnectCommand> in Netclaw.Channels now owns the state machine; subclasses provide transport start/stop, snapshot factories, event subscription, failure classification, and channel display-name strings. TConnectCommand carries each channel's connect credentials whole, so the retry path stores the original command and the shared GetSnapshot/Disconnect records resolve through inheritance — zero churn at gateway client call sites or in test fixtures. Genuine divergences stay in the subclasses: Discord's explicit READY event wait (30s IWithTimers timeout), fatal-close classification, and clean-reconnect-on-Connected-while-Ready; Mattermost's implicit ready-on-start and benign Connected-while-Ready handling. Net -1,015 LOC across the two actors (+718 documented, reusable base). Behavior preserved exactly, including the known dormant ConnectionRestored-never-fires-on-auto-retry issue, now marked with a comment for a separate fix. Lifecycle/health contract tests (29) pass unmodified across 3 consecutive runs; full suite 2,303 green.
…-015 Phase 5) The three AddXxxChannelIntegration extensions (Slack 134 / Mattermost 137 / Discord 113 LOC) each hand-rolled the same wiring: options binding, descriptor creation, keyed IChannel + IHostedService registration, files HTTP client with Netclaw headers, and enabled/disabled gating. AddRemoteChatChannel<TChannel, TOptions> + RemoteChatChannelBuilder now own that pattern; each channel declares its registrations as a fluent chain (~80 LOC) with exactly one WithServices escape hatch for SDK quirks (Mattermost's issue-netclaw-dev#1033 client fallback, Discord's socket client intents, Slack's SlackNet/event-handler block). Descriptors always register — even disabled — so the registry can list every channel; misconfigured chains throw at composition time. IRemoteChatChannelOptions gives channel-agnostic registration code the Enabled/AllowDirectMessages surface that drives CreateRemoteChat without knowing the concrete options type. Registration parity was verified by diffing ServiceDescriptor lists (legacy vs builder) per (ServiceType, ServiceKey) across enabled/ disabled/unconfigured/callback configs: Slack and Discord identical; Mattermost's reply/outbound clients moved from factory lambdas to equivalent type registrations. Total registration code grows slightly (384 → 508 LOC) because the 236-LOC builder is now reusable — marginal cost for channel netclaw-dev#4 drops to the ~80 LOC chain. Daemon tests 721 green, Actors tests 2,303 green, slopwatch clean.
ChannelGatewayActor<TChannelId> owns the shared gateway shape: bounded 4096-entry event-id dedup, get-or-create conversation children with Uri.EscapeDataString naming, and DeliverTrustedSessionTurn routing. Subclasses keep their channel-specific inbound receives so telemetry labels and log argument shapes stay byte-identical; Slack overrides OnMissingEventId to preserve its lenient empty-event-id acceptance. ChannelConversationActor<TMessage> owns the security pipeline in fixed order (ACL → bot filter → ingress gate → routing policy → empty filter → bind → forward), 2-hour passivation, stop supervision, and death-watch. The 12 abstract members are all projections (EvaluateAcl, EventIdOf, CreateThreadInbound, ...) — no logic hooks, so the gate ordering can no longer drift between channels. ChannelRoutingVerdict normalizes routing decisions while keeping log reasons and telemetry labels as separate axes. Slack's conversation actor stays concrete per the spec's Phase 6 risk gate: its pipeline order differs observably (routing policy before ingress gate; thread creation before the empty-text filter), it has no truncation or death-watch, and forcing ordering hooks into the base would make base+subclass more complex than the 283-line concrete actor. Slack's gateway migrated cleanly. Migrated actors: Mattermost gateway 164→117, Discord gateway 165→116, Slack gateway 165→119, Mattermost conversation 321→232, Discord conversation 361→270. Contract test bases and fixtures untouched; full suites green (Actors 2,303 / Daemon 721); routing contracts 3× stable.
…C-015 Phase 7) The spec's kitchen-sink-schema risk turned out to be moot: since the registration builder landed, SendSlackMessageTool / SendDiscordMessageTool / SendMattermostMessageTool were registered as concrete types consumed only by SendChannelMessageTool's internal switch — never IChannelTool-marked, never visible to the LLM. Their channel-specific parameters were dead surface, so full consolidation is free. - IChannelOutboundClient + ChannelSendRequest in Netclaw.Channels (spec sketch); SendChannelMessageTool dispatches by channel_key over the multi-bound clients — the hardcoded ChannelType switch and dictionary re-marshalling are gone - Per-channel logic moves verbatim into Slack/Discord/Mattermost ProactiveOutboundClient classes (423 → 371 LOC): every ACL check, AllowDirectMessages gate, default-channel comparison, and LLM-visible result/error string preserved exactly - The new clients sit above the ISlackOutboundClient-style SDK transports rather than merging into them — those interfaces are the fake seam in tests, and ACL must live above the mocked layer - Builder: WithSendTool → WithProactiveSendClient; duplicate channel-key client registrations now throw - send_channel_message name/description/schema byte-identical → no skill or eval changes required Tests: proactive-thread suites migrated to the clients (9 dead param-parsing/alias tests dropped), dispatch-routing and missing-adapter coverage added. Actors 2,294 / Daemon 723 green; slopwatch clean.
Answers "where can I post?" — previously the enabled-channel keys were
discoverable via the schema enum, but there was no way to enumerate the
destinations within a channel; agents had to guess names and recover from
errors.
- ChannelAddressResolutionStatus.Listed + Listed() factory: a listing is
not an Ambiguous match — an empty candidate list is a valid answer
("reachable, but nothing currently deliverable")
- IChannelAddressResolver.ListDestinationsAsync default interface method
returns a loud Unsupported; resolvers opt in. Listing is destination-only
by design: deliverable destinations are bounded (memberships, guild
channels, configured allowlists) while user directories are unbounded
and only make sense as server-side searches — lookup_channel_user keeps
requiring a query
- Slack: paginates the existing conversations listing through the same
SlackAclPolicy.IsAllowedChannel gate the search path uses
- Discord: enumerates cached guild text channels, ACL-filtered in the
resolver (the lookup client stays an unfiltered cache reflection)
- Mattermost: derived purely from configuration — its channel ACL is a
strict allowlist, so default channel + AllowedChannelIds IS the
deliverable set; no API surface needed
- lookup_channel_destination: blank/omitted query triggers the listing
(capped at 50 with an "…and N more" remainder); 'query' leaves the
schema's required array for this tool only
- netclaw-operations skill: discovery note added, version 2.10.4 → 2.11.0
Tests: resolver listing + ACL filtering per channel, tool-level listing /
empty / cap / unsupported / user-still-required / schema-required cases.
Actors 2,299 and Daemon 729 green; slopwatch clean.
Note: lookup_channel_destination's LLM-facing schema changed (query now
optional) — eval suite run recommended before release per CLAUDE.md.
…ntract Ports the dev hotfix (PR netclaw-dev#1374) into ChannelLifecycleActor: normalize ActorRefs.Nobody to null at the single reply-to storage point in StartConnecting. The auto-retry path connects with Nobody, and storing it verbatim made every pending-caller null check misfire — Discord's ready-timeout handler parked the actor in a permanent, silent CleanReconnectRequired zombie (the 0.24.0-beta.2 incident), and ConnectionRestored never published after auto-retry recoveries on either channel. The known-issue comment preserved during the Phase 4 extraction is replaced by the fix it predicted. Coverage: - GatewayLifecycleContractTests.Auto_reconnect_restores_ready now also asserts ConnectionRestored publishes exactly once on recovery, via a new ConnectionRestoredCount fixture hook (both channels) - DiscordGatewayLifecycleRetryTimeoutTests (same file as the dev hotfix, so the upcoming rebase converges): virtual-time replay of the incident — ready → clean reconnect → auto-retry → READY timeout → must emit a second clean-reconnect and recover when READY arrives. Settle barriers poll the post-stop health detail (written by the same handler that arms the retry timer) so virtual time never advances past an unarmed timer. All three assertions mutation-verified against the unfixed base. Lifecycle suite stable across 5 consecutive runs; full Actors suite 2,300 green; slopwatch clean.
Review finding: three openspec capability specs still normatively
mandated the per-channel send tools Phase 7 deleted, and SPEC-015 still
documented the pre-implementation builder sketch. Per the constitution,
planning artifacts are fixed when they conflict with implementation.
- Delta specs added to the active standardize-channel-delivery-contracts
change (validated with `openspec validate`):
- netclaw-discord-socket: proactive posting via send_channel_message
(explicit resolved destination, no tool-layer default-channel
fallback, fixed "Conversation" thread name); the channel-only
restriction is renamed/extended to cover DM destinations gated by
AllowDirectMessages + the user ACL
- netclaw-mattermost-socket: proactive sends via send_channel_message;
DM session wiring validates the USER ACL — ephemeral DM channel ids
are not required in the channel allowlist
- netclaw-scheduling: reminder Channel delivery references
send_channel_message instead of send_slack_message
- SPEC-015 §1.4 updated to the as-built RemoteChatChannelBuilder API
(ChannelType parameter, factory-based With* methods,
WithProactiveSendClient, WithServices escape hatch); §1.5 records the
Phase 7 outcome (per-channel tools were unreachable dead surface);
status updated
Review finding (pre-existing): every Mattermost proactive DM posted the message but failed session wiring — HandleProactiveThread gated the ephemeral DM channel id from OpenDmChannelAsync through the strict channel allowlist, which by construction never contains it, so the StartMattermostProactiveThread ask was always nacked and the tool always reported "session pipeline failed to initialize". The inbound ACL already exempted DMs from the channel check; the proactive path did not. Fix is Discord parity: StartMattermostProactiveThread carries DirectMessageUserId, and the conversation actor validates DMs against AllowDirectMessages + the user ACL (MattermostAclPolicy.IsAllowedUser) instead of the channel allowlist — ephemeral DM channel ids are transport plumbing, the user ACL is the authority. Channel-destination wiring is unchanged. Both proactive records also gain INoSerializationVerificationNeeded (Discord's already had it). The old test masked the bug with a FakeGatewayActor that acked unconditionally; new TestKit coverage drives the REAL conversation actor: DM+allowed user acks (fails pre-fix, mutation-verified), DM+disallowed user nacks, DMs-disabled nacks, non-allowlisted channel still nacks. Matches the netclaw-mattermost-socket delta spec added in the previous commit.
Review found two surviving siblings of the 0.24.0-beta.2 zombie plus two reliability defects, all in lifecycle code. Structural fix approved: - ReadySignalTimeout (TimeSpan?, null = start-implies-ready) moves the ready-timeout concept into ChannelLifecycleActor. The base arms and cancels the timer on every Connecting entry/exit; Discord overrides with 30s and drops IWithTimers. The four Discord-only hooks (OnConnectingEntered, CancelReadySignalTimer virtual, OnTransitionedToReady, OnIgnoredConnectWork) are deleted — ReadySignalTimedOut implements IGatewayConnectFailure so stale timers ride the existing wrong-behavior fail-the-ask path. - Both zombie branches collapse into one canonical transient-fail path: fail any pending ask, then RequestCleanReconnect → publish → clean stop (auto-reconnect preserved) → scheduled retry → Disconnected. The HasPendingConnect branching that produced the startup zombie and the dropped-retry zombie no longer exists; CleanReconnectRequired became unreachable and is deleted along with its handlers in both subclasses. - ConnectionRestored publishes on EVERY transition to Ready (the dead-ask window left channels Healthy-but-deaf: a >35s connect orphaned CompleteConnectionSetup forever). Discord/Mattermost channel setup is now guarded + idempotent since the ask reply and the event can race. - Retry backoff resets only after a stable Ready period (≥60s via TimeProvider): a flapping Mattermost server now walks the 5s→5min backoff instead of hot-looping with zero delay forever. All six behaviors mutation-verified (each new test fails with its fix reverted). Lifecycle contract suite 3× stable; deliberate assertion updates: initial connects now count one ConnectionRestored publish.
Two review findings on consolidation completeness:
- SessionIdFormat (in ChannelGatewayActor.cs) is now the single owner of
the '{channelId}/{threadKey}' session-id grammar: Build + TrySplit
with the exact original guards. The three token-identical gateway
parsers become 4-line typed wrappers; ChannelConversationActor.
BuildSessionId routes through it; 8 inline interpolated build sites
across conversation actors and proactive clients converted (webhook/
signalr/reminder ids are differently-shaped and deliberately left).
A future grammar change now has one production owner instead of 1
builder + 3 parsers + 8 stragglers, where a missed copy silently
nacked reminder/webhook delivery for one channel.
- ProactiveSendFormatting centralizes the LLM-visible result/error
strings and the 30s thread-ack timeout shared by the three proactive
outbound clients (Slack/Mattermost were 79-lines-identical twins);
Discord keeps its genuinely unique strings local. New
ProactiveOutboundClientContractTests base (7 behaviors × 3 fixtures)
pins the strings LITERALLY in the base — byte-parity fails CI if any
channel drifts. 12 per-channel tests removed as exact contract
duplicates; channel-specific coverage kept.
Bonus finding from writing the contract: the old per-channel
"successful send" tests asked a MinimalActorRef fake whose Ask always
faults, so they only ever exercised the pipeline-failed fallback. The
contract base runs a real responder actor under TestKit — the success
path is genuinely covered for the first time.
Five review findings on the registration/DI surface and the new listing: - Thread-history fetchers are now KEYED by channel: with 2+ channels enabled, MEDI's last-registration-wins handed Slack/Discord another channel's IThreadHistoryFetcher and thread rehydration silently backfilled empty. Channels resolve their own keyed fetcher via the builder's channel factory; regression test enables two channels and is mutation-verified against the unkeyed registration. - Generic send/lookup tools register from inside AddRemoteChatChannel (once, guarded) — the two hardcoded three-channel extensions with divergent config parsing are deleted, so a 4th channel can no longer silently boot without its tools. - Outbound clients become the registry's fifth keyed map with duplicate detection at construction (was: per-send IServiceProvider scan that let duplicate registrations boot clean and explode mid-conversation); SendChannelMessageTool dispatches via IChannelRegistry. - SlackTargetResolver now sees the runtime-resolved DefaultChannelId (DefaultChannelName-only configs previously made lookups/listing deny everything while sends to the same channel worked). Accessors stay lazy: the registry constructs clients eagerly, so eager channel capture would recurse channel→registry→client→channel. - Slack listing derives from config (resolved default + AllowedChannelIds, zero conversations.list calls — was O(workspace) paginated against a strict allowlist), resolves display names via conversations.info with raw-ID fallback, and skips archived channels; the search path now passes excludeArchived (archived channels are never deliverable, so they no longer resolve by name either). MattermostClient registration uses the same "unconfigured" placeholder as the SlackNet wiring above it — eager registry construction must not throw for enabled-but-tokenless channels; StartAsync still degrades loudly per issue netclaw-dev#1033. netclaw-operations skill discovery note updated to the config-derived semantics (2.11.0 → 2.11.1). Full solution suite green (Actors 2,326 / Daemon 733 / Cli 792 / +4 more projects); slopwatch clean.
Collaborator
Author
|
This is a continuation of #1326 |
The runbook still taught the pre-consolidation world: hand-written lifecycle/gateway/conversation actors, per-channel send tools, and AddXxxChannelIntegration DI extensions — all deleted by this branch. Rewritten as the operational walkthrough of the as-built design, with Mattermost as the canonical clone-and-adapt reference: the generic bases and what each subclass actually provides (ReadySignalTimeout decision, the canonical fail path, the 13 conversation projection hooks, SessionIdFormat grammar ownership, ProactiveSendFormatting byte-stable strings, the ACL-above-SDK-seam invariant, keyed thread-history fetchers, the lazy-accessor rule), the single builder chain including an explicit "the builder does this for you" list, the seven contract test suites a new channel inherits, and the definition-of-done gates. Every cited identifier (138) grep-verified against src/; numeric claims (dedup bound, passivation, truncation, stability window, timeouts) verified against source. CONTRIBUTING.md structure list gains the missing Mattermost project entry.
This was referenced Jun 10, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements SPEC-015 (channel infrastructure standardization), adds blank-query destination discovery to
lookup_channel_destination, and fixes every confirmed finding from a max-effort review of the branch — including the two remaining siblings of the gateway zombie class that #1374 hotfixed.SPEC-015 consolidation (Phases 2–7)
IGatewaySnapshot+ sharedGatewayChannelHealth.EvaluateChannelLifecycleActor<TSnapshot, TConnectCommand>— Discord 1,163→~700, Mattermost 691→280 LOCRemoteChatChannelBuilderfluent registration (~80 LOC per channel, oneWithServicesescape hatch each)ChannelGatewayActor/ChannelConversationActor— security pipeline order can no longer drift between channels; Slack conversation stays concrete per the spec's risk gateIChannelOutboundClientsend consolidation — per-channel send tools were unreachable dead surface and are deleted;send_channel_messageschema byte-identicalA new remote chat channel now costs ~5 files plus a builder chain (runbook:
docs/runbooks/adding-a-channel.md).Destination discovery
lookup_channel_destinationwithqueryomitted lists every destination the channel can deliver to: Slack derives from config (resolved default channel + allowlist, display names viaconversations.info, archived excluded), Discord from the ACL-filtered guild cache, Mattermost from its allowlist.queryleft the required array for this tool only;lookup_channel_userstill requires one (user directories are unbounded).netclaw-operationsskill → 2.11.1.Reliability fixes (review-driven, all mutation-verified)
CleanReconnectRequiredparking state is deleted outright. Follow-up to Fix Discord gateway zombie state after failed auto-retry #1374, which fixed only the auto-retry leg.ConnectionRestoredpublishes on every transition to Ready; channel setup is idempotent. A >35s connect can no longer orphanMessageReceivedwiring forever.IThreadHistoryFetcheris keyed per channel — previously last-registration-won and thread rehydration silently backfilled empty for all but one channel.Behavioral notes for reviewers
excludeArchived: true) — they were never deliverable.standardize-channel-delivery-contractschange for the three capability specs that still mandated the deleted per-channel send tools (openspec validatepasses); SPEC-015 doc updated to the as-built API.lookup_channel_destinationschema change (needs a provider endpoint) — flagged per CLAUDE.md; recommend running before the next release tag.Validation
Full solution suite green: Actors 2,326 / Daemon 733 / Cli 792 / Security 566 / Configuration 409 / +4 more projects. Lifecycle contract suite stable across repeated runs; every behavioral fix has a regression test that fails with the fix reverted.
dotnet slopwatch analyzeclean; copyright headers verified.