Skip to content

fix: stream everywhere for auxiliary LLM calls + session id on LLM logs#1309

Closed
Aaronontheweb wants to merge 4 commits into
netclaw-dev:devfrom
Aaronontheweb:claude-wt-openai-memory-formation-errors
Closed

fix: stream everywhere for auxiliary LLM calls + session id on LLM logs#1309
Aaronontheweb wants to merge 4 commits into
netclaw-dev:devfrom
Aaronontheweb:claude-wt-openai-memory-formation-errors

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

What

Two independent fixes behind the OpenAI Codex memory-formation errors (400 {"detail":"Stream must be set to true"}).

1. Stream everywhere at the provider boundary (supersedes #1289)

The Codex backend is streaming-only — it rejects non-streaming Responses requests. Netclaw's interactive session loop streams, but auxiliary calls (title generation, memory distillation, compaction, curation) use the non-streaming IChatClient.GetResponseAsync path, so they 400 against Codex while the main loop works fine.

Rather than a per-provider capability model, make streaming the single request mode the daemon issues at the provider boundary: a new StreamFirstChatClient (DelegatingChatClient) serves GetResponseAsync by aggregating the streaming endpoint (GetStreamingResponseAsync(...).ToChatResponseAsync(...)), applied to every client in ProviderPluginFactory.Create. Streaming is near-universal among supported providers, and any environment that breaks streaming would already break the main session loop (which streams too).

  • Empty-stream guard: an aggregated empty stream yields zero messages, but sidecar callers index Messages[^1]. StreamFirstChatClient appends an empty assistant message in that case, matching StreamingResponseReader's existing guard, so an empty completion degrades to empty text instead of an unguarded index throw.

2. Session id on LLM logs in Seq

LLM client logs (LoggingChatClient, FailoverChatClient) reached Seq with no session correlation: the id lives in SessionDiagnosticsContext (an AsyncLocal pushed by every session-owned path) but the OTLP exporter never projected it onto the LogRecord. A shared SessionLoggingScope opens a session.id logging scope (read from the ambient context) around the LLM calls and the failover/outage log lines, riding the already-enabled IncludeScopes so it surfaces as a filterable Seq attribute — no telemetry-wiring change.

Testing

  • StreamFirstChatClient: aggregation + streaming passthrough + empty-stream guard.
  • LoggingChatClient / FailoverChatClient: session.id scope attached (and absent when no session is in scope).
  • Daemon Configuration suite: 119/119 passing; build clean (0 warnings); slopwatch + file headers clean.

Notes

  • Supersedes fix(openai): serve non-streaming Codex calls via streaming (stop-gap) #1289 (the Codex-specific stop-gap) — the stream→aggregate wrapper is now universal rather than a single inline special case.
  • A pre-existing, separate diagnostic-only quirk was found and deliberately left alone: LoggingChatClient's delta/cumulative token counters are mutable state on a singleton, so those two values in the LLM call completed debug-log line mix sessions. They feed nothing else — the real per-response usage that drives chat -p, OTel metrics, stats, the session catalog, and compaction all comes from the per-session LlmSessionActor and is unaffected. Worth a separate cleanup, not this PR.

Auxiliary LLM calls (title generation, memory distillation, compaction) used the non-streaming IChatClient.GetResponseAsync path, which the OpenAI Codex backend rejects with 400 {"detail":"Stream must be set to true"}. The interactive session loop already streams, so only the sidecars broke.

Make streaming the single request mode at the provider boundary: a new StreamFirstChatClient serves GetResponseAsync by aggregating the streaming endpoint (GetStreamingResponseAsync(...).ToChatResponseAsync(...)), applied to every client in ProviderPluginFactory.Create. No per-provider capability model -- streaming is near-universal among supported providers, and any environment that breaks streaming would already break the main loop.

Separately, these LLM client logs reached Seq with no session id: the id lives in SessionDiagnosticsContext (AsyncLocal) but the OTLP exporter never projected it onto the LogRecord. Fix it at the logger -- LoggingChatClient now opens a "session.id" logging scope from the ambient context around both response paths, riding the already-enabled IncludeScopes so the id surfaces as a filterable Seq attribute with no telemetry changes.

Supersedes netclaw-dev#1289.
… doc

A review found StreamFirstChatClient.GetResponseAsync could return a ChatResponse with zero messages when the underlying stream yields no updates (empty/aborted completion). Auxiliary callers index Messages[^1] (title generation, memory distillation, compaction), so an empty stream was silent data loss. Make GetResponseAsync async and append an empty assistant message when the aggregate is empty, matching StreamingResponseReader's guard; add a regression test.

Also correct the BeginSessionScope doc comment: the session.id scope only adds correlation on the OTLP export path, not every sink (the rolling-file provider ignores scopes and routes session.log independently).
Extract SessionLoggingScope so the session.id logging scope is shared, and route FailoverChatClient's failover/outage log lines through it — those are emitted outside LoggingChatClient's scope and previously reached Seq without session correlation. (AlertingChatClientDecorator emits only notification-sink alerts, no MEL logs, so needs nothing.)

Use a single-element KeyValuePair array instead of a per-call Dictionary for the scope state. Test cleanup: share one ScopeCapturingLogger across the chat-client tests; add a failover session-id test.
@Aaronontheweb Aaronontheweb added providers Provider integrations and capability detection across OpenAI-compatible backends. reliability Retries, resilience, graceful degradation labels Jun 3, 2026

@Aaronontheweb Aaronontheweb left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs work

/// </summary>
internal static class SessionLoggingScope
{
private const string SessionIdKey = "session.id";

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this consistent with where we've done it elsewhere?

if (response.Messages.Count == 0)
response.Messages.Add(new ChatMessage(ChatRole.Assistant, []));

return response;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bad design - better to have everything consume the IAsyncEnumerable than to hide it in a long-running Task like this. Inverse of what we should be doing here.

@Aaronontheweb Aaronontheweb marked this pull request as draft June 3, 2026 07:47
@Aaronontheweb

Copy link
Copy Markdown
Collaborator Author

Converting to draft — pausing this approach.

The Codex 400 {"detail":"Stream must be set to true"} is a symptom of broader debt in the chat-client stack rather than a one-off: the streaming vs non-streaming split leaks into every caller (the main session loop and SubAgentActor already independently switched to streaming — see the Qwen-reasoning comment in SubAgentActor), retry is coupled to transport (RetryingChatClient only retries the non-streaming path), LoggingChatClient carries per-session-intended mutable state on a process singleton, and the decorator order is assembled ad hoc. The StreamFirstChatClient wrapper here is another per-caller workaround on top of that.

Rather than pile on, we're doing a focused design pass on the chat-client stack first. This PR (and #1289) will be superseded by that work.

Aaronontheweb added a commit that referenced this pull request Jun 4, 2026
…outing (#1313)

* feat(providers): streaming-native chat-client stack with composable routing

Redesign the IChatClient stack so the streaming-vs-non-streaming transport distinction no longer leaks into callers, and resilience/observability are no longer entangled with transport. Supersedes #1289 and #1309.

Transport: Netclaw now issues only streaming requests. RetryingChatClient retries the streaming path (pre-first-chunk; a mid-stream failure propagates so emitted output is never duplicated), reusing RetryPolicy. The auxiliary callers (title generation, memory distillation, compaction observation, memory curation, memory extraction) aggregate the stream via the existing StreamingResponseReader (empty-Messages guard + diagnostics) and read response.Text, instead of the non-streaming GetResponseAsync path that the OpenAI Codex backend rejects (400 'Stream must be set to true') and that drops reasoning content on some providers.

Composition: per-(provider,model) pipelines are composed with Microsoft.Extensions.AI ChatClientBuilder (Logging -> Retry -> VendorOptions -> raw) in PipelineChatClientFactory; the .Use ordering is asserted by test.

Routing: a router selects which composed pipeline to invoke per call. RoutingChatClient walks an ordered candidate list, so failover is 'more than one candidate' and single-provider alerting is the terminal failure of a one-entry walk -- absorbing and deleting FailoverChatClient, AlertingChatClientDecorator, NetclawChatClientProvider and ResilientChatClientProviderDecorator. The router takes a ChatRoutingContext so future per-session/per-provider routing slots in as a new policy; today's RoleBasedFailoverRouter reproduces the prior wiring.

Logging: LoggingChatClient is now stateless (the per-session-intended delta/cumulative token counters on the process singleton are gone) and tags log lines with the ambient session id via SessionLoggingScope, keyed 'SessionId' to match the WithContext('SessionId', ...) key used across actors so one Seq attribute correlates actor and chat-client logs.

Verified: full solution build (0 warnings); Netclaw.Daemon.Tests 668/668; Netclaw.Actors.Tests 2186/2186; slopwatch clean; file headers present.

* fix(providers): review fixes — SessionId on retry logs, drop AttemptIndex, streaming-failover tests

Self-review of the stack redesign found three issues:

1. RetryingChatClient.BackoffAsync logged retry warnings without the SessionId scope, so the hottest failure path reached Seq uncorrelated while LoggingChatClient/RoutingChatClient logs carried it. Open the scope around the warning.

2. ChatRoutingContext.AttemptIndex was structurally unfulfillable: the per-role RoutingChatClient context is frozen at construction and the router is called once per call (not per candidate), so it could never be non-zero. Removed it; SessionId remains the genuine per-session-routing seam.

3. The streaming terminal-unreachable branch of RoutingChatClient (last candidate fails pre-first-chunk -> provider.unreachable + rethrow) and single-candidate / streaming-cancellation behavior were untested. Added three streaming tests.

* refactor(providers): single retry layer — transport owns LLM transient retry

Reconcile the two pre-first-chunk retry layers (review follow-up): the actor's per-turn streaming retry and the transport's RetryingChatClient overlapped on the main session loop. Make the transport the single owner.

RetryPolicy.ShouldRetry now also recognizes the curated ProviderException (StatusCode 408/429/5xx, walking inner exceptions) that provider transports throw — previously only a raw HttpRequestException was retried, so the transport retry barely fired for real provider 429/5xx. That was the broader coverage the actor's LlmFailureClassifier.IsTransientStreaming provided.

The transport RetryPolicy is now configured from Session:Tuning:StreamingRetryPolicy (resolves the prior TODO; preserves the knob), making RetryingChatClient the single, configurable, uniform retry layer for every caller (main loop + sidecars).

Remove the actor per-turn retry: the LlmCallFailed pre-content retry block, the RetryLlmCallAfterBackoff message+handler, _streamingRetryAttempt, the streaming-retry timer, and the now-unused IsTransientStreaming. Context-overflow rollback and terminal FailCurrentTurn handling are unchanged.

Re-home coverage: the actor-level Streaming_retry_recovers/_exhaustion integration tests are removed (the actor no longer retries); a ProviderException(502) recover case is added to RetryingChatClientTests at the transport layer.

Verified: full build (0 warnings); Daemon 672/672; Actors 2184/2184; slopwatch + headers clean.

* refactor(openai): drop StreamingOnlyChatClient shim — Netclaw is streaming-only

The streaming-native chat-client stack converts every auxiliary caller
(title generation, memory extraction, compaction) to the streaming
transport, so no code path invokes the non-streaming GetResponseAsync on
the Codex client anymore. The StreamingOnlyChatClient wrapper (added in
0e5ba60 to serve those calls via streaming under the hood) is now dead;
the Codex 'Stream must be set to true' 400 is structurally unreachable.
A comment on the OAuth path records why no special handling is needed.
@Aaronontheweb

Copy link
Copy Markdown
Collaborator Author

superseded via #1309

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

providers Provider integrations and capability detection across OpenAI-compatible backends. reliability Retries, resilience, graceful degradation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant