diag(agents): stage-stall watchdog names the hung startup stage #10
…e journal tulgey#238: embedded runs stall between stage marks with zero log output, no trace events, and no timeout — the journal shows 'embedded_run:started' and then silence, making the hung stage undiagnosable in production. The stage tracker now takes an optional watchdog that logs the last completed stage when no new mark lands within 30s (15s poll, unref'd, stops at first snapshot(), self-caps at 10 reports). Wired for startup stages (run.ts) and prep stages (attempt.ts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 Walkthrough
Stage trackers for embedded agent runs now support an optional watchdog that detects stalls lasting longer than a configured threshold (30s default) and logs structured warnings via a callback. The implementation tracks elapsed time since the last stage mark, caps warnings at 10 reports, and cleans up when the phase completes. Run startup and attempt prep trackers are wired to use this watchdog, labeling warnings with their respective context.
Changes: Stall detection watchdog
🎯 2 (Simple) | ⏱️ ~12 minutes
…aw#247) (#17)

* feat(speech): native audio output via Vertex ADC route (tulgey openclaw#247)

The Google speech provider already emits native generateContent AUDIO (gemini-3.1-flash-tts-preview, responseModalities:['AUDIO'] + speechConfig) and already transcodes to opus-in-ogg for voice-note delivery. The only gap was auth: it knew the AI-Studio key route only and threw "Google API key missing" on a keyless Vertex deployment (tulgey #10). This adds the Vertex ADC route so native output is the primary path on the deployment.

- Add a Vertex ADC synthesis route (synthesizeGoogleVertexTtsPcm) that rides resolveGoogleVertexAuthorizedUserHeaders (the same ADC bearer the Google chat/Veo paths use), POSTing to aiplatform.googleapis.com/v1/projects/{P}/locations/{global}/publishers/google/models/{model}:generateContent. Body, PCM extraction, WAV-wrap, and opus transcode are shared verbatim with the AI-Studio route.
- Route selection (resolveGoogleTtsPcm): AI-Studio key route stays primary; fall to the Vertex ADC route when no key but ADC is present; throw with neither so the speech provider-order fallback (Cloud TTS -> text) trips on a detected failure, never a silent degrade (ADR 0024 clause 2).
- isConfigured is now ADC-aware so the provider is selected keyless.
- Extract buildGoogleSpeechGenerateContentBody (shared by both routes).
- Test: Vertex generateContent URL shape (global + regional).

Implements the membrane row of tulgey#247 / ADR 0024. Existing AI-Studio tests unaffected (real keys take the unchanged route).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(google): clear pre-existing oxlint errors in the Veo provider

lint:extensions:bundled lints the whole extensions/google package, so these errors (introduced with the Veo REST fallback in #5, never linted since no later PR touched the package) block any PR that touches the extension. Surfaced by the native-audio-output change.

- resolveVertexOAuthToken: brace the metadata-token if, type res.json() as { access_token?: string } (drops the unnecessary `as any`), and omit the unused catch binding.
- brace the "Force rest fallback for Vertex" guard.

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(speech): route Vertex TTS through guarded postJsonRequest

The new Vertex ADC route used a raw fetch(), which trips the no-raw-channel-fetch boundary guard. Route it through postJsonRequest (the same guarded helper the AI-Studio route uses) so SSRF/dispatcher policy and timeout handling apply uniformly; drop the manual AbortController.

Also allowlist the pre-existing Veo metadata-server fetch (video-generation-provider.ts:44, http://metadata.google.internal; it is link-local and must be raw, and the SSRF guard intentionally blocks it). It predates this work and was surfaced when the PR first touched the package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
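The route-selection rule in the speech commit above (key route first, Vertex ADC second, throw when neither is available) can be sketched roughly as follows. This is only an illustration: `selectGoogleTtsRoute`, `buildVertexGenerateContentUrl`, their parameters, and the error text are hypothetical names, not the functions in the commit, and the regional host form is assumed from Vertex AI's usual `{location}-aiplatform.googleapis.com` endpoint convention.

```typescript
// Hypothetical sketch; names and signatures are assumptions, not the real provider code.
type GoogleTtsRoute = "ai-studio-key" | "vertex-adc";

// Key route stays primary, ADC is the fallback, and having neither is a
// thrown error so a provider-order fallback can react to a detected failure.
function selectGoogleTtsRoute(input: { apiKey?: string; hasAdc: boolean }): GoogleTtsRoute {
  if (input.apiKey) return "ai-studio-key";
  if (input.hasAdc) return "vertex-adc";
  throw new Error("Google speech: no API key and no ADC credentials");
}

// URL shape described in the commit message. The regional host below is an
// assumption; only the global host appears in the source text.
function buildVertexGenerateContentUrl(project: string, location: string, model: string): string {
  const host =
    location === "global" ? "aiplatform.googleapis.com" : `${location}-aiplatform.googleapis.com`;
  return `https://${host}/v1/projects/${project}/locations/${location}/publishers/google/models/${model}:generateContent`;
}
```

Throwing in the no-credential case, rather than returning an empty result, is what lets the caller fall through to the next provider on an explicit failure instead of degrading silently.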
Instrumentation for tulgey#238: stalled embedded runs currently leave nothing in the journal. With this, any stage stuck >30s logs `embedded run stage stalled: runId=... lastCompletedStage=<name> stalledMs=...` every 15s (capped at 10), so the next stall names its stage. Watchdog stops at first `snapshot()` so completed phases never false-report. tsgo clean, stage-timing tests 5/5.
🤖 Generated with Claude Code
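For readers who want the mechanism in one place, here is a minimal sketch of a stage tracker with an optional stall watchdog, written from the description above rather than from the PR's diff. Every name in it (`createStageTracker`, `WatchdogOptions`, `StallReport`, the exposed `poll`, the injectable `now`) is hypothetical; only the defaults (30s threshold, 15s poll, cap of 10, unref'd timer, stop at first `snapshot()`) come from the PR text.

```typescript
// Hypothetical sketch; not the code in run.ts or attempt.ts.
type StallReport = { lastCompletedStage: string; stalledMs: number; report: number };

interface WatchdogOptions {
  onStall: (report: StallReport) => void;
  stallMs?: number; // stall threshold (30s in the PR)
  pollMs?: number; // poll interval (15s in the PR)
  maxReports?: number; // self-cap (10 in the PR)
  now?: () => number; // injectable clock, added here only to make the sketch testable
}

function createStageTracker(watchdog?: WatchdogOptions) {
  const now = watchdog?.now ?? Date.now;
  const marks: Array<{ stage: string; at: number }> = [];
  let lastMarkAt = now();
  let lastStage = "(none)";
  let reports = 0;
  let stopped = !watchdog;
  let timer: ReturnType<typeof setInterval> | undefined;

  const stop = () => {
    stopped = true;
    if (timer !== undefined) {
      clearInterval(timer);
      timer = undefined;
    }
  };

  // One poll: report only if no new mark landed within the threshold.
  const poll = () => {
    if (stopped || !watchdog) return;
    const stalledMs = now() - lastMarkAt;
    if (stalledMs < (watchdog.stallMs ?? 30_000)) return;
    reports += 1;
    watchdog.onStall({ lastCompletedStage: lastStage, stalledMs, report: reports });
    if (reports >= (watchdog.maxReports ?? 10)) stop();
  };

  if (watchdog) {
    timer = setInterval(poll, watchdog.pollMs ?? 15_000);
    // In Node.js, unref() keeps the watchdog from holding the process open.
    (timer as unknown as { unref?: () => void }).unref?.();
  }

  return {
    mark(stage: string) {
      lastStage = stage;
      lastMarkAt = now();
      marks.push({ stage, at: lastMarkAt });
    },
    // The first snapshot ends the phase, so the watchdog stops here.
    snapshot() {
      stop();
      return marks.slice();
    },
    poll, // exposed so the sketch can be exercised without waiting on the timer
  };
}
```

A caller would pass an `onStall` callback that formats the report into the journal line quoted above, adding its own `runId` and context label. Stopping the watchdog inside `snapshot()` is what prevents a completed phase from being reported as stalled later.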
Summary by CodeRabbit