diag(agents): stage-stall watchdog names the hung startup stage #10

Merged: matin merged 1 commit into main from stall-stage-watchdog on Jun 5, 2026

matin (Owner) commented Jun 5, 2026

Instrumentation for tulgey#238: stalled embedded runs currently leave nothing in the journal. With this change, any stage stuck for more than 30s logs `embedded run stage stalled: runId=... lastCompletedStage=<name> stalledMs=...` every 15s (capped at 10 reports), so the next stall names its stage. The watchdog stops at the first `snapshot()`, so completed phases never false-report. tsgo is clean and the stage-timing tests pass 5/5.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Improvements
    • Watchdog monitoring added to embedded agent operations that logs warnings when stalled for more than 30 seconds without progress, including stage details and stall duration information.

…e journal

tulgey#238: embedded runs stall between stage marks with zero log output, no
trace events, and no timeout — the journal shows 'embedded_run:started' and
then silence, making the hung stage undiagnosable in production. The stage
tracker now takes an optional watchdog that logs the last completed stage when
no new mark lands within 30s (15s poll, unref'd, stops at first snapshot(),
self-caps at 10 reports). Wired for startup stages (run.ts) and prep stages
(attempt.ts).
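
The watchdog behavior described above could look roughly like the following. This is a minimal sketch, not the merged code: every identifier (`createStageTracker`, `describeStall`, `StageWatchdogOptions`) is hypothetical, and only the behavior (30s threshold, 15s poll, unref'd timer, stop at first `snapshot()`, cap of 10 reports, log line shape) is taken from the description. It assumes a Node.js runtime.

```typescript
// Sketch under stated assumptions: all names here are hypothetical.
const DEFAULT_WARN_AFTER_MS = 30_000; // stall threshold from the PR text
const POLL_MS = 15_000;               // poll interval from the PR text
const MAX_REPORTS = 10;               // self-cap from the PR text

interface StageWatchdogOptions {
  label: string;                    // e.g. "runId=abc123"
  warn: (message: string) => void;  // e.g. (m) => log.warn(m)
  warnAfterMs?: number;             // defaults to DEFAULT_WARN_AFTER_MS
}

// Pure helper: returns the warning line once the stall exceeds the threshold,
// otherwise null. Kept separate from the timer so it is easy to test.
function describeStall(
  label: string,
  lastCompletedStage: string | undefined,
  lastMarkAtMs: number,
  nowMs: number,
  warnAfterMs: number,
): string | null {
  const stalledMs = nowMs - lastMarkAtMs;
  if (stalledMs <= warnAfterMs) return null;
  const stage = lastCompletedStage ?? "none";
  return `embedded run stage stalled: ${label} lastCompletedStage=${stage} stalledMs=${stalledMs}`;
}

function createStageTracker(options: { watchdog?: StageWatchdogOptions } = {}) {
  const marks: Array<{ stage: string; atMs: number }> = [];
  let lastMarkAtMs = Date.now();
  let reports = 0;
  let timer: ReturnType<typeof setInterval> | undefined;

  const stopWatchdog = (): void => {
    if (timer !== undefined) {
      clearInterval(timer);
      timer = undefined;
    }
  };

  const watchdog = options.watchdog;
  if (watchdog) {
    const warnAfterMs = watchdog.warnAfterMs ?? DEFAULT_WARN_AFTER_MS;
    timer = setInterval(() => {
      const last = marks[marks.length - 1]?.stage;
      const message = describeStall(watchdog.label, last, lastMarkAtMs, Date.now(), warnAfterMs);
      if (message === null) return;
      watchdog.warn(message);
      reports += 1;
      if (reports >= MAX_REPORTS) stopWatchdog(); // self-cap
    }, POLL_MS);
    // In Node.js, unref() keeps a pending watchdog from holding the process open.
    (timer as unknown as { unref?: () => void }).unref?.();
  }

  return {
    mark(stage: string): void {
      lastMarkAtMs = Date.now();
      marks.push({ stage, atMs: lastMarkAtMs });
    },
    snapshot(): Array<{ stage: string; atMs: number }> {
      stopWatchdog(); // a completed phase must never report a stall
      return [...marks];
    },
  };
}
```

Splitting the threshold check into a pure function is a design choice of this sketch: the timer only decides when to check, while the message and threshold logic can be exercised without waiting on real time.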

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

matin merged commit d63adad into main on Jun 5, 2026
matin deleted the stall-stage-watchdog branch on June 5, 2026 16:26
coderabbitai Bot commented Jun 5, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 5a38527 and 5ae0bfa.

📒 Files selected for processing (3)
  • src/agents/embedded-agent-runner/run.ts
  • src/agents/embedded-agent-runner/run/attempt-stage-timing.ts
  • src/agents/embedded-agent-runner/run/attempt.ts

📝 Walkthrough

Stage trackers for embedded agent runs now support an optional watchdog that detects stalls lasting longer than a configured threshold (30s default) and logs structured warnings via a callback. The implementation tracks elapsed time since the last stage mark, caps warnings at 10 reports, and cleans up when the phase completes. Run startup and attempt prep trackers are wired to use this watchdog, labeling warnings with their respective context.

Changes

Stall detection watchdog

| Layer | File(s) | Summary |
| --- | --- | --- |
| Watchdog stall detection contract and implementation | `src/agents/embedded-agent-runner/run/attempt-stage-timing.ts` | Options type extended with optional watchdog configuration: `label`, `warn` callback, and `warnAfterMs` threshold. Watchdog timer implementation uses `setInterval` to detect stalls since the previous mark, emits structured warnings when a stall exceeds the threshold, self-limits to 10 reports, and is cleaned up when `snapshot()` completes. |
| Watchdog wiring in tracker initialization | `src/agents/embedded-agent-runner/run.ts`, `src/agents/embedded-agent-runner/run/attempt.ts` | Run startup and attempt prep stage trackers now initialize with watchdog config that labels warnings by runId/sessionId and phase context, routing them through `log.warn`. |

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A watchdog hops through the stages, keeping watch with care,
When staleness creeps in like a fox through the air,
It logs a fair warning ten times at most,
Then returns to its sleep—a most diligent post!

matin added a commit that referenced this pull request Jun 9, 2026
…aw#247) (#17)

* feat(speech): native audio output via Vertex ADC route (tulgey openclaw#247)

The Google speech provider already emits native generateContent AUDIO
(gemini-3.1-flash-tts-preview, responseModalities:['AUDIO'] + speechConfig)
and already transcodes to opus-in-ogg for voice-note delivery. The only
gap was auth: it knew the AI-Studio key route only and threw "Google API
key missing" on a keyless Vertex deployment (tulgey #10). This adds the
Vertex ADC route so native output is the primary path on the deployment.

- Add a Vertex ADC synthesis route (synthesizeGoogleVertexTtsPcm) that
  rides resolveGoogleVertexAuthorizedUserHeaders (the same ADC bearer the
  Google chat/Veo paths use), POSTing to
  aiplatform.googleapis.com/v1/projects/{P}/locations/{global}/publishers/
  google/models/{model}:generateContent. Body, PCM extraction, WAV-wrap,
  and opus transcode are shared verbatim with the AI-Studio route.
- Route selection (resolveGoogleTtsPcm): AI-Studio key route stays primary;
  fall to the Vertex ADC route when no key but ADC is present; throw with
  neither so the speech provider-order fallback (Cloud TTS -> text) trips
  on a detected failure, never a silent degrade (ADR 0024 clause 2).
- isConfigured is now ADC-aware so the provider is selected keyless.
- Extract buildGoogleSpeechGenerateContentBody (shared by both routes).
- Test: Vertex generateContent URL shape (global + regional).
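
The route selection and URL shape listed above might be sketched as follows. The function names `selectGoogleTtsRoute` and `buildVertexGenerateContentUrl` and their signatures are hypothetical stand-ins, not the names in the commit (which mentions `resolveGoogleTtsPcm` without showing its signature). The regional host form `{location}-aiplatform.googleapis.com` is an assumption about what the "global + regional" test covers.

```typescript
// Sketch only: hypothetical helpers illustrating the selection order
// described in the commit message (key route primary, ADC fallback, else throw).
type GoogleTtsRoute = "ai-studio" | "vertex-adc";

function selectGoogleTtsRoute(input: { apiKey?: string; hasAdc: boolean }): GoogleTtsRoute {
  // The AI-Studio key route stays primary whenever a key is configured.
  if (input.apiKey !== undefined && input.apiKey.trim() !== "") return "ai-studio";
  // No key, but ADC is present: fall to the Vertex ADC route.
  if (input.hasAdc) return "vertex-adc";
  // Neither: throw so the provider-order fallback sees a detected failure
  // instead of a silent degrade.
  throw new Error("Google speech: no API key and no ADC credentials available");
}

function buildVertexGenerateContentUrl(project: string, location: string, model: string): string {
  // Assumption: "global" uses the bare host; regions use a location-prefixed host.
  const host =
    location === "global" ? "aiplatform.googleapis.com" : `${location}-aiplatform.googleapis.com`;
  return (
    `https://${host}/v1/projects/${encodeURIComponent(project)}` +
    `/locations/${encodeURIComponent(location)}` +
    `/publishers/google/models/${encodeURIComponent(model)}:generateContent`
  );
}
```

Throwing when neither credential is available, rather than returning a placeholder, is what lets a caller's fallback chain react to the failure.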

Implements the membrane row of tulgey#247 / ADR 0024. Existing AI-Studio
tests unaffected (real keys take the unchanged route).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(google): clear pre-existing oxlint errors in the Veo provider

lint:extensions:bundled lints the whole extensions/google package, so
these errors (introduced with the Veo REST fallback in #5, never linted
since no later PR touched the package) block any PR that touches the
extension. Surfaced by the native-audio-output change.

- resolveVertexOAuthToken: brace the metadata-token if, type res.json()
  as { access_token?: string } (drops the unnecessary `as any`), and
  omit the unused catch binding.
- brace the "Force rest fallback for Vertex" guard.

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(speech): route Vertex TTS through guarded postJsonRequest

The new Vertex ADC route used a raw fetch(), which trips the
no-raw-channel-fetch boundary guard. Route it through postJsonRequest
(the same guarded helper the AI-Studio route uses) so SSRF/dispatcher
policy and timeout handling apply uniformly; drop the manual
AbortController.

Also allowlist the pre-existing Veo metadata-server fetch
(video-generation-provider.ts:44, http://metadata.google.internal —
link-local, must be raw; the SSRF guard intentionally blocks it). It
predates this work and was surfaced when the PR first touched the
package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>