fix(voice-call): await STT readiness before initial greeting (#75197) #75257

Merged
steipete merged 2 commits into openclaw:main from PfanP:fix/voice-call-stt-startup-readiness
May 1, 2026

PfanP (Contributor) commented Apr 30, 2026

The Twilio media-stream startup raced TTS playback against the OpenAI realtime transcription WebSocket handshake: handleStart called onConnect (which fires manager.speakInitialMessage immediately) and then started sttSession.connect() fire-and-forget. Under event-loop contention from TTS startup, the STT WebSocket handshake timed out at 10 s and left the call half-functional: the greeting played, but caller speech never reached the agent. A direct OpenAI realtime WebSocket probe from the same host succeeded in ~1.1 s.
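The ordering described above can be sketched as follows. This is a minimal illustration, not the plugin's code: SttSession, handleStartOld, and the log array are hypothetical stand-ins, and only the call order (greeting first, STT connect un-awaited) is taken from the description.

```typescript
// Hypothetical stand-in for the realtime transcription session.
interface SttSession {
  connect(): Promise<void>;
}

// Old ordering as described above: greet first, connect STT without awaiting.
function handleStartOld(
  stt: SttSession,
  onConnect: () => void,
  log: string[],
): void {
  onConnect(); // the greeting (TTS) starts immediately
  // Fire-and-forget: nothing upstream learns about a failed handshake.
  stt.connect().then(
    () => { log.push("stt-ready"); },
    () => { log.push("stt-failed"); }, // call stays up, caller is never heard
  );
}
```

With a connect() that rejects, the greeting is still recorded first and the failure only shows up in the log, which mirrors the half-functional call reported in the issue.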

Establish STT readiness before firing onConnect so TTS startup cannot starve the STT handshake. When the STT connect rejects, close the STT session, end the Twilio media stream with a 1011 close code, and fire onDisconnect so the voice-call manager hangs up the call on the existing grace path instead of silently leaving the caller on a deaf stream.
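A minimal sketch of the startup order proposed above, assuming simplified interfaces. SttSession, MediaSocket, StreamCallbacks, and this handleStart signature are illustrative assumptions rather than the plugin's real types. Only the sequence comes from the description: await STT, then onConnect; on failure close STT, close the stream with 1011, and fire onDisconnect.

```typescript
// Hypothetical, simplified shapes; the real plugin types may differ.
interface SttSession {
  connect(): Promise<void>;
  close(): void;
}
interface MediaSocket {
  close(code: number, reason: string): void;
}
interface StreamCallbacks {
  onConnect(): void;
  onDisconnect(): void;
}

// 1011 is the WebSocket "internal error" close code (RFC 6455).
const STT_FAILURE_CLOSE_CODE = 1011;

// Returns true when the stream is ready for the greeting, false when it was torn down.
async function handleStart(
  stt: SttSession,
  ws: MediaSocket,
  callbacks: StreamCallbacks,
): Promise<boolean> {
  try {
    await stt.connect(); // STT readiness comes first
  } catch {
    stt.close();
    ws.close(STT_FAILURE_CLOSE_CODE, "STT connection failed");
    callbacks.onDisconnect(); // lets the manager hang up on its grace path
    return false;
  }
  callbacks.onConnect(); // only now may the greeting start
  return true;
}
```

The boolean return value is an addition for the sketch so a caller can tell the two paths apart; the description does not say whether the real handler returns anything.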

Fixes #75197.


openclaw-barnacle Bot added the labels channel: voice-call, size: S, and triage: blank-template (Candidate: PR template appears mostly untouched) on Apr 30, 2026
clawsweeper Bot (Contributor) commented Apr 30, 2026

Codex review: needs maintainer review before merge.

What this changes:

This PR updates Voice Call's Twilio media-stream startup to register accepted streams immediately, defer the initial greeting until realtime transcription is ready, close failed STT startups, and add docs, changelog, and regression tests.

Maintainer follow-up before merge:

No repair lane is needed: I found no discrete automated blocker in the current head, and the remaining action is normal maintainer review plus required CI completion.

Security review:

Security review cleared: The diff changes Voice Call startup sequencing, tests, docs, and changelog only; it does not add dependencies, CI execution, package resolution, permissions, secrets handling, or new endpoints.

Review details

Best possible solution:

Land this PR, or an equivalent narrow replacement, after normal maintainer review and required CI complete; keep the linked bug open until the fixing PR merges.

Do we have a high-confidence way to reproduce the issue?

Yes. The linked bug includes concrete Twilio/OpenAI setup, redacted config, exact timeout logs, and persisted call-state evidence, and current main still shows the same startup order in source.

Is this the best way to solve the issue?

Yes. The PR keeps Twilio stream registration immediate for routing and disconnect grace, delays only the initial greeting until STT readiness, relies on the existing audio queue, and closes failed STT startups instead of leaving a deaf stream.
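The shape described in this answer (register immediately, gate only the greeting, queue early audio, close on STT failure) might look roughly like the sketch below. createStreamStartup, StartupHooks, and the in-memory pending array are hypothetical; the review only says the PR relies on an existing audio queue and does not show it.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface SttSession {
  connect(): Promise<void>;
  sendAudio(chunk: Uint8Array): void;
  close(): void;
}
interface StartupHooks {
  register(): void;      // make the stream routable right away
  speakGreeting(): void; // initial greeting, gated on STT readiness
  hangUp(): void;        // teardown when STT never becomes ready
}

function createStreamStartup(stt: SttSession, hooks: StartupHooks) {
  const pending: Uint8Array[] = []; // early inbound audio kept while STT connects
  let state: "connecting" | "ready" | "failed" = "connecting";

  return {
    async start(): Promise<void> {
      hooks.register(); // registration does not wait for STT
      try {
        await stt.connect();
      } catch {
        state = "failed";
        pending.length = 0; // nothing can consume the queued audio
        stt.close();
        hooks.hangUp();
        return;
      }
      state = "ready";
      // Flush queued audio in arrival order, then allow the greeting.
      for (const chunk of pending.splice(0)) stt.sendAudio(chunk);
      hooks.speakGreeting();
    },
    onMedia(chunk: Uint8Array): void {
      if (state === "ready") stt.sendAudio(chunk);
      else if (state === "connecting") pending.push(chunk);
      // state "failed": drop the audio, the stream is being closed
    },
  };
}
```

Flushing the queue before the greeting is a choice made for this sketch; the review does not state the relative order of those two steps.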

What I checked:

Likely related people:

  • steipete: Local history and API metadata show Peter Steinberger restored and recently maintained the central Voice Call media stream, webhook lifecycle, and related PR head changes. (role: recent maintainer and likely follow-up owner; confidence: high; commits: 42c17adb5e4d, 1d8968c8a821, 9f691099dbd9; files: extensions/voice-call/src/media-stream.ts, extensions/voice-call/src/webhook.ts, docs/plugins/voice-call.md)
  • joshavant: Commit metadata shows Josh Avant authored a broad Voice Call spoken-output and stream TTS regression fix touching the same media-stream, webhook, and initial spoken-output area. (role: adjacent owner; confidence: medium; commits: 3f7f2c8dc96e; files: extensions/voice-call/src/media-stream.ts, extensions/voice-call/src/webhook.ts, extensions/voice-call/src/manager/outbound.ts)
  • eleqtrizit: Commit metadata shows Agustin Rivera recently tightened voice stream ingress guards in the media stream and webhook paths involved in this startup lifecycle. (role: recent adjacent maintainer; confidence: medium; commits: 692438cbb22e; files: extensions/voice-call/src/media-stream.ts, extensions/voice-call/src/webhook.ts)
  • dguido: Commit metadata shows Dan Guido worked on the Voice Call TTS queue path adjacent to the greeting and stream playback behavior affected here. (role: adjacent media-stream contributor; confidence: medium; commits: 101d0f451f23; files: extensions/voice-call/src/media-stream.ts, extensions/voice-call/src/webhook.ts, extensions/voice-call/src/providers/twilio.ts)

Remaining risk / open question:

  • Some broader PR checks were still in progress at review time, so merge should still wait for the required CI set to finish.
  • This read-only review did not rerun a live Twilio/OpenAI/Tailscale call; the reproduction confidence comes from the linked live logs, current-main source shape, and added focused tests.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4ea0556f6428.

steipete force-pushed the fix/voice-call-stt-startup-readiness branch from f5f778a to b97b780 on May 1, 2026, 05:14
openclaw-barnacle Bot added the docs (Improvements or additions to documentation) and size: M labels and removed the size: S label on May 1, 2026
steipete merged commit e8f9c3e into openclaw:main on May 1, 2026
79 checks passed
steipete (Contributor) commented May 1, 2026

Landed via squash merge onto main.

  • Gate: targeted voice-call/STT tests, docs/changelog checks, Testbox OPENCLAW_TESTBOX=1 pnpm check:changed, and GitHub CI on the final head SHA.
  • Final PR head: aaae466
  • Squash commit: e8f9c3e

Thanks @PfanP!

lxe pushed a commit to lxe/openclaw that referenced this pull request on May 6, 2026
Fix Twilio voice-call startup so accepted media streams register immediately, realtime transcription readiness gates only the initial greeting, and early inbound media is preserved while STT connects.

Fixes openclaw#75197.
Thanks @PfanP and @donkeykong91.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request on May 9, 2026
Fix Twilio voice-call startup so accepted media streams register immediately, realtime transcription readiness gates only the initial greeting, and early inbound media is preserved while STT connects.

Fixes openclaw#75197.
Thanks @PfanP and @donkeykong91.
Labels

channel: voice-call, docs, size: M, triage: blank-template

Development

Successfully merging this pull request may close these issues.

[Bug]: voice-call OpenAI realtime transcription times out during Twilio media stream while direct WebSocket succeeds

2 participants