feat(azure-speech): add realtime transcription provider for voice-call#73456
ottodeng wants to merge 17 commits into openclaw:main from …
Conversation
Greptile Summary

This PR adds Azure Speech as a realtime transcription provider for the voice-call plugin.
Confidence Score: 3/5

Not safe to merge as-is: lazy connect silently drops the first audio frame and fires an unexpected onError on every session that skips explicit connect(). One P1 data-loss bug (first audio frame discarded on lazy connect) with no test coverage catching it, capping confidence at 4; the pattern affects a core advertised feature path, pulling it down to 3.

- extensions/azure-speech/realtime-transcription-session.ts (lazy-connect sendAudio race)
- extensions/azure-speech/realtime-transcription-lifecycle.test.ts (missing assertions on audio delivery and onError in lazy-connect test)
extensions/azure-speech/realtime-transcription-session.ts (lines 281-289):

```ts
sendAudio(audio: Buffer) {
  if (!connected && !connecting) {
    // Lazy connect on first audio frame for parity with sibling providers.
    connect().catch((error) =>
      handleError(error instanceof Error ? error : new Error(String(error))),
    );
  }
  sendAudio(audio);
},
```
**First audio frame dropped on lazy connect**

When `sendAudio` is invoked before `connect()` has resolved, it fires `connect()` asynchronously and then immediately calls the inner `sendAudio`. Because `pushStream` is still `undefined` at that synchronous moment, the inner guard triggers:

```ts
if (!pushStream) {
  handleError(new Error("Azure Speech push stream is not initialized"));
  return;
}
```

The audio from that triggering frame is silently discarded and `onError` fires unexpectedly on every lazy-connect caller. The test at line 353 (`lazily connects on the first audio frame`) only asserts that `startSpy` was called; it never checks `pushStream.writeSpy` or `onError`, so this data-loss path goes undetected.

If parity with sibling providers requires queuing audio until the stream is ready, the first frame should be buffered and flushed once `connect()` resolves rather than being forwarded before `pushStream` exists.
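A minimal sketch of the buffering approach the comment suggests — the `pendingFrames` queue, `writeToPushStream`, and the declared state flags are illustrative stand-ins, not the session's actual internals:

```ts
// Assumed session state and helpers — illustrative, not the provider's real code.
declare let connected: boolean;
declare let connecting: boolean; // assumed to be managed by connect()
declare function connect(): Promise<void>;
declare function writeToPushStream(frame: Buffer): void;
declare function handleError(error: Error): void;

const pendingFrames: Buffer[] = [];

function sendAudio(audio: Buffer): void {
  if (!connected) {
    // Queue instead of forwarding into a push stream that does not exist yet.
    pendingFrames.push(audio);
    if (!connecting) {
      connect()
        .then(() => {
          // Flush everything that queued up while the recognizer was starting.
          for (const frame of pendingFrames.splice(0)) writeToPushStream(frame);
        })
        .catch((error) =>
          handleError(error instanceof Error ? error : new Error(String(error))),
        );
    }
    return;
  }
  writeToPushStream(audio);
}
```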
extensions/azure-speech/realtime-transcription-provider.ts (lines 217-223):

```ts
isConfigured: ({ providerConfig }) => {
  const normalized = normalizeAzureSpeechRealtimeProviderConfig(providerConfig);
  if (!normalized.apiKey) {
    return false;
  }
  return Boolean(normalized.endpoint || normalized.region);
},
```
**Redundant re-normalization inside `isConfigured`**

`providerConfig` passed to `isConfigured` is already the resolved output of `resolveConfig` (which returns an `AzureSpeechRealtimeProviderConfig`). Calling `normalizeAzureSpeechRealtimeProviderConfig` on it again is wasteful and relies on the normalization being idempotent. The check can use the typed fields directly:

```ts
isConfigured: ({ providerConfig }) => {
  const cfg = providerConfig as AzureSpeechRealtimeProviderConfig;
  if (!cfg.apiKey) return false;
  return Boolean(cfg.endpoint || cfg.region);
},
```
Codex review: needs real behavior proof before merge.

Reproducibility: not applicable — this is a feature PR, not a bug report. The merge blocker is source-reproducible by comparing the PR head manifest with the PR head generated plugin docs, and real behavior proof is absent from the discussion.

Best possible solution: keep Azure STT in the azure-speech plugin and merge after the generated plugin docs are regenerated and the contributor attaches redacted live transcription output or logs from a real Azure Speech run.

Is this the best way to solve the issue? Mostly yes: the existing realtime transcription registry and azure-speech plugin are the right boundary for this provider. The current patch is not merge-ready until generated plugin docs are synchronized and after-fix real behavior proof is supplied.

Full review comments:
- Overall correctness: patch is incorrect.
- Acceptance criteria: …
- What I checked: …
- Likely related people: …
- Remaining risk / open question: …
Codex review notes: model gpt-5.5, reasoning high; reviewed against 9fa685e3b3e4.
Force-pushed from 1739205 to dd7626b.
Register Azure Speech as a realtime transcription provider, joining Deepgram, ElevenLabs, Mistral, OpenAI, and xAI in the bundled provider list. Voice Call streaming can now select `azure-speech` (or alias `azure`) instead of being limited to non-Microsoft transcription backends.

The provider uses the official `microsoft-cognitiveservices-speech-sdk` package over Azure's continuous-recognition WebSocket protocol, which already handles partial results, automatic reconnects, and final-utterance detection. The SDK is loaded lazily on first use, so installations that only use Azure Speech TTS pay no extra startup cost.

Configuration lives under `plugins.entries.voice-call.config.streaming.providers.azure-speech.*` and reuses the existing `AZURE_SPEECH_KEY` / `AZURE_SPEECH_REGION` / `SPEECH_KEY` / `SPEECH_REGION` env-var fallbacks. New options:

- `language` (default en-US, BCP-47)
- `sampleRate` (default 8000, matches telephony media streams)
- `encoding` (`pcm` / `mulaw` / `alaw`, default mulaw)
- `initialSilenceTimeoutMs` / `endSilenceTimeoutMs`
- `candidateLanguages` (auto language detection)
- `endpoint` (sovereign cloud / private endpoint)

Tests: 33 unit tests cover config normalization (env fallbacks, provider sub-config, encoding aliases), the registry contract (`isConfigured` / `createSession`), and session lifecycle (connect via subscription/endpoint/auto-detect, partial vs final transcripts, NoMatch handling, error propagation, audio overflow, lazy connect, graceful close). A live integration test synthesizes telephony audio with the existing TTS provider and feeds it back through the new STT provider end-to-end (skipped without `AZURE_SPEECH_KEY`).

Docs: `docs/providers/azure-speech.md` adds an STT section and config table; `docs/plugins/voice-call.md` adds an Azure Speech tab to the streaming provider examples.
Force-pushed from 472d372 to 879e61c.
… isConfigured

Address Greptile review on PR openclaw#73456:

P1 — first audio frame dropped on lazy connect: when sendAudio() ran before connect() resolved, the inner sendAudio was invoked synchronously while pushStream was still undefined, triggering the 'push stream is not initialized' onError and silently discarding the triggering frame. Now buffer pending audio frames (capped at AZURE_SPEECH_REALTIME_MAX_QUEUED_BYTES), kick off connect() once, and flush them after the push stream is ready. The lazy-connect lifecycle test now also asserts the buffered frame is delivered to the push stream and that no unexpected onError fires.

P2 — redundant re-normalization in isConfigured: providerConfig is already the resolved output of resolveConfig, so use the typed AzureSpeechRealtimeProviderConfig fields directly instead of calling normalize again.
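A rough sketch of the strengthened test assertions this commit describes, in Vitest style; `createSession`, `startSpy`, and `writeSpy` are stand-ins for the lifecycle test's actual harness:

```ts
import { describe, expect, it, vi } from "vitest";

// Illustrative stand-ins for the lifecycle test's real harness and spies.
declare const startSpy: ReturnType<typeof vi.fn>;
declare const writeSpy: ReturnType<typeof vi.fn>;
declare function createSession(opts: { onError: (error: Error) => void }): {
  sendAudio(audio: Buffer): void;
};

describe("lazy connect", () => {
  it("delivers the buffered first frame and raises no error", async () => {
    const onError = vi.fn();
    const session = createSession({ onError });

    session.sendAudio(Buffer.from([0x7f]));

    await vi.waitFor(() => expect(startSpy).toHaveBeenCalled());       // recognition started lazily
    await vi.waitFor(() => expect(writeSpy).toHaveBeenCalledTimes(1)); // queued frame flushed to the push stream
    expect(onError).not.toHaveBeenCalled();                            // no "not initialized" error
  });
});
```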
Thanks for the review! Pushed fixes for both findings.

P1 — first audio frame dropped on lazy connect: now buffer pending audio frames in a small queue (capped at the existing `AZURE_SPEECH_REALTIME_MAX_QUEUED_BYTES` limit), kick off `connect()` once, and flush them after the push stream is ready.

P2 — redundant re-normalization in `isConfigured`: use the typed `AzureSpeechRealtimeProviderConfig` fields directly instead of normalizing again.

All 33 azure-speech unit tests still pass locally. The remaining red CI checks on this PR (…).
…ealtime-stt # Conflicts: # docs/.generated/config-baseline.sha256
Resolve conflict in docs/.generated/config-baseline.sha256 by taking the regenerated baseline from main; this PR does not introduce new config schema rows so the latest main hashes are correct.
Voice Call invokes session.connect() in the background, which awaits the async SDK load and recognizer start. If the websocket closes during that window, close() previously set closing=true but the in-flight chain continued to allocate the push stream and recognizer, leaving them and the upstream Azure socket alive past close(). Now the connect chain checks closing after each await and tears down any freshly created push stream / recognizer / started recognition. teardown() also bails when connect() is still pending and nothing has been allocated yet, leaving cleanup to the connect chain itself.
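A minimal sketch of the close-during-connect guard described above, with hypothetical helper names standing in for the session internals:

```ts
// Sketch of the guard pattern: after every await in connect(), re-check `closing` and tear
// down anything allocated so far. Helper names are illustrative, not the real session code.
declare function loadSdk(): Promise<void>;
declare function createPushStream(): { close(): void };
declare function createRecognizer(stream: unknown): { close(): void };
declare function startRecognition(recognizer: unknown): Promise<void>;
declare function stopRecognition(recognizer: unknown): Promise<void>;

let closing = false;

export async function connect(): Promise<void> {
  await loadSdk();
  if (closing) return; // close() raced before anything was allocated — nothing to clean up

  const pushStream = createPushStream();
  const recognizer = createRecognizer(pushStream);
  if (closing) {
    recognizer.close(); // close() raced mid-setup: release what this chain just created
    pushStream.close();
    return;
  }

  await startRecognition(recognizer);
  if (closing) {
    await stopRecognition(recognizer); // close() raced after start: stop and release everything
    recognizer.close();
    pushStream.close();
  }
}
```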
Addressed the three P2 clawsweeper findings:

1. Abort connection when session is closed (…).
2. Regenerated config-doc baseline (…).
3. CHANGELOG entry added under Unreleased › Changes.

Commit 6b1de3f.

Validation: …

HEAD: 6b1de3f
…ealtime-stt # Conflicts: # CHANGELOG.md # docs/.generated/config-baseline.sha256
…ealtime-stt # Conflicts: # CHANGELOG.md
Merged latest `main`.
…installs

Address clawsweeper [P1]: extension-local declaration of microsoft-cognitiveservices-speech-sdk does not survive packaged bundled installs (postinstall does not install plugin package dependencies, bundled runtime staging skips plugin node_modules). Hoisting the SDK to the root manifest makes it resolvable from packaged installs while keeping the provider extension-owned. The dependency stays a runtime dependency (not devDependencies) since the Azure Speech provider lazily loads it at provider-runtime time when realtime transcription is requested.
Addressed clawsweeper [P1] in …

Bug: …

Surface: bundled distribution installs (NPM tarball, Docker image, macOS bundle) loading the Azure Speech extension's realtime transcription provider.

Fix: hoisted `microsoft-cognitiveservices-speech-sdk` to the root manifest so it resolves from packaged installs while the provider stays extension-owned.

Why best: the alternative (moving Azure Speech behind an explicit downloadable plugin install path) is a larger architectural change that affects all SDK-heavy providers; hoisting the dep matches the existing pattern used by …
…ealtime-stt # Conflicts: # CHANGELOG.md
…ealtime-stt # Conflicts: # CHANGELOG.md
…ealtime-stt # Conflicts: # CHANGELOG.md
Force-pushed from 75ba3ff to 07e2403.
Force-pushed from 07e2403 to a8809d3.
…ealtime-stt # Conflicts: # CHANGELOG.md
Closing — this grew into an XL change (size: XL, +1818) without the "Real behavior proof" maintainers expect for external PRs of this scope. Will rework as a smaller, scoped PR (provider config + minimal wiring) with end-to-end transcription evidence on a real device before reopening.
Why

OpenClaw's `voice-call` plugin can stream live call audio to a realtime transcription provider, and it already has a clean pluggable provider interface (added by #68697). Today five providers are registered out of the box — Deepgram, ElevenLabs, Mistral, OpenAI, and xAI — but Azure Speech is missing, even though OpenClaw already ships an `azure-speech` plugin for TTS. That forces users who want to standardize on Microsoft's speech stack (or who already have Azure Speech keys provisioned) to depend on a non-Microsoft service for STT.

This PR closes that gap by registering Azure Speech as a realtime transcription provider and reusing the existing `azure-speech` plugin's config and env-var fallbacks.
What

- `extensions/azure-speech/`:
  - `realtime-transcription-provider.ts` — registry contract, config normalization, and provider builder.
  - `realtime-transcription-session.ts` — the recognizer session: lazy connect, partial/final transcript routing, NoMatch handling, error propagation, audio queueing with overflow protection, idempotent teardown.
  - `realtime-transcription-types.ts` — minimal structural typing of the `microsoft-cognitiveservices-speech-sdk` API surface used here.
- `index.ts` calls `api.registerRealtimeTranscriptionProvider(...)` next to the existing `api.registerSpeechProvider(...)` (see the sketch after this list).
- `openclaw.plugin.json` declares the new `realtimeTranscriptionProviders: ["azure-speech", "azure"]` contract and documents the new schema fields.
- `package.json` adds `microsoft-cognitiveservices-speech-sdk` as a runtime dependency. The SDK is lazy-loaded on first session creation, so users who only use Azure Speech TTS pay no extra startup cost.
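A hedged sketch of what the `index.ts` registration could look like; only the two `register*` calls are named in this PR — the plugin API type and the builder names below are assumptions:

```ts
// Illustrative plugin entry point; the API shape and builder names are hypothetical.
interface PluginApi {
  registerSpeechProvider(provider: unknown): void;
  registerRealtimeTranscriptionProvider(provider: unknown): void;
}

declare function buildAzureSpeechSpeechProvider(): unknown; // existing TTS provider (hypothetical name)
declare function buildAzureSpeechRealtimeTranscriptionProvider(): unknown; // new STT provider (hypothetical name)

export function activate(api: PluginApi): void {
  // Existing text-to-speech registration shipped by the azure-speech plugin.
  api.registerSpeechProvider(buildAzureSpeechSpeechProvider());

  // New in this PR: realtime STT for voice-call streaming ("azure-speech", alias "azure").
  api.registerRealtimeTranscriptionProvider(buildAzureSpeechRealtimeTranscriptionProvider());
}
```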
Why use the official SDK and not a hand-rolled WebSocket?

Azure Speech's continuous-recognition wire protocol (USP) is not publicly documented. The official SDK already handles the parts that matter for a streaming use case — connection setup, USP framing, partial vs final results, automatic reconnects, end-of-utterance detection — and is the pattern Microsoft's docs recommend. Lazy import keeps the cost off the critical path for non-STT users (mirrors the pattern used by the existing `microsoft-speech` plugin with `node-edge-tts`).
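For illustration, a minimal sketch of that lazy-import pattern, assuming a module-level cache (the helper name is an assumption, not the plugin's actual code):

```ts
// Cache the dynamic import so only the first realtime transcription session pays the
// SDK load cost; TTS-only installs never trigger it.
type SpeechSdk = typeof import("microsoft-cognitiveservices-speech-sdk");

let sdkPromise: Promise<SpeechSdk> | undefined;

export async function loadSpeechSdk(): Promise<SpeechSdk> {
  sdkPromise ??= import("microsoft-cognitiveservices-speech-sdk");
  return sdkPromise;
}
```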
Configuration

Reuses the existing `AZURE_SPEECH_KEY` / `AZURE_SPEECH_API_KEY` / `SPEECH_KEY` and `AZURE_SPEECH_REGION` / `SPEECH_REGION` env-var fallbacks. Sovereign clouds and private endpoints can use `endpoint` instead of `region`.

Voice Call config:
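A hypothetical example of that provider block, written as a TypeScript object for illustration; the keys mirror the options listed in this PR, while the exact nesting and the concrete values shown are assumptions:

```ts
// Illustrative only: option names follow the documented config under
// plugins.entries.voice-call.config.streaming.providers.azure-speech.*;
// the "provider" selector field and all values are assumptions.
const streaming = {
  provider: "azure-speech", // alias "azure" is also registered
  providers: {
    "azure-speech": {
      // apiKey / region fall back to AZURE_SPEECH_KEY / AZURE_SPEECH_REGION
      // (or SPEECH_KEY / SPEECH_REGION) when omitted.
      language: "en-US",                 // BCP-47, default en-US
      sampleRate: 8000,                  // default 8000, matches telephony media streams
      encoding: "mulaw",                 // "pcm" | "mulaw" | "alaw", default mulaw
      initialSilenceTimeoutMs: 5000,     // illustrative value
      endSilenceTimeoutMs: 1000,         // illustrative value
      candidateLanguages: ["en-US", "fr-FR"], // enables automatic language detection
      // endpoint: "wss://example.invalid",   // sovereign cloud / private endpoint instead of region
    },
  },
};
```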
Tests

- Config normalization: env fallbacks, `providers.<id>` sub-config, encoding aliases (`linear16` → `pcm`, `g711_ulaw` → `mulaw`, etc.), invalid encoding rejection, candidate-language parsing.
- Registry contract: `id`/aliases, `isConfigured` for the region+key, endpoint+key, missing-key, and missing-location cases, `createSession` error messages.
- Session lifecycle: connect via `fromSubscription` and `fromEndpoint`, auto-detect via `SpeechRecognizer.FromConfig`, partial/final transcripts, NoMatch handling, error vs EndOfStream cancellation, start-failure rejection, audio buffer overflow, lazy connect on first audio frame, idempotent close, ignored-after-close audio.
- Live integration: `azure-speech.live.test.ts` (skipped without `AZURE_SPEECH_KEY` / `AZURE_SPEECH_REGION`) synthesizes a short µ-law clip with the existing TTS provider and feeds it through the new STT provider in 20 ms frames, verifying the round-trip transcript.
- `pnpm vitest run extensions/azure-speech/ extensions/voice-call/`: 328 tests across 43 files, all green.
- `pnpm tsgo:extensions`, `pnpm tsgo:extensions:test`, `pnpm lint:extensions`, `pnpm config:docs:check`, `pnpm config:schema:check`, `pnpm check:loc`: all clean.
Docs

- `docs/providers/azure-speech.md` — adds the realtime transcription config table, a "Realtime transcription" accordion entry, and updates the page summary.
- `docs/plugins/voice-call.md` — adds Azure Speech to the bundled realtime transcription provider list and a full Azure Speech tab in the streaming provider examples.
Related

- … (commit c866820fed0).
- … outbound voice notes and inbound transcription against the same Azure Speech resource.