feat(msteams): Teams live voice support with .NET media worker #57511

Draft
lupuletic wants to merge 1 commit into openclaw:main from lupuletic:feature/msteams-voice-calls

Conversation

@lupuletic
Contributor

Summary

  • Adds three-tier MS Teams voice support: text bot (anywhere), transcript mode (anywhere), live voice (requires Windows Teams Voice Worker)
  • TS agent plane with capability negotiation, compliance gate, per-speaker unmixed audio pipeline, streaming STT, cut-through TTS, and gRPC bridge
  • .NET 6 media worker scaffolding using the Microsoft Graph Communications SDK, with ReceiveUnmixedMeetingAudio enabled for per-speaker audio capture
  • Teams app manifest template (schema 1.24 with RSC permissions)
  • Full docs page at docs/channels/msteams-voice.md

Architecture

OpenClaw (TS, any OS) ←gRPC→ .NET Media Worker (Windows) ←RTP→ MS Teams
  • TS control plane: capability negotiation, compliance gate, STT/TTS pipeline, agent routing
  • .NET media worker: owns call lifecycle via Graph Communications SDK, unmixed audio capture, playback injection
  • Three capability tiers: live_voice (worker available) → transcript_mode (post-meeting artifacts) → text_only
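The tier fallback described above can be sketched as a small pure function. This is an illustration only: the three tier labels come from the PR description, while `CapabilityInputs`, its fields, and `negotiateTier` are hypothetical names that are not taken from the PR's code.

```typescript
// Tier labels are from the PR description; everything else is hypothetical.
type CapabilityTier = "live_voice" | "transcript_mode" | "text_only";

interface CapabilityInputs {
  // Assumption: true when a media worker has answered a health probe.
  workerAvailable: boolean;
  // Assumption: true when post-meeting artifacts (transcripts) can be fetched.
  transcriptsAvailable: boolean;
}

// Pick the richest tier the environment supports, falling back in order.
function negotiateTier(inputs: CapabilityInputs): CapabilityTier {
  if (inputs.workerAvailable) return "live_voice";
  if (inputs.transcriptsAvailable) return "transcript_mode";
  return "text_only";
}
```

Keeping the decision in one pure function would make a "no-op capability negotiation" easy to unit test before any worker code exists.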

Files

TS voice module (11 new files in extensions/msteams/src/voice/):
types, config, compliance-gate, worker-bridge, manager, streaming-stt, cut-through-tts, own-voice-filter, audio-pipeline, transcript-fallback

.NET media worker (10 new files in extensions/msteams/media-worker/):
CallHandler, ComplianceGate, UnmixedAudioCapture, AudioPlayback, QoEMonitor, WorkerRegistry, BridgeService, bridge.proto

Other: manifest template, docs page, config types, plugin schema

Test plan

  • pnpm check passes (verified)
  • Unit tests for compliance gate, VTT parser, config parsing
  • Mock gRPC worker integration test for TS pipeline
  • .NET media worker builds with dotnet build
  • E2E test with real Teams meeting via Azure Windows VM worker
  • RSC permission spike: validate Calls.AccessMedia.Chat for app-hosted media

🤖 Generated with Claude Code

@openclaw-barnacle (bot) added the docs, channel: msteams, and size: XL labels on Mar 30, 2026
…chitecture

Three-tier Teams voice: text bot (anywhere), transcript mode (anywhere),
and live voice (requires Windows Teams Voice Worker). Adds TS agent plane
with capability negotiation, compliance gate, per-speaker unmixed audio
pipeline, streaming STT, cut-through TTS, and gRPC bridge to .NET media
worker. Includes .NET 6 media worker scaffolding, Teams app manifest
template (schema 1.24 with RSC), and docs page.
@lupuletic force-pushed the feature/msteams-voice-calls branch from 702abc0 to 52deda7 on April 5, 2026 13:49
@steipete
Contributor

Codex maintainer review: valuable direction, but not mergeable in this shape.

Main concerns:

  • Scope is too large for one PR: Teams plugin config, TS voice pipeline, .NET media worker, Graph/RSC docs, SDK surface, STT/TTS routing, and compliance behavior all land together. Split into contract/config/docs first, then worker integration, then live audio pipeline.
  • The test plan still has unchecked critical gates (dotnet build, mock gRPC integration, real Teams/Azure VM E2E, RSC permission spike). Live voice cannot land without proof for the external worker and permission model.
  • src/plugin-sdk/msteams.ts adds a public SDK seam. That needs explicit contract review/versioning and should be justified by a generic plugin need, not only this implementation.
  • Compliance/recording behavior is high-risk. The PR correctly mentions updateRecordingStatus, but the landing bar needs tests or a documented live proof that audio processing is blocked until compliance is active.

I would keep the idea alive, but ask for a much narrower first PR: documented capability tiers + plugin config schema + no-op capability negotiation, with worker/audio code following after the external build and permission proof exists.
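One way to make the compliance requirement testable is a gate that drops every audio frame until compliance has been marked active. The sketch below is an assumption about shape, not the PR's compliance-gate module: `ComplianceGatedPipeline` and its method names are hypothetical, and how the gate learns that updateRecordingStatus succeeded is left out.

```typescript
// Hypothetical gate: no frame reaches STT until compliance is marked active.
class ComplianceGatedPipeline {
  private active = false;
  // Frames that were allowed through (stand-in for the STT input queue).
  readonly processed: Uint8Array[] = [];
  // Count of frames rejected while the gate was closed.
  dropped = 0;

  // Assumption: called only after the recording status update is confirmed.
  markComplianceActive(): void {
    this.active = true;
  }

  markComplianceInactive(): void {
    this.active = false;
  }

  // Returns true when the frame was accepted, false when it was dropped.
  onAudioFrame(frame: Uint8Array): boolean {
    if (!this.active) {
      this.dropped += 1;
      return false;
    }
    this.processed.push(frame);
    return true;
  }
}
```

A unit test can then assert that frames arriving before `markComplianceActive()` never appear in `processed`.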


@steipete steipete left a comment

Codex deep review: requesting changes. The direction is interesting, but this PR is not safe or functional enough to merge.

Findings:

  1. WorkerBridge resolves the proto from the wrong path.

extensions/msteams/src/voice/worker-bridge.ts sets PROTO_PATH to ../../../media-worker/Protos/bridge.proto from extensions/msteams/src/voice. That resolves to extensions/media-worker/Protos/bridge.proto, not extensions/msteams/media-worker/Protos/bridge.proto. The first live connect() will fail before it can load the gRPC service. From the source layout this should be two levels up, or the proto needs to be copied into the built package and resolved via a packaged asset path.
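The path arithmetic can be checked with Node's `path` module. The directory and proto paths below are the ones quoted in this finding; the variable names are only for illustration.

```typescript
import * as path from "node:path";

// Source directory of worker-bridge.ts, as stated in the finding.
const voiceDir = path.posix.join("extensions", "msteams", "src", "voice");

// Three levels up leaves the msteams extension directory entirely.
const threeUp = path.posix.join(voiceDir, "../../../media-worker/Protos/bridge.proto");

// Two levels up stays inside extensions/msteams.
const twoUp = path.posix.join(voiceDir, "../../media-worker/Protos/bridge.proto");

console.log(threeUp); // extensions/media-worker/Protos/bridge.proto
console.log(twoUp); // extensions/msteams/media-worker/Protos/bridge.proto
```

This only verifies the source-layout case. If the built output lives in a different directory depth, a packaged asset path is still the more robust option.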

  2. Remote worker transport is unauthenticated/plaintext while carrying Teams app secrets.

The .NET worker listens on all interfaces (Program.cs uses ListenAnyIP(grpcPort)), and the TS bridge connects with grpc.credentials.createInsecure(). joinMeeting then sends tenantId, appId, and appSecret over that channel. The docs show remote worker addresses/FQDNs, so this is not purely loopback. This needs mTLS or at least an explicit shared-token/TLS story before any remote worker support can land. At minimum, default to loopback-only and reject non-loopback worker addresses unless a secure transport/auth option is configured.
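A minimal sketch of the "loopback-only unless secured" default, under stated assumptions: `WorkerTransportOptions`, `tlsConfigured`, and the function names are hypothetical and do not reflect the PR's config schema. IPv6 addresses are assumed to be written in bracketed form such as `[::1]:50051`.

```typescript
// Hypothetical config shape for the worker address check.
interface WorkerTransportOptions {
  address: string; // "host:port"; IPv6 hosts must be bracketed
  tlsConfigured: boolean; // assumption: true when TLS/mTLS or a bridge token is set
}

// Extract the host part from "host:port" or "[v6]:port".
function hostOf(address: string): string {
  if (address.startsWith("[")) {
    const end = address.indexOf("]");
    return end === -1 ? address : address.slice(1, end);
  }
  const idx = address.lastIndexOf(":");
  return idx === -1 ? address : address.slice(0, idx);
}

// Textual loopback check: "localhost", "::1", or any 127.x.y.z literal.
function isLoopbackHost(host: string): boolean {
  const h = host.toLowerCase();
  return h === "localhost" || h === "::1" || /^127(\.\d{1,3}){3}$/.test(h);
}

// Throw before connecting if a remote worker is configured without security.
function assertWorkerAddressAllowed(opts: WorkerTransportOptions): void {
  if (opts.tlsConfigured) return;
  const host = hostOf(opts.address);
  if (!isLoopbackHost(host)) {
    throw new Error(
      `Refusing insecure connection to non-loopback worker "${host}"; configure a secure transport first`,
    );
  }
}
```

The check is purely textual and does not resolve DNS, so a hostname that happens to point at loopback is still rejected, which errs on the safe side.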

  3. SecretRef app passwords are silently dropped.

extensions/msteams/src/voice/manager.ts resolves appSecret only when msteamsConfig.appPassword is a string; if the normal config contains a SecretRef, the manager sends an empty secret to the worker. The rest of the msteams plugin already has secret-input helpers (extensions/msteams/src/token.ts, secret-input.ts) for this exact contract. Voice must use the same secret resolution path and fail early with a useful error if unresolved.
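The fail-early behavior could look roughly like the following. The plugin's real SecretRef type and helpers (in token.ts and secret-input.ts) are not shown in this thread, so `SecretInput`, `SecretResolver`, and `resolveAppSecret` are hypothetical stand-ins, and the resolver is kept synchronous only to keep the sketch short.

```typescript
// Hypothetical stand-in for the plugin's secret input contract.
type SecretInput = string | { ref: string };

// Hypothetical resolver: looks up a reference, returns undefined when unresolved.
type SecretResolver = (ref: string) => string | undefined;

// Resolve either form of secret, and refuse to continue with an empty value.
function resolveAppSecret(input: SecretInput | undefined, resolve: SecretResolver): string {
  let value: string | undefined;
  if (typeof input === "string") {
    value = input;
  } else if (input !== undefined) {
    value = resolve(input.ref);
  }
  if (value === undefined || value.trim() === "") {
    throw new Error("msteams voice: appPassword could not be resolved; refusing to join with an empty secret");
  }
  return value;
}
```

In the actual plugin the voice manager would call the existing secret helpers rather than a new function like this one; the point is only that an unresolved reference raises an error instead of becoming an empty string.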

  4. TTS output is sent to the worker as raw PCM, but textToSpeech() does not guarantee raw 16kHz PCM bytes.

extensions/msteams/src/voice/manager.ts reads ttsResult.audioPath and passes those file bytes directly to WorkerBridge.playAudio; AudioPlayback.cs treats the bytes as 16kHz mono signed PCM frames. If the configured TTS provider emits WAV/MP3/Opus or another sample rate, the worker will inject encoded/container bytes as PCM noise. The code needs an explicit decode/resample step to the worker's required format, or the worker protocol should carry MIME/container metadata and decode there.
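As a partial illustration of a format guard, the sketch below accepts only a RIFF/WAVE file that already holds 16 kHz mono 16-bit PCM and returns its raw samples; anything else is rejected so the caller can transcode first. It does not decode MP3/Opus or resample, and `extractPcmFromWav` is a hypothetical helper, not part of the PR.

```typescript
// Returns the raw sample bytes of a WAV file that matches the worker's
// expected format (PCM, 16000 Hz, mono, 16-bit). Throws otherwise.
function extractPcmFromWav(file: Uint8Array): Uint8Array {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const tag = (off: number): string =>
    String.fromCharCode(file[off], file[off + 1], file[off + 2], file[off + 3]);

  if (file.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file; transcode before playback");
  }

  let formatOk = false;
  let sawFmt = false;
  let offset = 12;
  // Walk the chunk list: 4-byte id, 4-byte little-endian size, then the body.
  while (offset + 8 <= file.byteLength) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true);
    const body = offset + 8;
    if (id === "fmt " && body + 16 <= file.byteLength) {
      sawFmt = true;
      const audioFormat = view.getUint16(body, true); // 1 = linear PCM
      const channels = view.getUint16(body + 2, true);
      const sampleRate = view.getUint32(body + 4, true);
      const bitsPerSample = view.getUint16(body + 14, true);
      formatOk = audioFormat === 1 && channels === 1 && sampleRate === 16000 && bitsPerSample === 16;
    } else if (id === "data") {
      if (!sawFmt) throw new Error("data chunk appeared before fmt chunk");
      if (!formatOk) throw new Error("WAV is not 16 kHz mono 16-bit PCM; resample before playback");
      return file.subarray(body, Math.min(body + size, file.byteLength));
    }
    offset = body + size + (size % 2); // chunk bodies are padded to even length
  }
  throw new Error("no data chunk found");
}
```

A guard like this turns the "PCM noise" failure mode into an explicit error, but the real fix still needs a decode/resample step or a TTS provider option that emits the required format directly.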

  5. gRPC stream subscriptions leak delegates.

BridgeService.SubscribeUnmixedAudio and SubscribeEvents add callbacks to session.AudioSubscribers / session.EventSubscribers but never remove them in finally. ConcurrentBag has no removal path, so every disconnected TS client leaves a dead subscriber attached for the lifetime of the call. Use a removable subscription collection or per-subscriber channel registry and remove on stream completion/cancellation.
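The worker itself is C#, but the removable-subscription pattern suggested here is language-neutral. The TypeScript sketch below illustrates it with hypothetical names (`SubscriberRegistry`, `withSubscription`): subscribing returns an unsubscribe function, and the stream handler calls it in `finally` so a cancelled or failed stream never leaves a callback behind.

```typescript
type AudioCallback = (frame: Uint8Array) => void;

// A registry whose entries can be removed, unlike an append-only bag.
class SubscriberRegistry {
  private readonly subscribers = new Set<AudioCallback>();

  // Register a callback and hand back the function that removes it.
  subscribe(cb: AudioCallback): () => void {
    this.subscribers.add(cb);
    return () => {
      this.subscribers.delete(cb);
    };
  }

  publish(frame: Uint8Array): void {
    this.subscribers.forEach((cb) => cb(frame));
  }

  get size(): number {
    return this.subscribers.size;
  }
}

// Run a stream body with a subscription that is always cleaned up,
// including when the body throws (for example on client disconnect).
function withSubscription(registry: SubscriberRegistry, cb: AudioCallback, body: () => void): void {
  const unsubscribe = registry.subscribe(cb);
  try {
    body();
  } finally {
    unsubscribe();
  }
}
```

In the .NET worker the equivalent would be a collection with a removal path plus a `finally` block in each streaming method.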

  6. Speaker identity is currently fabricated.

CallHandler.ResolveSpeaker() iterates participants but never maps the media speaker id to a participant; it always returns AadUserId = speakerId.ToString() and DisplayName = Speaker-{speakerId}. That means the TS prompt path will attribute real participants to synthetic speaker labels. If per-speaker unmixed audio is part of the value prop, this needs a real mapping or the user-visible docs should not claim identified speakers yet.
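A real mapping would look the speaker id up against participant data and report an explicit "unresolved" result instead of inventing an identity. The sketch below assumes each participant record exposes the media source ids of its audio streams; the `mediaSourceIds` field and all type names are hypothetical and would need to be checked against what the Graph Communications SDK actually provides.

```typescript
// Hypothetical participant record; mediaSourceIds is an assumed field.
interface Participant {
  aadUserId: string;
  displayName: string;
  mediaSourceIds: number[];
}

interface SpeakerIdentity {
  aadUserId: string | null; // null when the speaker could not be mapped
  displayName: string;
  resolved: boolean;
}

// Map a media speaker id to a participant, never fabricating an AAD id.
function resolveSpeaker(speakerId: number, participants: Participant[]): SpeakerIdentity {
  for (const p of participants) {
    if (p.mediaSourceIds.indexOf(speakerId) !== -1) {
      return { aadUserId: p.aadUserId, displayName: p.displayName, resolved: true };
    }
  }
  return { aadUserId: null, displayName: "Unknown speaker", resolved: false };
}
```

Carrying the `resolved` flag through to the prompt path would let the agent avoid attributing speech to a synthetic label.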

  7. The public SDK seam is not justified by this PR.

src/plugin-sdk/msteams.ts exports MSTeamsVoiceConfig and MSTeamsVoicePermissionMode, creating a public SDK contract for an implementation that has not proven the worker protocol, permission model, or transport security. Keep the first pass plugin-private, or land the SDK addition separately with explicit versioned contract review.

Suggested split: first land a tiny config/status capability tier PR with tests. Then a worker protocol PR with secure transport and build proof. Then audio/STT/TTS integration with real format conversion and a live worker smoke. This PR currently bundles all of those risk surfaces together.

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle (bot) added the stale label on Apr 30, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 1, 2026

Codex review: found issues before merge.

Summary
This draft PR adds Microsoft Teams live voice docs/config/schema, a TypeScript voice manager and gRPC worker bridge, a Teams manifest template, and a Windows .NET media-worker scaffold.

Reproducibility: yes for the review blockers. Static inspection of the PR head reproduces the insecure bridge, undeclared gRPC imports, raw TTS-file-to-PCM playback path, receive-only audio socket, and floating NuGet ranges. No live Teams E2E reproduction is available from the supplied context, and the PR test plan leaves that proof unchecked.

Next step before merge
Maintainer next action is product/security/ownership review and PR decomposition, not an automated repair, because the blockers span bridge security, public config/SDK contract, .NET worker dependencies, compliance policy, and live Microsoft validation.

Security
Needs attention: The PR introduces an unauthenticated insecure media-worker bridge and floating .NET dependency ranges that must be fixed before merge.

Review findings

  • [P1] Authenticate the worker bridge before sending secrets — extensions/msteams/src/voice/worker-bridge.ts:121
  • [P2] Declare the gRPC client dependencies — extensions/msteams/src/voice/worker-bridge.ts:28-38
  • [P2] Convert TTS output before streaming it as PCM — extensions/msteams/src/voice/manager.ts:486-489
Review details

Best possible solution:

Land a narrower owner-reviewed sequence: capability-tier docs and config first, then an authenticated worker bridge, deterministic .NET project, compliance-gated audio path, and live/RSC proof in separate PRs.

Do we have a high-confidence way to reproduce the issue?

Yes for the review blockers: static inspection of the PR head reproduces the insecure bridge, undeclared gRPC imports, raw TTS-file-to-PCM playback path, receive-only audio socket, and floating NuGet ranges. No live Teams E2E reproduction is available from the supplied context, and the PR test plan leaves that proof unchecked.

Is this the best way to solve the issue?

No. The current all-in-one PR is not the best way to solve the feature because it combines contract, worker infrastructure, security-sensitive media control, compliance behavior, and realtime audio before the dependency and permission contracts are proven.

Full review comments:

  • [P1] Authenticate the worker bridge before sending secrets — extensions/msteams/src/voice/worker-bridge.ts:121
    connect() creates an insecure gRPC channel, while JoinMeetingRequest carries appSecret and the worker listens on all interfaces. Require TLS/mTLS or a scoped bridge token before sending credentials or call/audio control over this port.
    Confidence: 0.92
  • [P2] Declare the gRPC client dependencies — extensions/msteams/src/voice/worker-bridge.ts:28-38
    The new bridge imports @grpc/grpc-js and @grpc/proto-loader, but the Teams plugin package and lockfile do not declare them. Typecheck/package builds and voice.enabled startup will fail before capability negotiation can run.
    Confidence: 0.95
  • [P2] Convert TTS output before streaming it as PCM — extensions/msteams/src/voice/manager.ts:486-489
    textToSpeech() returns an audio file in the provider output format, but this code reads those bytes and sends them to a worker API documented as raw 16 kHz mono PCM. Use telephony PCM output or an explicit transcode step before playback.
    Confidence: 0.88
  • [P2] Open a send-capable audio socket for playback — extensions/msteams/media-worker/CallHandler.cs:123-126
    The worker creates the call audio socket with StreamDirection.Recvonly, then later uses the same socket for AudioPlayback.Send(). A receive-only media stream cannot satisfy the advertised listen-and-speak live voice path.
    Confidence: 0.82
  • [P2] Pin the media worker package versions — extensions/msteams/media-worker/TeamsMediaWorker.csproj:11-15
    The .NET project restores wildcard package ranges such as 1.2.*, 2.*, and 3.*. Pin exact versions so worker builds are reproducible and Graph/gRPC runtime behavior cannot change without a source diff.
    Confidence: 0.9
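A check for floating versions could be automated with a small script. This is a rough sketch that scans `PackageReference` elements for a `*` in the `Version` attribute; the function name and the package names in the usage example are made up, and a regex over XML only handles the simple attribute order shown here.

```typescript
// Return "Name Version" for every PackageReference whose version floats.
function findFloatingVersions(csproj: string): string[] {
  const floating: string[] = [];
  // Assumption: Include comes before Version on the same element.
  const re = /<PackageReference\s+Include="([^"]+)"\s+Version="([^"]+)"/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(csproj)) !== null) {
    const name = m[1];
    const version = m[2];
    if (version.indexOf("*") !== -1) {
      floating.push(`${name} ${version}`);
    }
  }
  return floating;
}

// Usage with made-up package names:
const sample = `
<ItemGroup>
  <PackageReference Include="Example.Floating" Version="1.2.*" />
  <PackageReference Include="Example.Pinned" Version="3.1.4" />
</ItemGroup>`;
console.log(findFloatingVersions(sample)); // [ 'Example.Floating 1.2.*' ]
```

Running something like this in CI would fail the build when a wildcard range reappears.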

Overall correctness: patch is incorrect
Overall confidence: 0.91

Security concerns:

  • [high] Unauthenticated insecure worker bridge — extensions/msteams/src/voice/worker-bridge.ts:121
    The TypeScript client uses insecure gRPC while JoinMeetingRequest includes the app secret and call/audio control methods. If the documented worker port is reachable, this can expose credentials, meeting audio, and call control.
    Confidence: 0.92
  • [medium] Floating NuGet package ranges — extensions/msteams/media-worker/TeamsMediaWorker.csproj:11
    The media worker uses wildcard package versions for Microsoft Graph Communications, gRPC, protobuf, and Grpc.Tools, weakening supply-chain reproducibility and allowing build/runtime behavior to change without a source diff.
    Confidence: 0.9

What I checked:

Likely related people:

  • steipete: Recent main history shows multiple Teams docs/plugin maintenance commits, and the existing maintainer review on this PR names the split/proof requirements for the Teams voice direction. (role: recent maintainer and reviewer; confidence: high; commits: 3f002b10d281, 08c4af0ddf62, dd098596cf34; files: extensions/msteams/src/monitor.ts, docs/channels/msteams.md, extensions/msteams/openclaw.plugin.json)
  • SidU: Authored the current Teams SDK and AI UX migration that underlies monitor, SDK, JWT validation, streaming/status behavior, and the current Teams channel runtime surface this PR extends. (role: introduced current Teams SDK runtime base; confidence: medium; commits: cd90130877f1; files: extensions/msteams/src/monitor.ts, extensions/msteams/src/sdk.ts, docs/channels/msteams.md)
  • BradGroux: Recent history shows substantial Teams plugin work and coauthorship around Teams config, streaming, auth, and federated credential support, which are adjacent to this PR's Teams voice/auth surface. (role: adjacent Teams plugin maintainer; confidence: medium; commits: 03c64df39fe7, fce81fccd859, 6b0e74000d9f; files: extensions/msteams/src/monitor.ts, src/config/types.msteams.ts, docs/channels/msteams.md)
  • HDYA: Authored the federated credential support that expanded the Teams auth/config contract; this PR's media worker and permission model would need to fit that owner-reviewed authentication surface. (role: adjacent Teams authentication owner; confidence: medium; commits: 26f633b604fd; files: src/config/types.msteams.ts, docs/channels/msteams.md, extensions/msteams/src/token.ts)

Remaining risk / open question:

  • The PR is a draft and its own checklist still lacks dotnet build, mock gRPC integration, live Teams/Azure VM E2E, and RSC permission proof.
  • The proposed bridge currently exposes call/audio control and app secrets over an unauthenticated insecure channel if the worker port is reachable.
  • The feature spans public config/SDK contract, external Windows infrastructure, compliance policy, realtime media, and provider audio behavior, so a narrow automated repair would not be safe.

Codex review notes: model gpt-5.5, reasoning high; reviewed against cc8a8f1df1cd.

@openclaw-barnacle (bot) removed the stale label on May 1, 2026