feat(msteams): Teams live voice support with .NET media worker#57511
feat(msteams): Teams live voice support with .NET media worker#57511lupuletic wants to merge 1 commit intoopenclaw:mainfrom
Conversation
…chitecture Three-tier Teams voice: text bot (anywhere), transcript mode (anywhere), and live voice (requires Windows Teams Voice Worker). Adds TS agent plane with capability negotiation, compliance gate, per-speaker unmixed audio pipeline, streaming STT, cut-through TTS, and gRPC bridge to .NET media worker. Includes .NET 6 media worker scaffolding, Teams app manifest template (schema 1.24 with RSC), and docs page.
702abc0 to
52deda7
Compare
|
Codex maintainer review: valuable direction, but not mergeable in this shape. Main concerns:
I would keep the idea alive, but ask for a much narrower first PR: documented capability tiers + plugin config schema + no-op capability negotiation, with worker/audio code following after the external build and permission proof exists. |
steipete
left a comment
There was a problem hiding this comment.
Codex deep review: requesting changes. The direction is interesting, but this PR is not safe or functional enough to merge.
Findings:
WorkerBridgeresolves the proto from the wrong path.
extensions/msteams/src/voice/worker-bridge.ts sets PROTO_PATH to ../../../media-worker/Protos/bridge.proto from extensions/msteams/src/voice. That resolves to extensions/media-worker/Protos/bridge.proto, not extensions/msteams/media-worker/Protos/bridge.proto. The first live connect() will fail before it can load the gRPC service. From the source layout this should be two levels up, or the proto needs to be copied into the built package and resolved via a packaged asset path.
- Remote worker transport is unauthenticated/plaintext while carrying Teams app secrets.
The .NET worker listens on all interfaces (Program.cs uses ListenAnyIP(grpcPort)), and the TS bridge connects with grpc.credentials.createInsecure(). joinMeeting then sends tenantId, appId, and appSecret over that channel. The docs show remote worker addresses/FQDNs, so this is not purely loopback. This needs mTLS or at least an explicit shared-token/TLS story before any remote worker support can land. At minimum, default to loopback-only and reject non-loopback worker addresses unless a secure transport/auth option is configured.
- SecretRef app passwords are silently dropped.
extensions/msteams/src/voice/manager.ts resolves appSecret only when msteamsConfig.appPassword is a string; if the normal config contains a SecretRef, the manager sends an empty secret to the worker. The rest of the msteams plugin already has secret-input helpers (extensions/msteams/src/token.ts, secret-input.ts) for this exact contract. Voice must use the same secret resolution path and fail early with a useful error if unresolved.
- TTS output is sent to the worker as raw PCM, but
textToSpeech()does not guarantee raw 16kHz PCM bytes.
extensions/msteams/src/voice/manager.ts reads ttsResult.audioPath and passes those file bytes directly to WorkerBridge.playAudio; AudioPlayback.cs treats the bytes as 16kHz mono signed PCM frames. If the configured TTS provider emits WAV/MP3/Opus or another sample rate, the worker will inject encoded/container bytes as PCM noise. The code needs an explicit decode/resample step to the worker's required format, or the worker protocol should carry MIME/container metadata and decode there.
- gRPC stream subscriptions leak delegates.
BridgeService.SubscribeUnmixedAudio and SubscribeEvents add callbacks to session.AudioSubscribers / session.EventSubscribers but never remove them in finally. ConcurrentBag has no removal path, so every disconnected TS client leaves a dead subscriber attached for the lifetime of the call. Use a removable subscription collection or per-subscriber channel registry and remove on stream completion/cancellation.
- Speaker identity is currently fabricated.
CallHandler.ResolveSpeaker() iterates participants but never maps the media speaker id to a participant; it always returns AadUserId = speakerId.ToString() and DisplayName = Speaker-{speakerId}. That means the TS prompt path will attribute real participants to synthetic speaker labels. If per-speaker unmixed audio is part of the value prop, this needs a real mapping or the user-visible docs should not claim identified speakers yet.
- The public SDK seam is not justified by this PR.
src/plugin-sdk/msteams.ts exports MSTeamsVoiceConfig and MSTeamsVoicePermissionMode, creating a public SDK contract for an implementation that has not proven the worker protocol, permission model, or transport security. Keep the first pass plugin-private, or land the SDK addition separately with explicit versioned contract review.
Suggested split: first land a tiny config/status capability tier PR with tests. Then a worker protocol PR with secure transport and build proof. Then audio/STT/TTS integration with real format conversion and a live worker smoke. This PR currently bundles all of those risk surfaces together.
|
This pull request has been automatically marked as stale due to inactivity. |
|
Codex review: found issues before merge. Summary Reproducibility: yes. for the review blockers: static inspection of the PR head reproduces the insecure bridge, undeclared gRPC imports, raw TTS-file-to-PCM playback path, receive-only audio socket, and floating NuGet ranges. No live Teams E2E reproduction is available from the supplied context, and the PR test plan leaves that proof unchecked. Next step before merge Security Review findings
Review detailsBest possible solution: Land a narrower owner-reviewed sequence: capability-tier docs and config first, then an authenticated worker bridge, deterministic .NET project, compliance-gated audio path, and live/RSC proof in separate PRs. Do we have a high-confidence way to reproduce the issue? Yes for the review blockers: static inspection of the PR head reproduces the insecure bridge, undeclared gRPC imports, raw TTS-file-to-PCM playback path, receive-only audio socket, and floating NuGet ranges. No live Teams E2E reproduction is available from the supplied context, and the PR test plan leaves that proof unchecked. Is this the best way to solve the issue? No. The current all-in-one PR is not the best way to solve the feature because it combines contract, worker infrastructure, security-sensitive media control, compliance behavior, and realtime audio before the dependency and permission contracts are proven. Full review comments:
Overall correctness: patch is incorrect Security concerns:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against cc8a8f1df1cd. |
Summary
ReceiveUnmixedMeetingAudiofor per-speaker audio capturedocs/channels/msteams-voice.mdArchitecture
live_voice(worker available) →transcript_mode(post-meeting artifacts) →text_onlyFiles
TS voice module (11 new files in
extensions/msteams/src/voice/):types, config, compliance-gate, worker-bridge, manager, streaming-stt, cut-through-tts, own-voice-filter, audio-pipeline, transcript-fallback
.NET media worker (10 new files in
extensions/msteams/media-worker/):CallHandler, ComplianceGate, UnmixedAudioCapture, AudioPlayback, QoEMonitor, WorkerRegistry, BridgeService, bridge.proto
Other: manifest template, docs page, config types, plugin schema
Test plan
pnpm checkpasses (verified)dotnet buildCalls.AccessMedia.Chatfor app-hosted media🤖 Generated with Claude Code