voice-live-universal-assistant

Voice Live Web Samples

Web-based code samples for the Azure Speech Voice Live API featuring a shared React frontend and language-specific backends.

Architecture

┌─────────────────────┐   WebSocket    ┌──────────────────┐   Voice Live SDK   ┌────────────────┐
│  React + Vite       │◄──────────────►│  Backend Server  │◄──────────────────►│  Azure Voice   │
│  (shared frontend)  │  JSON + PCM16  │  (Python/JS/…)   │   PCM16 + events   │  Live Service  │
│                     │  or JSON text  │                  │                    │                │
│  Voice mode: mic    │                │                  │                    │                │
│  Text mode: chat    │                │                  │                    │                │
└─────────────────────┘                └──────────────────┘                    └────────────────┘

The frontend builds to static files served by the backend — no separate frontend server needed in production.

Prerequisites

Node.js 20+ and npm (for building the frontend; also for the JavaScript backend)
Python 3.9+ (for the Python backend)
Java 17+ and Maven 3.8+ (for the Java backend)
.NET 8.0 SDK (for the C# backend)
An Azure AI Services resource with Voice Live API access

Authentication

Recommended (RBAC): Use DefaultAzureCredential — no API keys required.

az login   # Local development — authenticates via Azure CLI

For deployed environments, the azd infrastructure provisions a system-assigned managed identity with the Cognitive Services User role, enabling token-based auth without any keys.

Fallback (API key): Set AZURE_VOICELIVE_API_KEY in .env only if token-based auth is unavailable for your resource.
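Here is a minimal sketch of that decision. It assumes the backend only needs to know which credential type to build. The helper `resolve_auth` is hypothetical and is not taken from any of the backends. The comments name the Azure SDK for Python classes a backend would typically construct for each path.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_auth(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Decide how to authenticate (hypothetical helper, not the samples' code).

    Returns ("api_key", key) when AZURE_VOICELIVE_API_KEY holds a non-blank value,
    otherwise ("token", None) so the caller uses token-based (RBAC) auth.
    """
    key = env.get("AZURE_VOICELIVE_API_KEY", "").strip()
    if key:
        # Fallback path: a Python backend would typically wrap this in
        # azure.core.credentials.AzureKeyCredential(key).
        return ("api_key", key)
    # Recommended path: a Python backend would typically use
    # azure.identity.DefaultAzureCredential(), which picks up `az login`
    # locally and the managed identity when deployed.
    return ("token", None)
```

An empty `AZURE_VOICELIVE_API_KEY=` line, as in the sample `.env`, therefore leaves token-based auth in effect.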

Quick Start (Python)

1. Build the frontend

cd frontend
npm install
npm run build

This creates frontend/dist/ with the static files that the backend will serve.

2. Set up the Python backend

cd python

# Create and activate a virtual environment
python -m venv .venv

# Activate the venv
# Windows (PowerShell):
.venv\Scripts\Activate.ps1
# Windows (cmd):
.venv\Scripts\activate.bat
# macOS / Linux:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Configure environment variables

cp .env.sample .env

Edit .env with your credentials:

# Required
AZURE_VOICELIVE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/

# Authentication: DefaultAzureCredential is used by default (az login).
# Set API key below only as a fallback if token auth is unavailable.
AZURE_VOICELIVE_API_KEY=

# Connection mode: "model" (default) or "agent" (Foundry Agent Service)
VOICELIVE_MODE=model

# Model mode settings (default — works with just a Foundry resource)
VOICELIVE_MODEL=gpt-realtime
VOICELIVE_VOICE=en-US-Ava:DragonHDLatestNeural
VOICELIVE_TRANSCRIBE_MODEL=gpt-4o-transcribe

# Agent mode settings (when VOICELIVE_MODE=agent)
AZURE_VOICELIVE_AGENT_NAME=your-agent-name
AZURE_VOICELIVE_PROJECT=your-project-name
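The sketch below shows one way these lines could become backend settings. It uses only the standard library and assumes plain `KEY=VALUE` lines with `#` comments, with no quoting and no multi-line values. The actual backends may load `.env` differently; the Java backend, for example, uses a custom parser. `parse_env` and `DEFAULTS` are hypothetical names, and the defaults simply mirror the sample file above.

```python
from typing import Dict, Iterable

# Hypothetical defaults mirroring the sample .env above.
DEFAULTS = {
    "VOICELIVE_MODE": "model",
    "VOICELIVE_MODEL": "gpt-realtime",
    "VOICELIVE_VOICE": "en-US-Ava:DragonHDLatestNeural",
    "VOICELIVE_TRANSCRIBE_MODEL": "gpt-4o-transcribe",
}


def parse_env(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values: Dict[str, str] = dict(DEFAULTS)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        value = value.strip()
        if value:  # an empty value such as AZURE_VOICELIVE_API_KEY= leaves the key unset
            values[key.strip()] = value
    if not values.get("AZURE_VOICELIVE_ENDPOINT"):
        raise ValueError("AZURE_VOICELIVE_ENDPOINT is required")
    return values
```

To use it with a file in the current directory: `with open(".env") as f: settings = parse_env(f)`.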

4. Run the server

python app.py

Open http://localhost:8000 in your browser. Click Start session and allow microphone access when prompted.

Quick Start (Java)

1. Build the frontend

cd frontend
npm install
npm run build

2. Set up the Java backend

cd java
cp .env.sample .env
# Edit .env with your Azure Voice Live endpoint

3. Build and run

mvn clean package -DskipTests
java -jar target/voice-live-universal-assistant-1.0.0.jar

Or with Maven directly:

mvn spring-boot:run

Open http://localhost:8000 in your browser.

Note: See the Java backend README for environment and ecosystem notes.

Quick Start (JavaScript / Node.js)

1. Build the frontend

cd frontend
npm install
npm run build

2. Set up the Node.js backend

cd javascript
npm install
cp .env.sample .env
# Edit .env with your Azure Voice Live endpoint

3. Run

npm start

Open http://localhost:8000 in your browser.

Quick Start (C# / ASP.NET Core)

1. Build the frontend

cd frontend
npm install
npm run build

2. Set up the C# backend

cd csharp
cp .env.sample .env
# Edit .env with your Azure Voice Live endpoint

3. Build and run

dotnet run

Open http://localhost:8000 in your browser.

Connection Modes

Mode	Use case	How it works
`model`	Direct model access / BYOM (default)	Caller configures model, voice, system prompt. Works with just an endpoint — no agent setup required. Set `VOICELIVE_MODEL` and `VOICELIVE_VOICE`.
`agent`	Foundry Agent Service integration	Agent defines instructions, tools, and voice. Set `AZURE_VOICELIVE_AGENT_NAME` and `AZURE_VOICELIVE_PROJECT`. Auto-set when deploying with `CREATE_AGENT=true`.

Switch modes by setting VOICELIVE_MODE in .env or via the Settings panel in the UI.
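Each mode depends on a different set of settings. The following hypothetical check lists what is still unset for the selected mode, using the variable names from the table. `missing_settings` is illustrative and is not code from the backends.

```python
from typing import Dict, List

# Settings each mode needs, per the Connection Modes table.
REQUIRED_BY_MODE = {
    "model": ["VOICELIVE_MODEL", "VOICELIVE_VOICE"],
    "agent": ["AZURE_VOICELIVE_AGENT_NAME", "AZURE_VOICELIVE_PROJECT"],
}


def missing_settings(settings: Dict[str, str]) -> List[str]:
    """Return the settings still unset for VOICELIVE_MODE (defaults to "model")."""
    mode = settings.get("VOICELIVE_MODE") or "model"
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown VOICELIVE_MODE: {mode!r}")
    return [key for key in REQUIRED_BY_MODE[mode] if not settings.get(key)]
```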

Frontend URL Query Parameters

The frontend accepts URL query parameters to customize behavior without changing server configuration. These parameters control UI appearance, input mode, and pre-fill settings.

Parameter	Values	Default	Description
`?mode=voice|text`	`voice`, `text`	`voice`	Lock the input mode. `voice` = microphone-based interaction with real-time audio. `text` = chat-style text input. When set, the mode toggle in the menu is hidden.
`?lock=true`	`true`	`false`	Kiosk/embed mode — hides the settings gear and mode toggle entirely. The app uses only server-provided configuration.
`?agent=NAME`	any string	—	Pre-fills the agent name in settings and automatically sets connection mode to `agent` (when used together with `?project=`).
`?project=NAME`	any string	—	Pre-fills the project name in settings and automatically sets connection mode to `agent` (when used together with `?agent=`).
`?theme=light|dark`	`light`, `dark`	system	Override the UI theme. Without this, the app follows the system preference.
`?greeting=false`	`false`	`true`	Disable the proactive greeting that the agent sends when a session starts.

Examples:

# Voice mode with agent pre-configured (typical embed scenario)
https://your-app.azurecontainerapps.io/?agent=my-agent&project=my-project&lock=true

# Text mode for chat-style interaction
https://your-app.azurecontainerapps.io/?mode=text

# Light theme, no greeting
https://your-app.azurecontainerapps.io/?theme=light&greeting=false
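The frontend itself is written in TypeScript. To keep the examples in one language, the rules in the table are restated here as a Python sketch and applied to URLs like the examples above. `parse_ui_options` and `UiOptions` are hypothetical names. Ignoring unrecognized values and falling back to the defaults is an assumption about the frontend's behavior.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlparse


@dataclass
class UiOptions:
    mode: str = "voice"                     # ?mode=voice|text
    mode_locked: bool = False               # mode toggle hidden when ?mode= is given
    lock: bool = False                      # ?lock=true hides settings gear and mode toggle
    agent: Optional[str] = None             # ?agent=NAME pre-fills the agent name
    project: Optional[str] = None           # ?project=NAME pre-fills the project name
    connection_mode: Optional[str] = None   # "agent" only when agent and project are both set
    theme: str = "system"                   # ?theme=light|dark, else follow the OS
    greeting: bool = True                   # ?greeting=false disables the proactive greeting


def parse_ui_options(url: str) -> UiOptions:
    # parse_qs drops blank values, so "?agent=" counts as absent.
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    opts = UiOptions()
    if params.get("mode") in ("voice", "text"):
        opts.mode = params["mode"]
        opts.mode_locked = True
    opts.lock = params.get("lock") == "true"
    opts.agent = params.get("agent")
    opts.project = params.get("project")
    if opts.agent and opts.project:
        opts.connection_mode = "agent"
    if params.get("theme") in ("light", "dark"):
        opts.theme = params["theme"]
    opts.greeting = params.get("greeting") != "false"
    return opts
```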

Text Mode

Text mode (?mode=text) provides a chat-style text input interface as an alternative to voice interaction.

Important: Text mode is powered by the Voice Live API's text modality, not the Foundry Agent Service REST API. It uses the same WebSocket connection and Voice Live session; the only difference is that user input is sent as text messages (send_text) instead of audio chunks. The agent's response audio can optionally be played back or muted.

Text mode features:

Chat-style message bubbles with auto-scroll
Audio playback toggle (speaker icon) — defaults to off; when enabled, the agent's spoken response audio plays through the browser
Barge-in — sending a new message automatically interrupts any playing agent audio
Text input bar with send button, positioned within the main content area
Dismiss (✕) button to end the session
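The audio toggle and barge-in behavior can be modeled as a small piece of state. `TextModePlayback` in this sketch is hypothetical and is not a frontend class. The `text` field in the outgoing `send_text` message is an assumed payload shape, since the protocol table names only the message type.

```python
import base64
from typing import Dict, List


class TextModePlayback:
    """Hypothetical model of text-mode audio handling (not the frontend's code)."""

    def __init__(self) -> None:
        self.audio_enabled = False      # speaker toggle defaults to off
        self.queue: List[bytes] = []    # decoded PCM16 chunks waiting to play

    def on_audio_data(self, b64_pcm16: str) -> None:
        # When muted, drop the chunk without base64-decoding it.
        if not self.audio_enabled:
            return
        self.queue.append(base64.b64decode(b64_pcm16))

    def on_user_send(self, text: str) -> Dict[str, str]:
        # Barge-in: discard any queued agent audio before sending the new message.
        self.queue.clear()
        return {"type": "send_text", "text": text}  # "text" field name is assumed
```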

Text vs Voice comparison:

Aspect	Voice Mode	Text Mode
User input	Microphone audio (PCM16)	Typed text messages
Agent response	Audio playback (always on)	Text bubbles (audio optional, default off)
Barge-in	Speaking interrupts agent	Sending text interrupts agent
Transport	Voice Live WebSocket	Voice Live WebSocket (same)
API	Voice Live text+audio modality	Voice Live text modality
Backend	Same session handler	Same session handler

Project Structure

voice-live-universal-assistant/
├── frontend/                  # Shared React + Vite + TypeScript frontend
│   ├── public/
│   │   ├── audio-capture-worklet.js    # Mic capture AudioWorklet (24kHz PCM16)
│   │   └── audio-playback-worklet.js   # Audio playback AudioWorklet
│   └── src/
│       ├── components/        # UI components (VoiceOrb, StartScreen, etc.)
│       ├── hooks/             # React hooks (useAudioCapture, useAudioPlayback, useVoiceSession)
│       ├── types.ts           # Shared TypeScript types
│       ├── App.tsx            # Root component
│       └── main.tsx           # Entry point
├── python/                    # Python backend (FastAPI + Voice Live SDK)
│   ├── app.py                 # FastAPI server with WebSocket endpoint
│   ├── voice_handler.py       # VoiceLiveHandler — SDK bridge
│   ├── tests/                 # 91 automated tests (settings + agent mode)
│   ├── requirements.txt       # Python dependencies
│   ├── .env.sample            # Environment variable template
│   └── README.md              # Python-specific docs
├── java/                      # Java backend (Spring Boot + Voice Live SDK)
│   ├── src/                   # Spring Boot application source
│   ├── pom.xml                # Maven config (azure-ai-voicelive 1.0.0-beta.5)
│   ├── .env.sample            # Environment variable template
│   └── README.md              # Java-specific docs
├── javascript/                # JavaScript/Node.js backend (Express + Voice Live SDK)
│   ├── app.js                 # Express server with WebSocket endpoint
│   ├── voiceHandler.js        # VoiceHandler — SDK bridge
│   ├── package.json           # npm config (@azure/ai-voicelive 1.0.0-beta.3)
│   ├── .env.sample            # Environment variable template
│   └── README.md              # JavaScript-specific docs
├── csharp/                    # C# ASP.NET Core backend (Voice Live SDK)
│   ├── Program.cs             # ASP.NET Core minimal API + WebSocket middleware
│   ├── VoiceLiveHandler.cs    # VoiceLiveHandler — SDK bridge
│   ├── SessionConfig.cs       # Session configuration POCO
│   ├── VoiceLiveWebApp.csproj # .NET project (Azure.AI.VoiceLive 1.1.0-beta.3)
│   ├── .env.sample            # Environment variable template
│   └── README.md              # C#-specific docs
├── infra/                     # Azure Bicep IaC
│   ├── main.bicep             # Entry point (Container Apps + optional Foundry + Agent)
│   ├── main-app.bicep         # Container App with Voice Live env vars
│   ├── main-infrastructure.bicep  # Log Analytics, ACR, Container Apps Env
│   ├── modules/
│   │   ├── foundry.bicep      # AI Foundry account + project (optional)
│   │   └── foundry-rbac.bicep # Azure AI User role for tracing
│   └── core/host/             # Reusable modules (container-app, container-registry)
├── deployment/
│   ├── hooks/
│   │   ├── postprovision.ps1  # RBAC assignment (+ Foundry RBAC when enabled)
│   │   ├── predeploy.ps1      # ACR cloud build + Container App update
│   │   └── postdeploy.ps1     # Foundry Agent creation (when createAgent=true)
│   └── scripts/
│       └── create_agent.py    # Agent creation with Voice Live metadata
├── tests/                     # E2E test suite (WebSocket + Playwright)
└── README.md                  # This file

Deployment (Azure Developer CLI)

Option 1: Basic — Container App only (default)

Deploys the web app and connects it to your existing Azure AI Services resource in model mode (no agent required):

azd auth login
azd init

# Required: set your Voice Live endpoint
azd env set AZURE_VOICELIVE_ENDPOINT "https://your-resource.cognitiveservices.azure.com/"

# Optional: choose backend language (default: python)
azd env set BACKEND_LANGUAGE java   # python | java | javascript | csharp

# Optional: API key (only if token auth is unavailable for your resource)
azd env set AZURE_VOICELIVE_API_KEY "your-api-key"

azd up

Want agent mode instead? See Option 3 for a fully automated setup, or configure manually:
azd env set VOICELIVE_MODE agent
azd env set AZURE_VOICELIVE_AGENT_NAME "your-agent-name"
azd env set AZURE_VOICELIVE_PROJECT "your-project-name"

This provisions:

Container Apps Environment with Log Analytics
Container Registry (ACR cloud build — no local Docker required)
Container App with system-assigned managed identity
RBAC — Cognitive Services User for token-based auth

Option 2: With Foundry — Create AI Foundry + model mode

Provisions a new AI Foundry resource with a gpt-realtime model deployment and configures the app for model mode, with no additional configuration required:

azd auth login
azd init
azd env set CREATE_FOUNDRY true
# Optional: choose backend language (default: python)
azd env set BACKEND_LANGUAGE java
azd up

This adds (fully automatic — no manual endpoint/model config needed):

AI Services Account (kind: AIServices) with system-assigned identity
AI Foundry Project under the account
gpt-4o-realtime-preview model deployment (as gpt-realtime)
Azure AI User + Azure AI Developer roles
Container App configured with provisioned endpoint + model mode

Option 3: With Agent — Foundry + GPT-4.1-mini + Foundry Agent

Full end-to-end: provisions Foundry, deploys GPT-4.1-mini, and creates an agent with Voice Live configuration — no additional configuration required:

azd auth login
azd init
azd env set CREATE_AGENT true
# Optional: customize agent name (default: voicelive-assistant)
azd env set AGENT_NAME "my-voice-assistant"
azd up

Note: CREATE_AGENT automatically enables CREATE_FOUNDRY — you don't need to set both.

This adds (fully automatic):

Everything from Option 2
GPT-4.1-mini model deployment (for the agent)
Foundry Agent created via Python SDK with Voice Live session config (Azure voice, semantic VAD, noise suppression, echo cancellation)
Container App configured with agent name, project, and agent mode

Deployment parameters

Parameter	Default	Description
`BACKEND_LANGUAGE`	`python`	Backend language: `python`, `java`, `javascript`, `csharp`
`AZURE_VOICELIVE_ENDPOINT`	—	Voice Live endpoint (required for Option 1; auto-set with Foundry)
`VOICELIVE_MODE`	`model`	Connection mode (`model` by default; auto-set to `agent` when `CREATE_AGENT=true`)
`AZURE_VOICELIVE_AGENT_NAME`	—	Agent name (auto-set when `CREATE_AGENT=true`)
`AZURE_VOICELIVE_PROJECT`	—	Foundry project (auto-set when Foundry is provisioned)
`CREATE_FOUNDRY`	`false`	Create AI Foundry account + project + model
`CREATE_AGENT`	`false`	Create Foundry Agent (implies `CREATE_FOUNDRY`; sets mode to `agent`)
`FOUNDRY_ACCOUNT_NAME`	auto-generated	Custom name for the AI Services account
`FOUNDRY_PROJECT_NAME`	`voicelive-project`	Name for the Foundry project
`AGENT_MODEL_DEPLOYMENT_NAME`	`gpt-4.1-mini`	Model deployment name for agent
`AGENT_NAME`	`voicelive-assistant`	Name for the created agent
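The dependencies between these parameters can be expressed as a small function. `CREATE_AGENT` implies `CREATE_FOUNDRY` and agent mode, and the endpoint is only required when nothing is provisioned. This sketch uses hypothetical names and is not the azd hooks' logic. In particular, deriving `AZURE_VOICELIVE_PROJECT` from `FOUNDRY_PROJECT_NAME` and `AZURE_VOICELIVE_AGENT_NAME` from `AGENT_NAME` is an assumption based on the table.

```python
from typing import Dict, Mapping

LANGUAGES = ("python", "java", "javascript", "csharp")


def _flag(env: Mapping[str, str], name: str) -> bool:
    return env.get(name, "false").strip().lower() == "true"


def effective_deploy_params(env: Mapping[str, str]) -> Dict[str, str]:
    """Derive effective deployment settings from azd env values (hypothetical helper)."""
    create_agent = _flag(env, "CREATE_AGENT")
    create_foundry = _flag(env, "CREATE_FOUNDRY") or create_agent  # CREATE_AGENT implies CREATE_FOUNDRY
    language = env.get("BACKEND_LANGUAGE", "python")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported BACKEND_LANGUAGE: {language!r}")
    params = {
        "BACKEND_LANGUAGE": language,
        "CREATE_FOUNDRY": str(create_foundry).lower(),
        "CREATE_AGENT": str(create_agent).lower(),
        # CREATE_AGENT forces agent mode; otherwise model mode unless set explicitly.
        "VOICELIVE_MODE": "agent" if create_agent else env.get("VOICELIVE_MODE", "model"),
    }
    if create_foundry:
        # Assumption: the provisioned project name is what gets auto-set.
        params["AZURE_VOICELIVE_PROJECT"] = env.get("FOUNDRY_PROJECT_NAME", "voicelive-project")
    elif not env.get("AZURE_VOICELIVE_ENDPOINT"):
        raise ValueError("AZURE_VOICELIVE_ENDPOINT is required when no Foundry resource is created")
    if create_agent:
        # Assumption: the created agent's name is what gets auto-set.
        params["AZURE_VOICELIVE_AGENT_NAME"] = env.get("AGENT_NAME", "voicelive-assistant")
    return params
```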

Development

For local development with hot-reload on both frontend and backend:

Terminal 1 — Frontend dev server (with proxy to backend):

cd frontend
npm run dev

Terminal 2 — Python backend:

cd python
.venv\Scripts\Activate.ps1   # or source .venv/bin/activate
python app.py

The Vite dev server at http://localhost:5173 proxies WebSocket and API calls to http://localhost:8000.

WebSocket Protocol

The frontend and backend communicate over WebSocket at /ws/{clientId}.

Direction	Message	Description
Client → Server	`start_session`	Begin voice session with config
Client → Server	`audio_chunk`	Base64 PCM16 mic audio (24kHz, mono)
Client → Server	`interrupt`	Cancel current agent response
Client → Server	`send_text`	Send text message (text mode)
Client → Server	`stop_session`	End the session
Server → Client	`session_started`	Session ready, includes config and session_id
Server → Client	`audio_data`	Base64 PCM16 agent audio response
Server → Client	`transcript`	User or assistant transcript text
Server → Client	`status`	State change (listening/thinking/speaking)
Server → Client	`stop_playback`	Stop audio playback (barge-in)
Server → Client	`session_stopped`	Session ended
Server → Client	`error`	Error message
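Here is a minimal client-side sketch of the protocol. It assumes JSON text frames with a `type` field. The message types come from the table above, but the `data` field name and the 20 ms chunk duration are assumptions. At 24 kHz mono PCM16, one second of audio is 24,000 × 2 = 48,000 bytes, so a 20 ms chunk is 960 bytes.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 24_000    # Hz, mono PCM16 per the protocol table
BYTES_PER_SAMPLE = 2    # 16-bit samples
CHUNK_MS = 20           # assumption: chunk duration chosen for illustration


def audio_chunk_messages(pcm16: bytes) -> Iterator[str]:
    """Split raw PCM16 audio into audio_chunk JSON messages ("data" field name assumed)."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 960 bytes per 20 ms
    for start in range(0, len(pcm16), chunk_bytes):
        piece = pcm16[start:start + chunk_bytes]
        yield json.dumps({"type": "audio_chunk", "data": base64.b64encode(piece).decode("ascii")})


# What a client might do for each server message type (illustrative descriptions).
SERVER_ACTIONS = {
    "session_started": "store session_id and config",
    "audio_data": "decode and queue agent audio",
    "transcript": "append transcript text",
    "status": "update listening/thinking/speaking indicator",
    "stop_playback": "flush queued audio (barge-in)",
    "session_stopped": "reset UI",
    "error": "show error",
}


def describe_server_message(raw: str) -> str:
    kind = json.loads(raw).get("type")
    return SERVER_ACTIONS.get(kind, f"ignore unknown message {kind!r}")
```

A client would send each string from `audio_chunk_messages` over a WebSocket connected to `/ws/{clientId}`, after a `start_session` message. The E2E tests, for example, use the `websockets` package.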

SDK Versions & Known Issues

All backends pin the API version to 2026-01-01-preview, because the SDKs default to the GA API version, which lacks agent mode and interim response support.

Backend	SDK	Version	Language-Specific Notes
Python	`azure-ai-voicelive`	1.0.0b1	No known limitations
Java	`azure-ai-voicelive`	1.0.0-beta.5	`.env` loaded via custom parser; Netty version mismatch warning (no runtime impact)
JavaScript	`@azure/ai-voicelive`	1.0.0-beta.3	Node.js 20+ required. No known limitations
C#	`Azure.AI.VoiceLive`	1.1.0-beta.3	No known limitations

Frontend UX Guards

Interim Response is disabled (greyed out) when a realtime model is selected in model mode — it only works with agent mode or text models using cascaded pipelines (Azure Speech transcription).
Start Session button is disabled in agent mode while Agent Name or Project is empty, with a helper message directing the user to Settings.
Transcription model is auto-corrected to azure-speech when a text model is selected (cascaded pipelines only support azure-speech).
Agent mode toggle is disabled in Settings when no agent is configured on the server (agent name and project both empty). This prevents users from switching to agent mode when the server has no agent setup.
Audio playback in text mode defaults to off. When muted, audio data from the service is completely skipped (not decoded) to avoid unnecessary resource usage.
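The Interim Response, Start Session, and agent-mode toggle guards above reduce to simple predicates. This sketch uses hypothetical names. Detecting a realtime model by the substring `realtime` (as in `gpt-realtime`) is an assumption; the frontend does not necessarily classify models this way.

```python
from dataclasses import dataclass


def is_realtime_model(model: str) -> bool:
    # Assumption: realtime models carry "realtime" in their name (e.g. gpt-realtime).
    return "realtime" in model.lower()


@dataclass
class SettingsState:
    mode: str                      # "model" or "agent"
    model: str
    agent_name: str = ""
    project: str = ""
    server_agent_name: str = ""    # agent configured on the server, if any
    server_project: str = ""


def interim_response_enabled(s: SettingsState) -> bool:
    # Greyed out for realtime models in model mode; allowed in agent mode or for text models.
    return s.mode == "agent" or not is_realtime_model(s.model)


def start_session_enabled(s: SettingsState) -> bool:
    # In agent mode, both Agent Name and Project must be filled in.
    return s.mode != "agent" or bool(s.agent_name and s.project)


def agent_toggle_enabled(s: SettingsState) -> bool:
    # Disabled only when the server has neither an agent name nor a project.
    return bool(s.server_agent_name or s.server_project)
```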

Validation Guard Matrix

Shows where each validation is enforced — frontend-only guards rely on the UI to prevent invalid input, while backend guards provide server-side enforcement.

Guard	Frontend	Python	Java	JavaScript	C#
Agent mode requires name + project	✅ Disables Start btn	— SDK validates	✅ Falls back to model	— SDK validates	✅ Falls back to model
Transcribe model auto-correction	✅ On model change	—	—	✅ For agent/cascaded	✅ For cascaded models
Interim response disabled for realtime	✅ Greys out toggle	—	—	—	❌ SDK gap (ignored)
Session cleanup on start failure	—	✅	✅	✅	✅
Auth identity requires resource override	—	✅	✅	✅	✅

Legend: ✅ = enforced, — = not needed / passed through to SDK, ❌ = not supported

Backend Feature Matrix

Feature	Python	Java	JavaScript	C#
Model mode (realtime)	✅	✅	✅	✅
Model mode (text/cascaded)	✅	✅	✅	✅
Agent mode	✅	✅	✅	✅
Interim response	✅	✅	✅	❌ SDK gap
Echo cancellation	✅	✅	✅	✅
Noise reduction	✅	✅	✅	✅

Future Improvements

Agent mode fail-fast: When mode=agent but agentName or projectName are missing, the C# and Java backends silently fall back to model mode. The frontend already prevents this (Start button is disabled until both fields are set), but the backends should return an explicit error instead of downgrading. Python and JavaScript pass the config through and let the SDK validate.
Align backend validation guards: As shown in the Validation Guard Matrix, transcribe model auto-correction and interim response guards are inconsistent across backends. Python and Java rely entirely on the frontend for both; JavaScript and C# enforce transcribe model correction, but no backend enforces the interim response guard (C# ignores the setting because of an SDK gap). All backends should enforce the same server-side guards to prevent invalid configurations when the frontend is bypassed (e.g., direct WebSocket clients).
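Here is a sketch of a server-side guard that all backends could share. It assumes a `start_session` config with hypothetical keys: `mode`, `agent_name`, `project`, `model`, `transcribe_model`, and `interim_response`. Instead of downgrading to model mode, it returns an error string that the backend could send back as an `error` message. The realtime-model check by name is an assumption.

```python
from typing import Any, Dict, Tuple


def validate_session_config(cfg: Dict[str, Any]) -> Tuple[Dict[str, Any], str]:
    """Normalize a start_session config server-side (hypothetical; field names assumed).

    Returns (config, error). A non-empty error should be reported to the client
    rather than silently falling back to model mode.
    """
    out = dict(cfg)  # never mutate the caller's dict
    if out.get("mode") == "agent":
        if not (out.get("agent_name") and out.get("project")):
            return out, "agent mode requires both agent_name and project"
        return out, ""
    realtime = "realtime" in str(out.get("model", "")).lower()  # assumed heuristic
    if not realtime:
        # Cascaded (text-model) pipelines only support azure-speech transcription.
        out["transcribe_model"] = "azure-speech"
    elif out.get("interim_response"):
        out["interim_response"] = False  # not supported with realtime models
    return out, ""
```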

Testing

An E2E test suite is available in tests/e2e_all_backends.py covering all four backends with two test types:

WebSocket tests — connect directly to the backend WebSocket endpoint, send a start_session message, stream real WAV audio as PCM16 chunks, and verify that audio and transcript responses are received.
Playwright browser tests — open the frontend UI in a headless Chromium browser with a mocked microphone (oscillator tone), click Start, and verify the page loads, the voice orb renders, and a session becomes active.

Prerequisites

pip install websockets playwright
python -m playwright install chromium

WebSocket tests use WAV audio files when available, or fall back to a synthetic 440Hz sine wave. Set E2E_AUDIO_DIR to a folder containing .wav files for real speech testing. Backend URLs can be overridden via environment variables (E2E_PYTHON_URL, E2E_CSHARP_URL, E2E_JAVASCRIPT_URL, E2E_JAVA_URL).
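The audio selection is easy to reproduce. This sketch builds a 440 Hz sine wave as little-endian PCM16 at 24 kHz mono. If a `.wav` file in the directory already matches that format, the sketch uses the first one instead. Both helpers are hypothetical and are not the suite's code. Skipping WAV files in other formats, rather than resampling them, is an assumption.

```python
import math
import struct
import wave
from pathlib import Path
from typing import Optional

SAMPLE_RATE = 24_000  # Hz; matches the 24 kHz mono PCM16 the backends expect


def sine_pcm16(freq_hz: float = 440.0, seconds: float = 1.0, amplitude: float = 0.3) -> bytes:
    """Mono, little-endian 16-bit sine tone (the synthetic fallback)."""
    n = int(SAMPLE_RATE * seconds)
    peak = int(32767 * amplitude)
    samples = [int(peak * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)) for i in range(n)]
    return struct.pack(f"<{n}h", *samples)


def load_test_audio(audio_dir: Optional[str] = None) -> bytes:
    """Return PCM16 from the first matching .wav in audio_dir, else the 440 Hz tone."""
    if audio_dir:
        for path in sorted(Path(audio_dir).glob("*.wav")):
            with wave.open(str(path), "rb") as wav:
                # Assumption: only files already at 24 kHz, mono, 16-bit are used as-is.
                if (wav.getframerate(), wav.getnchannels(), wav.getsampwidth()) == (SAMPLE_RATE, 1, 2):
                    return wav.readframes(wav.getnframes())
    return sine_pcm16()
```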

Running Tests

cd voice-live-universal-assistant

# All backends, both test types (model mode)
python tests/e2e_all_backends.py

# WebSocket tests only
python tests/e2e_all_backends.py --ws-only

# Playwright browser tests only
python tests/e2e_all_backends.py --browser-only

# Agent mode (default is model)
python tests/e2e_all_backends.py --mode agent

# Single backend URL
python tests/e2e_all_backends.py --url https://your-backend.azurecontainerapps.io

Backend URLs default to the development deployment and can be overridden via environment variables.
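The URL override rule can be sketched as follows. The environment variable names come from the Prerequisites paragraph above. `backend_urls` is hypothetical, and the caller supplies the defaults because the suite's development URLs are not listed here.

```python
import os
from typing import Dict, Mapping

BACKENDS = ("python", "csharp", "javascript", "java")


def backend_urls(defaults: Mapping[str, str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Apply E2E_<BACKEND>_URL overrides (e.g. E2E_JAVA_URL) on top of caller-supplied defaults."""
    return {name: env.get(f"E2E_{name.upper()}_URL") or defaults[name] for name in BACKENDS}
```

For example, to point only the Java tests at a local server, run `E2E_JAVA_URL=http://localhost:8000 python tests/e2e_all_backends.py`.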

License

This project is licensed under the MIT License - see the LICENSE file for details.
