Use Voice Live with hosted agents (preview)

Here's how to integrate Azure Voice Live with your Microsoft Foundry hosted agents to enable real-time voice interaction.

Once you deploy a hosted agent to Microsoft Foundry, you can add real-time voice interaction using the Azure VoiceLive SDK.

Hosted agents support two protocols: Responses and Invocations. Voice Live works with both, with small differences in setup. This article covers each protocol in turn.

Prerequisites

Install the required packages:

pip install azure-ai-voicelive[aiohttp]==1.2.0b5 azure-identity pyaudio

Note

On Linux, install PortAudio first: sudo apt-get install -y portaudio19-dev libasound2-dev

Use Voice Live with a Responses protocol agent

For agents that use the Responses protocol, Voice Live connects directly through the AgentSessionConfig.

agent_config: AgentSessionConfig = {
    "agent_name": "<your-agent-name>",
    "project_name": "<your-foundry-project-name>",
}

Configure the session

The following example shows how to initialize a Voice Live session with a hosted agent using the AgentSessionConfig:

import asyncio

from azure.ai.voicelive.aio import connect, AgentSessionConfig
from azure.ai.voicelive.models import (
    AudioEchoCancellation,
    AudioNoiseReduction,
    AzureStandardVoice,
    InputAudioFormat,
    Modality,
    OutputAudioFormat,
    RequestSession,
    ServerVad,
)
from azure.identity.aio import DefaultAzureCredential

endpoint = "https://<your-foundry-resource>.services.ai.azure.com"

agent_config: AgentSessionConfig = {
    "agent_name": "<your-agent-name>",
    "project_name": "<your-foundry-project-name>",
}


async def main() -> None:
    # "async with" is only valid inside a coroutine, so the session setup lives in main().
    # Entering the credential as a context manager closes it when the session ends.
    async with DefaultAzureCredential() as credential, connect(
        endpoint=endpoint,
        credential=credential,
        agent_config=agent_config,
    ) as connection:
        # Configure session settings
        session_config = RequestSession(
            modalities=[Modality.TEXT, Modality.AUDIO],
            voice=AzureStandardVoice(name="en-US-Ava:DragonHDLatestNeural"),
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            turn_detection=ServerVad(),
            input_audio_echo_cancellation=AudioEchoCancellation(),
            input_audio_noise_reduction=AudioNoiseReduction(
                type="azure_deep_noise_suppression"
            ),
        )
        await connection.session.update(session=session_config)


asyncio.run(main())

Reference: VoiceLive SDK (azure-ai-voicelive) | AgentSessionConfig

Process events

After configuring the session, capture microphone audio and handle events from the connection. The key events for voice interaction are:

| Event | Description |
| --- | --- |
| SESSION_UPDATED | Session is ready. Start audio capture. |
| INPUT_AUDIO_BUFFER_SPEECH_STARTED | User started speaking. Skip any queued playback audio. |
| INPUT_AUDIO_BUFFER_SPEECH_STOPPED | User stopped speaking. |
| RESPONSE_AUDIO_DELTA | Incremental audio from the agent. Queue for playback. |
| RESPONSE_AUDIO_DONE | Agent finished speaking. |
| CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED | Transcription of user speech. |
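
The reactions listed for these events can be sketched as a small state dispatcher. This is a minimal illustration that deliberately does not import the SDK: events are plain strings named after the table rows, and PlaybackState and handle_event are hypothetical helpers, not part of azure-ai-voicelive. A real client would compare against the SDK's own event type values and feed an actual audio device.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlaybackState:
    """Hypothetical client-side state; not part of the Voice Live SDK."""

    capturing: bool = False        # True once microphone capture should run
    agent_speaking: bool = False   # True while agent audio is still arriving
    playback: deque = field(default_factory=deque)    # audio chunks waiting to play
    transcripts: list = field(default_factory=list)   # transcriptions of user speech


def handle_event(
    state: PlaybackState, event_type: str, payload: Optional[object] = None
) -> None:
    """Apply one event, named as in the event table, to the client state."""
    if event_type == "SESSION_UPDATED":
        state.capturing = True
    elif event_type == "INPUT_AUDIO_BUFFER_SPEECH_STARTED":
        # The user interrupted: drop agent audio that has not been played yet.
        state.playback.clear()
    elif event_type == "RESPONSE_AUDIO_DELTA":
        state.agent_speaking = True
        state.playback.append(payload)
    elif event_type == "RESPONSE_AUDIO_DONE":
        state.agent_speaking = False
    elif event_type == "CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED":
        state.transcripts.append(payload)
    # INPUT_AUDIO_BUFFER_SPEECH_STOPPED and unknown events change nothing here.
```

Keeping the state changes in one pure function makes the interruption behavior easy to test without a microphone or a network connection.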

Sample Voice Live client

For a complete working example, see voicelive_client.py on GitHub.

Use the shared Voice Live client to connect to your deployed agent. The client authenticates with DefaultAzureCredential, so sign in first with az login.

python voicelive_client.py \
  --endpoint "https://<your-foundry-resource>.services.ai.azure.com" \
  --agent-name "<your-agent-name>" \
  --project-name "<your-foundry-project-name>"

Speak into your microphone. The agent responds with synthesized speech. Press Ctrl+C to end the session.
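
To show how the three command-line flags relate to the session configuration, here is a minimal sketch of argument parsing for a client of this kind. It is not the code of voicelive_client.py: parse_client_args is a hypothetical helper, and only the flag names and the agent_name and project_name keys come from this article.

```python
import argparse
from typing import Dict, List, Optional, Tuple


def parse_client_args(argv: Optional[List[str]] = None) -> Tuple[str, Dict[str, str]]:
    """Parse the three flags from the command above; return (endpoint, agent_config)."""
    parser = argparse.ArgumentParser(description="Voice Live client (illustrative sketch)")
    parser.add_argument("--endpoint", required=True, help="Foundry resource endpoint URL")
    parser.add_argument("--agent-name", required=True, help="Name of the hosted agent")
    parser.add_argument("--project-name", required=True, help="Foundry project name")
    args = parser.parse_args(argv)

    # Same keys as the AgentSessionConfig dictionary used earlier in this article.
    agent_config = {
        "agent_name": args.agent_name,
        "project_name": args.project_name,
    }
    return args.endpoint, agent_config
```

The returned endpoint and dictionary could then be passed to connect() as endpoint and agent_config.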

Use Voice Live with an Invocations protocol agent

For agents that use the Invocations protocol, Voice Live handles speech-to-text and text-to-speech while the agent processes structured input and returns streaming text.

Make an Invocations agent compatible with Voice Live

An Invocations agent must meet three requirements to work with Voice Live:

  1. Accept voice transcription input: The agent must process incoming messages in this format:

    {"type": "input_audio.transcription", "input": "example voice input"}
    
  2. Return streaming text as SSE events: The agent must return text to be spoken as server-sent events (SSE) using the output_audio_transcription event types. Voice Live synthesizes audio from the delta text:

    data: {"type": "output_audio_transcription.delta", "delta": "The weather "}
    data: {"type": "output_audio_transcription.delta", "delta": "in Seattle "}
    data: {"type": "output_audio_transcription.delta", "delta": "is 52°F "}
    data: {"type": "output_audio_transcription.delta", "delta": "and partly cloudy."}
    data: {"type": "output_audio_transcription.done", "text": "The weather in Seattle is 52°F and partly cloudy."}
    data: {"type": "done"}
    
  3. Declare Voice Live compatibility in the agent manifest: Add voiceLiveCompatible: "true" to the metadata section of your agent.manifest.yaml:

    metadata:
      voiceLiveCompatible: "true"
    

For a complete Invocations agent sample compatible with Voice Live, see hello-world-invocations-voicelive on GitHub.
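
The first two requirements can be sketched without any web framework. The helper names read_transcription and stream_reply are hypothetical, and how the strings are written to the HTTP response depends on the server you use. The sketch assumes standard SSE framing, where each event is a data: line followed by a blank line.

```python
import json
from typing import Iterable, Iterator


def read_transcription(body: str) -> str:
    """Extract the user's transcribed speech from an incoming request body."""
    message = json.loads(body)
    if message.get("type") != "input_audio.transcription":
        raise ValueError(f"unsupported message type: {message.get('type')!r}")
    return message["input"]


def _sse(event: dict) -> str:
    # One SSE event: a "data:" line terminated by a blank line.
    # ensure_ascii=False keeps characters such as the degree sign readable.
    return f"data: {json.dumps(event, ensure_ascii=False)}\n\n"


def stream_reply(chunks: Iterable[str]) -> Iterator[str]:
    """Yield one delta event per text chunk, then the two closing events."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield _sse({"type": "output_audio_transcription.delta", "delta": chunk})
    # The done event carries the full text assembled from all deltas.
    yield _sse({"type": "output_audio_transcription.done", "text": "".join(parts)})
    yield _sse({"type": "done"})
```

Because stream_reply is a generator, each delta can be sent as soon as the agent produces it, which lets speech synthesis start before the full reply is ready.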

Connect with Voice Live

After you deploy your Invocations agent, start a voice conversation with the same Voice Live client you would use for a Responses protocol agent. See Sample Voice Live client for the connection steps.