TTS: support per-request instructions/style (unblocks Qwen3-TTS VoiceDesign & per-line emotion) #10164

**Is your feature request related to a problem? Please describe.**

The OpenAI-compatible TTS endpoint (`POST /v1/audio/speech`) accepts an `instructions` field, but there is currently no way to pass any per-request style/voice instruction to a TTS backend. Two layers cause this:

1. The gRPC `TTSRequest` message only carries `text`, `model`, `dst`, `voice`, `language`. There is no field for an instruction/style/description string, so the OpenAI `instructions` value is silently dropped at the HTTP→gRPC boundary and never reaches the backend.
2. Backends can therefore only read such a value from static YAML `options`. Example: the `qwen-tts` backend reads `instruct = self.options.get("instruct", "")` — a model-config-level setting, identical for every request, changeable only by editing the model YAML and reloading.
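
A minimal Python sketch of the first layer, to show where the value is lost. The `TTSRequest` dataclass is only a stand-in for the gRPC message, using the five fields listed above. The helper `to_tts_request`, the model name, and the HTTP key names other than `instructions` are assumptions for illustration, not LocalAI's actual handler code.

```python
from dataclasses import dataclass, fields


@dataclass
class TTSRequest:
    """Stand-in for the gRPC message, limited to the five fields listed above."""
    text: str = ""
    model: str = ""
    dst: str = ""
    voice: str = ""
    language: str = ""


def to_tts_request(body: dict, dst: str) -> TTSRequest:
    # Hypothetical mapping: only keys with a matching message field survive.
    return TTSRequest(
        text=body.get("input", ""),
        model=body.get("model", ""),
        dst=dst,
        voice=body.get("voice", ""),
        language=body.get("language", ""),
    )


body = {"model": "qwen-tts", "input": "Hello.", "instructions": "Whisper, frightened."}
request = to_tts_request(body, dst="/tmp/out.wav")

# The set of field names the message can carry; "instructions" is not among them.
carried = {f.name for f in fields(request)}
```

Whatever the client puts in `instructions`, the message has nowhere to hold it, so the backend never sees it.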

This blocks two distinct, important capabilities of modern TTS models:

- **Per-line emotion/style** (e.g. Qwen3-TTS CustomVoice, Chatterbox): the speaker/voice stays fixed, but each request should be able to set a different tone.
- **VoiceDesign / describe-a-voice generation** (Qwen3-TTS VoiceDesign): here the instruction string IS the voice definition (identity + timbre + emotion in one description). Because it can only be set as a single static YAML option today, **VoiceDesign is effectively limited to one designed voice per model config** — you cannot use multiple designed voices (one per character/line) at all. This is a hard blocker for interactive clients (roleplay/narration via e.g. SillyTavern, Marinara Engine) that need many distinct voices.
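
To make the VoiceDesign case concrete, here is a rough sketch of what a roleplay client would want to send: one request body per line of dialogue, each with its own `instructions` string. The helper `build_speech_request`, the model name `qwen3-tts-voicedesign`, and the voice descriptions are hypothetical; `model`, `input`, and `instructions` are the field names of the OpenAI speech request.

```python
import json


def build_speech_request(model: str, text: str, instructions: str) -> str:
    """Return a JSON body for POST /v1/audio/speech (hypothetical helper)."""
    return json.dumps({"model": model, "input": text, "instructions": instructions})


# (voice description, spoken line) pairs, one designed voice per character.
dialogue = [
    ("A gravelly old sailor, slow and weary", "The storm took everything."),
    ("A bright, excited young girl", "Look, the harbour lights!"),
]

bodies = [
    build_speech_request("qwen3-tts-voicedesign", text, description)
    for description, text in dialogue
]
```

With the current schema both requests would be rendered with the single static `instruct` value from the model YAML, regardless of what each body contains.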

**Describe the solution you'd like**

Add a generic, optional per-request instruction/style field to the TTS API and plumb it end to end:

1. Add an optional field to the gRPC `TTSRequest` proto, e.g. `string instructions = N;` (a single free-form string; optionally also a small `map<string,string>` for backend-specific numeric params).
2. Map the OpenAI `instructions` field from `POST /v1/audio/speech` into that gRPC field in the core handler.
3. In each backend, prefer the per-request value when present and fall back to the existing YAML `options` value when empty (fully backward compatible).
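
Step 3 could look roughly like the following in a Python backend. This is a sketch under assumptions: the function name `resolve_instruction` is hypothetical, and the `instruct` option key is taken from the `qwen-tts` example above. An unset proto3 string field reads as an empty string, so "empty" is treated as "not provided".

```python
def resolve_instruction(request_value: str, options: dict) -> str:
    """Prefer the per-request instruction; fall back to the static YAML option."""
    # Whitespace-only values are treated as absent (an assumption of this sketch).
    if request_value and request_value.strip():
        return request_value
    return options.get("instruct", "")


yaml_options = {"instruct": "Calm, neutral narrator"}

per_request = resolve_instruction("Angry, shouting", yaml_options)
fallback = resolve_instruction("", yaml_options)
unset = resolve_instruction("", {})
```

Existing configs keep working unchanged: a client that never sends `instructions` gets the YAML value exactly as it does today.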

For the `qwen-tts` backend this is a small change: it already auto-selects mode (AudioPath → VoiceClone, `instruct` → VoiceDesign, `voice` → CustomVoice) and already consumes an `instruct` string — it just currently reads it only from `self.options`. Routing a per-request value into that same path would immediately enable:
- CustomVoice: per-request emotion on the 9 preset speakers.
- VoiceDesign: a different voice description per request (i.e. actually usable for many voices).

**Describe alternatives you've considered**

- **Static YAML `instruct`/options**: one fixed instruction per model config. For CustomVoice that means one fixed emotion; for VoiceDesign it means one fixed voice — neither works for dynamic, per-request use.
- **One model config per voice/mood + switching models per request**: huge overhead, extra VRAM churn from reloads, and impractical for dozens of voices.
- **A custom OpenAI-compatible bridge in front of the model**: re-implements the TTS API outside LocalAI and duplicates work; defeats the purpose of using LocalAI's TTS endpoint.
- **Client-side workarounds**: impossible. The gRPC `TTSRequest` has no field to carry the value, so nothing a client sends can reach the backend.

**Additional context**

- This generalizes #5979 ("Expose Chatterbox TTS model arguments on the Chatterbox backend"). That issue describes the same class of problem (per-request TTS params not exposed) but is specific to one backend. A generic per-request instruction field would cover Chatterbox, Qwen3-TTS CustomVoice/VoiceDesign, and future expressive backends.
- Backends that would benefit immediately: `qwen-tts` (Qwen3-TTS-12Hz-1.7B-CustomVoice / -VoiceDesign), Chatterbox.
- Observed on the `cuda13-qwen-tts` backend; `TTSRequest` field set confirmed as `text, model, dst, voice, language` (compiled `backend_pb2`).
- LocalAI image: `localai/localai:master-gpu-nvidia-cuda-13`, build `437f0fa` (master, pulled ~2026-05-25).
- Backends incapable of style control (e.g. pure Base voice-cloning) would simply ignore the field — the request is only about not dropping it before the backend can decide.
