TTS: support per-request instructions/style (unblocks Qwen3-TTS VoiceDesign & per-line emotion) #10164

@GerdW


Is your feature request related to a problem? Please describe.

The OpenAI-compatible TTS endpoint (POST /v1/audio/speech) accepts an instructions field, but there is currently no way to pass any per-request style/voice instruction to a TTS backend. Two layers cause this:

  1. The gRPC TTSRequest message only carries text, model, dst, voice, language. There is no field for an instruction/style/description string, so the OpenAI instructions value is silently dropped at the HTTP→gRPC boundary and never reaches the backend.
  2. Backends can therefore only read such a value from static YAML options. Example: the qwen-tts backend reads instruct = self.options.get("instruct", "") — a model-config-level setting, identical for every request, changeable only by editing the model YAML and reloading.
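The drop described in point 1 can be modelled in a few lines. This is only an illustrative sketch, not LocalAI's actual handler code: `TTSRequest` here is a plain dataclass standing in for the gRPC message (with the five fields listed above), and `to_grpc` is a hypothetical mapping function. The voice name and output path are made up.

```python
from dataclasses import dataclass, fields


@dataclass
class TTSRequest:
    """Stand-in for the gRPC message; only the five fields named in this issue."""
    text: str = ""
    model: str = ""
    dst: str = ""
    voice: str = ""
    language: str = ""


def to_grpc(body: dict, dst: str) -> TTSRequest:
    """Hypothetical HTTP-to-gRPC mapping: copies only what the message can hold."""
    return TTSRequest(
        text=body.get("input", ""),
        model=body.get("model", ""),
        dst=dst,
        voice=body.get("voice", ""),
        language=body.get("language", ""),
    )


# An OpenAI-style request body that includes `instructions`.
body = {
    "model": "qwen-tts",
    "input": "Hello.",
    "voice": "narrator",  # made-up voice name
    "instructions": "whispering, tense",
}

req = to_grpc(body, "/tmp/out.wav")
carried = {f.name for f in fields(req)}

# The message has no slot for the value, so it cannot reach the backend.
print("instructions" in carried)
```

Whatever the real handler does with the JSON body, the outcome is the same as long as the message schema lacks the field: the backend never sees the string.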

This blocks two distinct, important capabilities of modern TTS models:

  • Per-line emotion/style (e.g. Qwen3-TTS CustomVoice, Chatterbox): the speaker/voice stays fixed, but each request should be able to set a different tone.
  • VoiceDesign / describe-a-voice generation (Qwen3-TTS VoiceDesign): here the instruction string IS the voice definition (identity + timbre + emotion in one description). Because it can only be set as a single static YAML option today, VoiceDesign is effectively limited to one designed voice per model config — you cannot use multiple designed voices (one per character/line) at all. This is a hard blocker for interactive clients (roleplay/narration via e.g. SillyTavern, Marinara Engine) that need many distinct voices.

Describe the solution you'd like

Add a generic, optional per-request instruction/style field to the TTS API and plumb it end to end:

  1. Add an optional field to the gRPC TTSRequest proto, e.g. string instructions = N; (a single free-form string; optionally also a small map<string,string> for backend-specific numeric params).
  2. Map the OpenAI instructions field from POST /v1/audio/speech into that gRPC field in the core handler.
  3. In each backend, prefer the per-request value when present and fall back to the existing YAML options value when empty (fully backward compatible).
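Step 3 could look roughly like the following on the backend side. This is a minimal sketch under assumptions: `resolve_instruct` is a hypothetical helper name, `request_instructions` stands for the value of the proposed new gRPC field, and `options` is the dict the backend already builds from the model YAML (the `"instruct"` key is the one qwen-tts reads today).

```python
def resolve_instruct(request_instructions, options):
    """Prefer the per-request instruction; fall back to the static YAML option.

    request_instructions: value of the proposed per-request field
                          (an unset proto3 string arrives as "").
    options:              dict of model-config options, as read from YAML.
    """
    per_request = (request_instructions or "").strip()
    if per_request:
        return per_request
    # Same lookup the backend performs today, so existing configs are unaffected.
    return options.get("instruct", "")


yaml_options = {"instruct": "calm, neutral narrator"}

print(resolve_instruct("angry, shouting", yaml_options))  # per-request value wins
print(resolve_instruct("", yaml_options))                 # falls back to YAML
print(resolve_instruct("", {}))                           # nothing set anywhere
```

Treating an empty or whitespace-only string as "not set" is a design choice in this sketch; it keeps behaviour identical to today for clients that never send the field.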

For the qwen-tts backend this is a small change: it already auto-selects mode (AudioPath → VoiceClone, instruct → VoiceDesign, voice → CustomVoice) and already consumes an instruct string — it just currently reads it only from self.options. Routing a per-request value into that same path would immediately enable:

  • CustomVoice: per-request emotion on the 9 preset speakers.
  • VoiceDesign: a different voice description per request (i.e. actually usable for many voices).
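From the client side, the two cases above would differ only in the request body sent to POST /v1/audio/speech. The sketch below just builds those bodies and does not send them. The `model`, `input`, `voice` and `instructions` keys are the OpenAI speech API fields; the model names, the speaker name and the helper `speech_payload` are placeholders for illustration.

```python
import json


def speech_payload(model, text, voice=None, instructions=None):
    """Build an OpenAI-style /v1/audio/speech body; optional keys are omitted when unset."""
    payload = {"model": model, "input": text}
    if voice:
        payload["voice"] = voice
    if instructions:
        payload["instructions"] = instructions
    return payload


# CustomVoice: fixed speaker, different emotion per line (names are placeholders).
custom_voice = [
    speech_payload("qwen-tts-customvoice", "You came back.", voice="speaker_a",
                   instructions="relieved, soft"),
    speech_payload("qwen-tts-customvoice", "Get out!", voice="speaker_a",
                   instructions="furious, loud"),
]

# VoiceDesign: no preset speaker; the instruction string describes the voice itself.
voice_design = [
    speech_payload("qwen-tts-voicedesign", "Welcome, traveller.",
                   instructions="elderly man, gravelly, warm"),
    speech_payload("qwen-tts-voicedesign", "Who goes there?",
                   instructions="young woman, sharp, suspicious"),
]

for body in custom_voice + voice_design:
    print(json.dumps(body))
```

With the field plumbed through, a single model config could serve all four requests; today each distinct `instructions` value would need its own YAML config.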

Describe alternatives you've considered

  • Static YAML instruct/options: one fixed instruction per model config. For CustomVoice that means one fixed emotion; for VoiceDesign it means one fixed voice — neither works for dynamic, per-request use.
  • One model config per voice/mood + switching models per request: huge overhead, extra VRAM churn from reloads, and impractical for dozens of voices.
  • A custom OpenAI-compatible bridge in front of the model: re-implements the TTS API outside LocalAI and duplicates work; defeats the purpose of using LocalAI's TTS endpoint.
  • Client-side workarounds: impossible — the value cannot survive the gRPC schema, so no client can deliver it.

Additional context

  • This generalizes #5979 ("Expose Chatterbox TTS model arguments on the Chatterbox backend"): same class of problem (per-request TTS params not exposed), but backend-specific. A generic per-request instruction field would cover Chatterbox, Qwen3-TTS CustomVoice/VoiceDesign, and future expressive backends.
  • Backends that would benefit immediately: qwen-tts (Qwen3-TTS-12Hz-1.7B-CustomVoice / -VoiceDesign), Chatterbox.
  • Observed on the cuda13-qwen-tts backend; TTSRequest field set confirmed as text, model, dst, voice, language (compiled backend_pb2).
  • LocalAI image: localai/localai:master-gpu-nvidia-cuda-13, build 437f0fa (master, pulled ~2026-05-25).
  • Backends incapable of style control (e.g. pure Base voice-cloning) would simply ignore the field — the request is only about not dropping it before the backend can decide.

Labels: enhancement
