feat(tts): support per-request instructions and params #10172

Merged: mudler merged 1 commit into master from worktree-tts-per-request-instructions on Jun 4, 2026

@localai-bot (Collaborator)

Summary

Closes #10164.

The OpenAI-compatible TTS endpoint (POST /v1/audio/speech) accepts an instructions field, but it was silently dropped at the HTTP→gRPC boundary: neither schema.TTSRequest nor the gRPC TTSRequest proto carried it, so backends could only read such a value from static YAML options (identical for every request). This blocked:

  • Per-line emotion/style (Qwen3-TTS CustomVoice, Chatterbox): same speaker, different tone per request.
  • VoiceDesign / describe-a-voice (Qwen3-TTS VoiceDesign): the instruction string is the voice, so a model config was limited to a single designed voice.

This PR plumbs a generic per-request instruction string end to end, plus an optional backend-specific params map.

Changes

  • proto (backend/backend.proto): add optional string instructions = 6 and map<string,string> params = 7 to TTSRequest.
  • schema (core/schema/localai.go): add Instructions (maps the OpenAI instructions field) and Params (LocalAI extension).
  • core (core/backend/tts.go): thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper that attaches instructions only when non-empty (so backends can fall back to YAML when unset). Forwarded from the /v1/audio/speech handler; other callers (cli, elevenlabs, realtime) pass empty values.
  • qwen-tts (backend/python/qwen-tts/backend.py): prefer the per-request instruction over the YAML instruct option (used by both mode detection and generation) and merge per-request params.
  • chatterbox (backend/python/chatterbox/backend.py): merge per-request params (coerced to float/int/bool) over YAML options into generate() kwargs.
  • docs + regenerated swagger.
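
The params handling described for the chatterbox backend could look roughly like the sketch below. It is not the code from this PR: `coerce_param` and `merge_params` are hypothetical helper names, and the option keys in the usage line are illustrative. It only shows the idea of coercing string values to bool/int/float and overlaying them on the YAML-derived options.

```python
from typing import Any, Dict, Mapping


def coerce_param(value: str) -> Any:
    """Best-effort conversion of a string param to bool, int, or float."""
    lowered = value.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # anything unrecognized stays a plain string


def merge_params(
    yaml_options: Mapping[str, Any], request_params: Mapping[str, str]
) -> Dict[str, Any]:
    """Overlay coerced per-request params on top of the YAML-derived options."""
    merged = dict(yaml_options)
    for key, raw in request_params.items():
        merged[key] = coerce_param(raw)
    return merged


# Illustrative keys only; the real generate() kwargs depend on the backend.
kwargs = merge_params(
    {"exaggeration": 0.5, "cfg_weight": 0.5},
    {"exaggeration": "0.8", "seed": "42"},
)
print(kwargs)  # {'exaggeration': 0.8, 'cfg_weight': 0.5, 'seed': 42}
```

Per-request values win over YAML values with the same key, while YAML keys the request does not mention are left untouched.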

Backward compatibility

Fully compatible: an empty instructions field falls back to the YAML option, and backends that don't support style/voice instructions simply ignore it. Values in the params map arrive as strings and are coerced by the backend.
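
A minimal sketch of the fallback rule, assuming the backend has already parsed its YAML options into a dict. `resolve_instruction` is a hypothetical name, not a function from this PR; only the `instruct` option key comes from the description above.

```python
from typing import Mapping, Optional


def resolve_instruction(
    request_instructions: Optional[str], yaml_options: Mapping[str, str]
) -> Optional[str]:
    """Return the per-request instruction if set, else the YAML `instruct` option."""
    if request_instructions:  # None and "" both count as unset
        return request_instructions
    return yaml_options.get("instruct")


options = {"instruct": "Speak cheerfully."}
print(resolve_instruction("Whisper softly.", options))  # Whisper softly.
print(resolve_instruction("", options))                 # Speak cheerfully.
print(resolve_instruction(None, {}))                    # None
```

Treating an empty string the same as a missing value is what keeps existing configs working: a request that omits `instructions` behaves exactly as it did before.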

Example

curl http://localhost:8080/v1/audio/speech -H "Content-Type: application/json" -d '{
  "model": "qwen-tts-design",
  "input": "Hello world, this is a test.",
  "instructions": "A calm, low-pitched elderly storyteller with a warm tone."
}'
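
The same request can be built from Python with only the standard library. This is a sketch under assumptions: `build_speech_request` and `synthesize` are hypothetical helpers, and the JSON key `params` for the LocalAI extension is assumed from the schema field name rather than confirmed here.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_speech_request(
    base_url: str,
    model: str,
    text: str,
    instructions: Optional[str] = None,
    params: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text}
    if instructions:
        payload["instructions"] = instructions
    if params:
        payload["params"] = params  # assumed JSON key; values are strings
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(req: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


req = build_speech_request(
    "http://localhost:8080",
    "qwen-tts-design",
    "Hello world, this is a test.",
    instructions="A calm, low-pitched elderly storyteller with a warm tone.",
)
print(req.full_url, req.get_method())
```

With a LocalAI server running, `synthesize(req, "out.wav")` would send the request; the output filename is arbitrary.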

Testing

  • New Ginkgo specs for newTTSRequest (instructions attached/omitted, params forwarded/nil); updated existing ctx-propagation specs for the new signature.
  • core/backend, core/schema, core/http/endpoints/localai test suites pass.
  • go vet and golangci-lint --new-from-merge-base=master clean (0 issues); both Python backends py_compile clean.

Generated proto bindings (pkg/grpc/proto/*) are gitignored and rebuilt by CI; Python backend_pb2 is regenerated from backend.proto at backend build time.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.

Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:

- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.

Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.

Closes #10164

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler merged commit 27e63b9 into master Jun 4, 2026
73 checks passed
@mudler deleted the worktree-tts-per-request-instructions branch June 4, 2026 09:45
@localai-bot added the enhancement (New feature or request) label Jun 10, 2026
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

TTS: support per-request instructions/style (unblocks Qwen3-TTS VoiceDesign & per-line emotion)

2 participants