
feat(stt): add xAI Grok STT provider #12120

Closed
Julientalbot wants to merge 3 commits into NousResearch:main from Julientalbot:feat/xai-stt-provider


@Julientalbot

@Julientalbot Julientalbot commented Apr 18, 2026

Contributor

Summary

Add xAI as a sixth STT provider using the POST /v1/stt endpoint with multipart/form-data.

Features

  • Inverse Text Normalization (ITN) via format=true (default on)
  • Optional diarization via stt.xai.diarize config
  • Language configuration (default: fr, overridable via config or HERMES_LOCAL_STT_LANGUAGE env)
  • Custom base URL (XAI_STT_BASE_URL env or stt.xai.base_url config)
  • Full provider integration: explicit config + auto-detect fallback chain
  • Consistent error handling matching existing provider patterns

Auto-detect priority

local > groq > openai > mistral > xai > none
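As a rough illustration of the fallback chain (local, groq, openai, mistral, xai, then none), here is a minimal sketch. `pick_provider` and the `available` mapping are hypothetical names made up for this example; the real `_get_provider` in tools/transcription_tools.py decides availability its own way (API keys, local install, and so on) and may differ.

```python
from typing import Mapping

# Priority order as stated in the PR description.
AUTO_DETECT_ORDER = ("local", "groq", "openai", "mistral", "xai")


def pick_provider(available: Mapping[str, bool]) -> str:
    """Return the first usable provider in priority order, or "none".

    `available` is a hypothetical mapping of provider name -> usable flag;
    how each flag is computed is outside the scope of this sketch.
    """
    for name in AUTO_DETECT_ORDER:
        if available.get(name, False):
            return name
    return "none"


print(pick_provider({"mistral": True, "xai": True}))  # mistral
print(pick_provider({"xai": True}))                   # xai
print(pick_provider({}))                              # none
```

With this ordering, xAI is only selected when no higher-priority provider is usable, which matches the "mistral preferred over xai" test case listed in the commit message.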

Configuration

stt:
  provider: xai
  xai:
    language: fr
    format: true        # Inverse Text Normalization
    diarize: false      # Speaker diarization
    base_url: https://api.x.ai/v1   # optional override

Testing

  • 17 new unit tests covering: transcription, error handling, provider selection, dispatch
  • All 89 tests passing (existing + new)

Files changed

  • tools/transcription_tools.py — xAI provider implementation (+120 lines)
  • tests/tools/test_transcription_tools.py — unit tests (+256 lines)

xAI STT API reference

  • Endpoint: POST https://api.x.ai/v1/stt
  • Auth: Bearer token via XAI_API_KEY
  • Input: multipart/form-data (file + optional language, format, diarize)
  • Output: {"text": "...", "language": "fr", "duration": 3.2}
  • 21 languages supported, ~5% WER (best-in-class entity recognition)
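To make the request shape concrete, here is a small sketch that assembles the URL, auth header, and form fields from the config block described above. `build_xai_stt_request` is a hypothetical helper, not the PR's `_transcribe_xai`; in particular, sending the booleans as the strings "true"/"false" and always including `diarize` are assumptions about the wire encoding, and the env override via XAI_STT_BASE_URL is left out.

```python
from typing import Any, Dict, Tuple

DEFAULT_BASE_URL = "https://api.x.ai/v1"


def build_xai_stt_request(
    api_key: str, xai_config: Dict[str, Any]
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, headers, form_fields) for a POST to the STT endpoint.

    Hypothetical helper for illustration only. The audio file itself is
    attached separately as the multipart "file" part.
    """
    base_url = str(xai_config.get("base_url") or DEFAULT_BASE_URL).rstrip("/")
    url = f"{base_url}/stt"
    headers = {"Authorization": f"Bearer {api_key}"}

    fields: Dict[str, str] = {}
    language = xai_config.get("language")
    if language:
        fields["language"] = str(language)
    # Assumption: boolean flags are encoded as lowercase strings.
    fields["format"] = "true" if xai_config.get("format", True) else "false"
    fields["diarize"] = "true" if xai_config.get("diarize", False) else "false"
    return url, headers, fields


url, headers, fields = build_xai_stt_request("sk-example", {"language": "fr"})
print(url)     # https://api.x.ai/v1/stt
print(fields)  # {'language': 'fr', 'format': 'true', 'diarize': 'false'}
```

If the `requests` library is available, the pieces could then be sent with something like `requests.post(url, headers=headers, data=fields, files={"file": fh}, timeout=120)`, where `fh` is the opened audio file.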

Add xAI as a sixth STT provider using the POST /v1/stt endpoint.

Features:
- Multipart/form-data upload to api.x.ai/v1/stt
- Inverse Text Normalization (ITN) via format=true (default)
- Optional diarization via config (stt.xai.diarize)
- Language configuration (default: fr, overridable via config or env)
- Custom base_url support (XAI_STT_BASE_URL env or stt.xai.base_url)
- Full provider integration: explicit config + auto-detect fallback chain
- Consistent error handling matching existing provider patterns

Config (config.yaml):
  stt:
    provider: xai
    xai:
      language: fr
      format: true
      diarize: false
      base_url: https://api.x.ai/v1   # optional override

Auto-detect priority: local > groq > openai > mistral > xai > none

Covers:
- _transcribe_xai: no key, successful transcription, whitespace stripping,
  API error (HTTP 400), empty transcript, permission error, network error,
  language/format params sent, custom base_url, diarize config
- _get_provider xAI: key set, no key, auto-detect after mistral,
  mistral preferred over xai, no key returns none
- transcribe_audio xAI dispatch: dispatch, default model (grok-stt),
  model override
@Julientalbot

Contributor Author

CI failures on this PR are unrelated to its scope — both are pre-existing regressions on main:

1. test_no_single_field_categories fails because hermes_cli/config.py:775 defines a code_execution category with a single field (mode), violating the assertion count >= 2. Fix options: merge code_execution into another category via the web_server merge map, add a second field, or relax the test.

2. test_config_version_matches_current_schema fails because hermes_cli/config.py:805 bumped _config_version to 19 but tests/tools/test_browser_camofox_state.py:67 still hardcodes == 18. Fix: bump the test assertion to 19.

Both regressions pre-date this PR and affect any currently open PR against main. This PR only touches tools/transcription_tools.py and its test file — scope is fully independent.

Opening a separate focused PR to fix these two so main goes green again. Happy to rebase this one once that merges.

@cetej cetej left a comment


Solid PR — clean implementation following the existing provider pattern (mirrors _transcribe_mistral well), thorough test coverage (17 tests covering happy path, error handling, env/config fallback, dispatch), and correct multipart upload semantics. Verified tools/xai_http.hermes_xai_user_agent() exists upstream so the import resolves.

CI failures (test_no_single_field_categories, test_config_version_matches_current_schema) are pre-existing on main and orthogonal to this change — your separate fix PR plan is the right call.

A few small nits, none blocking:

  1. Hardcoded language: "fr" default in _transcribe_xai (tools/transcription_tools.py)
    The module already exports DEFAULT_LOCAL_STT_LANGUAGE = "en". The "fr" literal looks like a locale leak — consider:

    language = str(
        xai_config.get("language")
        or os.getenv("HERMES_LOCAL_STT_LANGUAGE")
        or DEFAULT_LOCAL_STT_LANGUAGE
    ).strip()
  2. Redundant default=True on the format flag:

    use_format = is_truthy_value(xai_config.get("format", True), default=True)

    .get("format", True) already returns True when the key is missing, so default=True is unreachable. Either drop the dict default or drop default=True — pick one source of truth.

  3. Stale comment in _get_provider — the auto-detect comment still reads "local > groq > openai > mistral"; worth appending > xai to match the new behavior.
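To illustrate nit 2, a small sketch of the "single source of truth" variant where the dict default is kept and `default=True` is dropped. The `is_truthy_value` below is a stand-in written for this example; the project's actual helper may accept different inputs or behave differently.

```python
def is_truthy_value(value, default=False):
    """Stand-in normalizer (assumption, not the project's helper)."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def use_format(xai_config):
    # The dict default alone decides what happens when "format" is missing.
    return is_truthy_value(xai_config.get("format", True))


print(use_format({}))                   # True
print(use_format({"format": "false"}))  # False
print(use_format({"format": "no"}))     # False
```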

Optional cleanups:

  • _transcribe_xai(file_path, model_name) accepts model_name but never references it; the dispatch comment says "pass through for logging" but logger.info doesn't include it. Either log it or drop the parameter.
  • Minor doc inconsistency: docstring says "26 languages", PR description says "21 languages".
  • Error masking is asymmetric vs. Mistral (which only returns type(e).__name__); your version exposes the full exception, which is actually better for debugging — just flagging the inconsistency.

Security check is clean: API key from env, Bearer in header (not URL), no secret leakage in logs (only lang/duration/char count), reasonable 120s timeout.

Approving — happy to see this land once the three one-line nits are addressed (or even as-is, your call).

- Replace hardcoded 'fr' default with DEFAULT_LOCAL_STT_LANGUAGE ('en')
  — removes locale leak, matches other providers
- Drop redundant default=True on is_truthy_value (dict .get already defaults)
- Update auto-detect comment to include 'xai' in the chain
- Fix docstring: 21 languages (match PR body + actual xAI API)
- Update test_sends_language_and_format to set HERMES_LOCAL_STT_LANGUAGE=fr
  explicitly, since default is no longer 'fr'

All 18 xAI STT tests pass locally.
@Julientalbot

Contributor Author

Thanks for the thorough review @cetej! Pushed bd40bac addressing all three nits:

1. Hardcoded fr default → replaced with DEFAULT_LOCAL_STT_LANGUAGE (en). Locale leak fixed, now matches the pattern used by _transcribe_local_command. Updated test_sends_language_and_format to explicitly set HERMES_LOCAL_STT_LANGUAGE=fr via monkeypatch (so the test exercises the override chain rather than depending on a locale default).

2. Redundant default=True → dropped. .get("format", True) is now the single source of truth; is_truthy_value just normalizes config strings ("false"/"no"/etc).

3. Stale auto-detect comment → updated to local > groq > openai > mistral > xai.
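For reference, the override chain from point 1 (config value, then HERMES_LOCAL_STT_LANGUAGE, then the module default) can be sketched as below. `resolve_language` and its injectable `env` argument are hypothetical, added here only so the chain can be exercised without touching the real environment; the PR resolves the language inline in `_transcribe_xai`.

```python
import os
from typing import Mapping, Optional

DEFAULT_LOCAL_STT_LANGUAGE = "en"  # value quoted in the review


def resolve_language(
    xai_config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Config wins over the env var, which wins over the module default."""
    env = os.environ if env is None else env
    return str(
        xai_config.get("language")
        or env.get("HERMES_LOCAL_STT_LANGUAGE")
        or DEFAULT_LOCAL_STT_LANGUAGE
    ).strip()


print(resolve_language({}, {}))                                    # en
print(resolve_language({}, {"HERMES_LOCAL_STT_LANGUAGE": "fr"}))   # fr
print(resolve_language({"language": "de"},
                       {"HERMES_LOCAL_STT_LANGUAGE": "fr"}))       # de
```

The second call corresponds to what the updated test_sends_language_and_format does by setting the env var through monkeypatch.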

Bonus: fixed the docstring inconsistency (21 languages, matching the PR body and actual xAI API).

All 18 xAI STT tests pass locally. I kept the model_name parameter for signature consistency with the other _transcribe_* functions (they all take (file_path, model_name), even where a provider does not use it); I can drop it if you prefer.

The pre-existing CI failures should clear once #12139 merges.

@Julientalbot

Contributor Author

Closing — xAI media provider work is being consolidated through @Jaaneek's #10600 line (TTS in #10783, video/image/x_search in #10786). An STT entry is a natural follow-up to that track rather than a separate PR. Happy to revisit once the provider upgrades settle.

@Julientalbot

Contributor Author

Hi @cetej and NousResearch team,

Reopening this PR. Background: I had closed it a few weeks ago, anticipating consolidation through @Jaaneek's broader xAI media provider track (#10600), but Jaaneek is currently on leave until mid-May.

Rather than letting the xAI STT contribution sit idle for another month, I'd like to land this independently so the Hermes community can benefit from native Grok STT support now. The code is review-clean (your nits addressed in bd40bac), tests pass, and the implementation follows existing provider patterns without core file modifications.

I'm also in active discussion with the xAI team about deeper Hermes-Grok integration — having STT merged upstream strengthens that relationship and gives users immediate value while the broader media track matures.

Happy to address any fresh feedback promptly. Thanks for considering.
