[Bug] Voice STT: empty moonshine transcripts passed as raw JSON to LLM, clogging serialized processing queue #84660

@Joel-Claw

Summary

When using moonshine-tiny-en for Discord voice STT, empty/noisy transcripts are passed as raw JSON strings to the LLM instead of being filtered out. This wastes ~4 seconds and ~24k input tokens per empty segment and clogs the serialized processing queue, making the bot appear unresponsive in voice.

Reproduction

  1. Configure OpenClaw with voice.mode = "stt-tts" and moonshine-tiny-en as the STT model
  2. Join a voice channel with background noise or short utterances
  3. Observe that short/noisy audio segments produce empty transcripts: {"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}
  4. These empty JSON strings are sent to the LLM as "transcripts" instead of being filtered
  5. The LLM returns NO_REPLY (correct behavior), but each call wastes ~4s and ~24k tokens
  6. The serialized processing queue (entry.processingQueue) blocks until each call completes
  7. With ~35% of segments being empty JSON, the pipeline appears to "stop" responding

Root Cause

In manager.runtime, transcribeVoiceAudio() calls normalizeOptionalString() on the STT result, which returns undefined for empty strings. However, the sherpa-onnx CLI output includes the entire JSON object on the last line, and the mediaUnderstanding.transcribeAudioFile() result appears to include the full JSON string as text even when the "text" field within it is empty.

The check at line ~1441 (if (!transcript)) catches undefined but NOT the full JSON string with an empty "text" field. So {"text": "", ...} passes through as a non-empty string transcript.
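One way to close this gap is to unwrap the JSON before the emptiness check. The sketch below is illustrative only: `extractTranscriptText` is a hypothetical helper, not an existing OpenClaw function, and it assumes the STT result is either plain text or a single JSON object with a `text` field, as in the sherpa-onnx output quoted in this issue.

```typescript
// Hypothetical helper; not part of OpenClaw's actual source.
// Returns the spoken text from an STT result, or undefined if there is none.
function extractTranscriptText(raw: string | undefined): string | undefined {
  const trimmed = raw?.trim();
  if (!trimmed) return undefined;

  // Plain-text transcripts pass through unchanged.
  if (!trimmed.startsWith("{")) return trimmed;

  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed === "object" && parsed !== null && "text" in parsed) {
      const text = (parsed as { text: unknown }).text;
      if (typeof text !== "string") return undefined;
      const spoken = text.trim();
      // An empty "text" field means there is nothing to send to the LLM.
      return spoken.length > 0 ? spoken : undefined;
    }
  } catch {
    // Not valid JSON: treat it as ordinary transcript text.
  }
  return trimmed;
}
```

With a helper like this in front of the existing `if (!transcript)` check, the empty JSON object would resolve to `undefined` and the segment would be skipped like any other empty transcript.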

Evidence

Session logs show:

Voice transcript from speaker "[CK] Alex the 'guin":
{"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}

100% of NO_REPLY responses (8 out of 8 in a recent session) were triggered by these empty JSON transcripts. The bot responded correctly to all real transcripts but was blocked during empty JSON processing.

52 segment files accumulated in 10 minutes. Only 10 TTS outputs were generated. The pipeline was processing empty JSON ~35% of the time.
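As a rough back-of-the-envelope estimate, the figures above can be combined as follows. This assumes the ~35% empty rate, the ~4 s latency, and the ~24k input tokens per call all apply to the same 10-minute window, which the logs do not strictly confirm; the function name is made up for illustration.

```typescript
interface WasteEstimate {
  emptySegments: number;
  blockedSeconds: number;
  wastedInputTokens: number;
}

// Rough cost of sending empty segments to the LLM (all inputs are approximations).
function estimateEmptySegmentWaste(
  totalSegments: number,
  emptyFraction: number,
  secondsPerCall: number,
  tokensPerCall: number,
): WasteEstimate {
  const emptySegments = Math.round(totalSegments * emptyFraction);
  return {
    emptySegments,
    blockedSeconds: emptySegments * secondsPerCall,
    wastedInputTokens: emptySegments * tokensPerCall,
  };
}

// Figures taken from this issue: 52 segments, ~35% empty, ~4 s and ~24k tokens per call.
const estimate = estimateEmptySegmentWaste(52, 0.35, 4, 24000);
console.log(estimate);
```

Under those assumptions this works out to about 18 empty segments, roughly 72 seconds of blocked queue time, and around 432k wasted input tokens per 10 minutes.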

Expected Behavior

  1. When the STT model returns "text": "" (or equivalent empty transcript), the segment should be skipped entirely — no LLM call needed
  2. The serialized processing queue should have a max depth or stale-segment discard mechanism to prevent pipeline stalls
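The second point could take a shape like the sketch below: a queue with a maximum depth that also discards segments that have waited too long. The class name, its fields, and the thresholds are all hypothetical; the issue only identifies `entry.processingQueue`, and the real queue may be structured quite differently (for example as a promise chain).

```typescript
// Hypothetical types and names; the issue only mentions entry.processingQueue.
interface QueuedSegment<T> {
  payload: T;
  enqueuedAtMs: number;
}

class BoundedSegmentQueue<T> {
  private items: QueuedSegment<T>[] = [];
  private readonly maxDepth: number;
  private readonly maxAgeMs: number;
  dropped = 0;

  constructor(maxDepth: number, maxAgeMs: number) {
    this.maxDepth = maxDepth;
    this.maxAgeMs = maxAgeMs;
  }

  // Adds a segment; when the queue is over capacity, the oldest segment is discarded.
  push(payload: T, nowMs: number): void {
    this.items.push({ payload, enqueuedAtMs: nowMs });
    while (this.items.length > this.maxDepth) {
      this.items.shift();
      this.dropped++;
    }
  }

  // Returns the next segment that is still fresh, skipping (and counting) stale ones.
  take(nowMs: number): T | undefined {
    while (this.items.length > 0) {
      const next = this.items.shift()!;
      if (nowMs - next.enqueuedAtMs <= this.maxAgeMs) return next.payload;
      this.dropped++;
    }
    return undefined;
  }

  get depth(): number {
    return this.items.length;
  }
}
```

Dropping the oldest segment rather than the newest is a design choice that favors responsiveness in live voice: a reply to something said 30 seconds ago is usually less useful than a reply to what was just said.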

Environment

  • OpenClaw 2026.5.18
  • sherpa-onnx moonshine-tiny-en (int8)
  • Discord voice mode: stt-tts
  • Platform: Linode 4 vCPU, 8GB RAM

Workaround

Reducing captureSilenceGraceMs (from 1500 to 1000) and timeoutSeconds (from 300 to 120) helps marginally, as does periodically cleaning up stale /tmp/openclaw/discord-voice-*/segment.wav files. But the core issue remains: empty transcripts should be filtered before reaching the LLM.
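For the periodic cleanup, a small Node script along these lines could be run on a timer. It is a sketch, not something shipped with OpenClaw: the function name and the age threshold are assumptions, and only the `/tmp/openclaw/discord-voice-*/segment.wav` layout comes from this issue.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Deletes segment.wav files older than maxAgeMs under <root>/discord-voice-*/.
// Returns the number of files removed.
function removeStaleSegments(
  root: string,
  maxAgeMs: number,
  nowMs: number = Date.now(),
): number {
  if (!fs.existsSync(root)) return 0;
  let removed = 0;
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith("discord-voice-")) continue;
    const segmentPath = path.join(root, entry.name, "segment.wav");
    try {
      const stats = fs.statSync(segmentPath);
      if (nowMs - stats.mtimeMs > maxAgeMs) {
        fs.unlinkSync(segmentPath);
        removed++;
      }
    } catch {
      // File missing or removed concurrently: nothing to clean up here.
    }
  }
  return removed;
}
```

For example, `removeStaleSegments("/tmp/openclaw", 5 * 60 * 1000)` would remove segment files untouched for more than five minutes; the five-minute threshold is arbitrary and should be set above the longest expected processing delay.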

    Labels

    • P1: High-priority user-facing bug, regression, or broken workflow.
    • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
    • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
