Summary
When a Telegram voice message (OGG/Opus) is received, the raw binary audio data is embedded directly into the session context as `text/plain`. This causes massive token inflation: the binary data is tokenized into hundreds of thousands of garbage tokens.
Impact
A single 13-second voice note creates a ~440 KB session entry. When tokenized, this can produce 200,000–600,000 tokens of binary garbage, exceeding Claude's 200k-token context limit and causing silent delivery failures: the agent gets a 400 error, while the user sees a typing indicator but never receives a response.
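The inflation is easy to estimate: raw bytes decoded as text rarely form dictionary words, so BPE tokenizers fall back to roughly one token per byte or two. A back-of-the-envelope sketch (the 1.5 bytes/token figure is an assumption, not a measurement of Claude's tokenizer):

```python
def estimate_garbage_tokens(payload_bytes: int, bytes_per_token: float = 1.5) -> int:
    """Rough estimate: binary-as-text tokenizes at ~1-2 bytes per token,
    versus ~4 characters per token for ordinary English prose."""
    return int(payload_bytes / bytes_per_token)

# The 448,051-byte session entry from this report, at an assumed 1.5 bytes/token:
print(estimate_garbage_tokens(448_051))  # ~300k tokens, well past a 200k limit
```

At that rate a single inlined voice note blows the context budget on its own, which matches the 400k–640k totals in the logs below.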
Evidence
Session log showing repeated `prompt is too long` errors across a single day:
07:07 UTC → 501,890 tokens (max 200,000)
07:08 UTC → 482,720 tokens
08:23 UTC → 639,302 tokens
09:16 UTC → 410,635 tokens
The user message entry for a 13-second voice note:
- Session entry size: 448,051 bytes (438 KB)
- Content includes: the transcript (correct) plus the raw OGG binary embedded as `<file name="...ogg" mime="text/plain">`
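A pre-flight guard on the ingestion side could have caught this: OGG containers start with the magic bytes `OggS`, and binary payloads in general are dense in non-printable bytes. A hypothetical check (function name and threshold are illustrative, not OpenClaw's API):

```python
def looks_binary(data: bytes, threshold: float = 0.30) -> bool:
    """Flag payloads that should never be inlined as text/plain."""
    if data.startswith(b"OggS"):  # OGG container capture pattern
        return True
    if not data:
        return False
    sample = data[:4096]
    # Count bytes outside tab/newline/CR and printable ASCII.
    nonprintable = sum(1 for b in sample if b < 9 or (13 < b < 32) or b > 126)
    return nonprintable / len(sample) > threshold

print(looks_binary(b"OggS" + b"\x00" * 100))   # True  -> store as file reference
print(looks_binary(b"plain transcript text"))  # False -> safe to inline
```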
Expected Behavior
Voice messages should include only:
- The transcript text
- A file reference/path (not the binary content)
The raw audio binary should never be inlined as text in the session prompt.
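In other words, the ingestion path should emit a reference, not the payload. A minimal sketch of the intended shape (field names are assumptions, not the actual session schema):

```python
def build_voice_entry(transcript: str, audio_path: str) -> dict:
    """Session entry for an inbound voice note: the transcript is the
    prompt text; the audio is carried only as a path reference."""
    return {
        "role": "user",
        "content": transcript,          # what the model actually reads
        "attachments": [{
            "path": audio_path,         # reference only, binary stays on disk
            "mime": "audio/ogg",        # correct MIME type, not text/plain
        }],
    }

entry = build_voice_entry("thirteen seconds of speech", "/tmp/voice_note.ogg")
# The entry's size is bounded by the transcript, not by the audio payload.
```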
Environment
- OpenClaw version: 2026.1.30
- Node: v22.22.0
- Channel: Telegram (long-polling)
- Model: anthropic/claude-opus-4-5
- TTS config: `messages.tts.auto: "inbound"`
Workaround
- Enable `contextPruning` with `mode: "cache-ttl"` and `hardClear.enabled: true` to trim old tool results
- Auto-compaction helps, but cannot fix a single user message that exceeds the model limit
- Session reset (`/new`) when the context is bloated
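For reference, the pruning settings from the workaround would look roughly like this in the config file (keys are taken from the workaround above; the exact nesting is an assumption):

```json
{
  "contextPruning": {
    "mode": "cache-ttl",
    "hardClear": { "enabled": true }
  }
}
```

Note this only trims old tool results; it cannot shrink the oversized voice-note entry itself.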
Related
- Telegram: voice message silently dropped when caption exceeds limit #6068 (Telegram voice caption overflow)