fix: use UTF-16 length for Telegram stream consumer message splitting by freemanconsulting · Pull Request #11170 · NousResearch/hermes-agent

freemanconsulting · 2026-04-16T18:08:28Z

Problem

The stream consumer measured message length using Python's len() (Unicode code points), but Telegram's actual limit is in UTF-16 code units. Messages containing supplementary-plane characters (most emoji, rarer CJK ideographs, etc.), which occupy two UTF-16 code units each, could therefore exceed Telegram's 4096-character limit while still appearing to fit, resulting in:

Truncated messages cut off mid-sentence
Black square (\x00) rendering artifacts from incomplete MarkdownV2 placeholder processing
Failed editMessageText calls that silently fell back to plain text

The send() method in TelegramAdapter already correctly used utf16_len for truncate_message, but the streaming path in GatewayStreamConsumer did not.
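To make the mismatch concrete, here is a minimal sketch of a UTF-16 length helper. The body is an assumption about how a function like utf16_len could be written; the repository's actual implementation is not shown in this PR description.

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units (illustrative stand-in for the repo's helper)."""
    # Every UTF-16 code unit is 2 bytes; "utf-16-le" does not prepend a BOM.
    return len(text.encode("utf-16-le")) // 2


bmp = "漢字"      # Basic Multilingual Plane: one code unit per character
astral = "😀😀"   # supplementary plane: a surrogate pair (two units) per character

print(len(bmp), utf16_len(bmp))        # 2 2
print(len(astral), utf16_len(astral))  # 2 4
```

With len() both strings look the same size, but the emoji string costs twice as much against a limit counted in UTF-16 code units.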

Solution

Three changes designed for easy adoption by any platform adapter:

BasePlatformAdapter.message_len_fn — new property that returns len by default. Platforms that measure differently (like Telegram) override this.
TelegramAdapter.message_len_fn — returns utf16_len so all downstream consumers get correct lengths automatically.
GatewayStreamConsumer — uses adapter.message_len_fn instead of bare len for:
- _safe_limit calculation
- overflow detection
- truncate_message calls (passes len_fn)
- split point calculation (via _custom_unit_to_cp)
- fallback final send chunking
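The shape of the change can be sketched roughly as follows. Only the names BasePlatformAdapter, TelegramAdapter, message_len_fn, and utf16_len come from the PR; the MAX_MESSAGE_LENGTH attribute, the overflows() helper, and all method bodies are assumptions made for illustration.

```python
from typing import Callable


def utf16_len(text: str) -> int:
    # Illustrative helper: number of UTF-16 code units in text.
    return len(text.encode("utf-16-le")) // 2


class BasePlatformAdapter:
    MAX_MESSAGE_LENGTH = 4096  # assumed attribute name, for this sketch only

    @property
    def message_len_fn(self) -> Callable[[str], int]:
        # Default: platforms that count code points need no override.
        return len


class TelegramAdapter(BasePlatformAdapter):
    @property
    def message_len_fn(self) -> Callable[[str], int]:
        # Telegram counts UTF-16 code units.
        return utf16_len


def overflows(adapter: BasePlatformAdapter, text: str) -> bool:
    """Hypothetical consumer-side overflow check using the adapter's measure."""
    return adapter.message_len_fn(text) > adapter.MAX_MESSAGE_LENGTH


text = "😀" * 2200  # 2200 code points, 4400 UTF-16 code units
print(overflows(BasePlatformAdapter(), text))  # False
print(overflows(TelegramAdapter(), text))      # True
```

Returning a callable from a property keeps the consumer platform-agnostic: it asks the adapter how to measure text and never imports a Telegram-specific helper.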

Testing

All 58 stream consumer tests pass
All 3009 gateway tests pass (7 pre-existing flaky failures unrelated to this change)
Backwards compatible: adapters that don't override message_len_fn get len behavior

The stream consumer measured message length using Python's len() (Unicode code points), but Telegram's actual limit is in UTF-16 code units. This caused messages with supplementary-plane characters (most emoji, rarer CJK ideographs, etc.) to exceed Telegram's 4096-character limit, resulting in truncated messages with formatting artifacts.

Changes:
- Add message_len_fn property to BasePlatformAdapter (defaults to len)
- Override in TelegramAdapter to return utf16_len
- Stream consumer uses adapter.message_len_fn for:
  - safe_limit calculation
  - overflow detection
  - truncate_message calls
  - split point calculation (via _custom_unit_to_cp)
  - fallback final send chunking

Fixes truncated messages with black square artifacts on Telegram when the model generates responses containing characters outside the Basic Multilingual Plane, which take two UTF-16 code units each.

New TestUtf16OverflowDetection class covers two scenarios:
- test_emoji_text_exceeding_utf16_limit_triggers_overflow_split: feeds 2200 emoji codepoints (4400 UTF-16 units), which is under Telegram's codepoint-equivalent limit but over its UTF-16 limit. Asserts truncate_message was called with len_fn=utf16_len, confirming the consumer detected the overflow.
- test_codepoint_only_adapter_falls_back_to_len: documents that adapters which don't subclass BasePlatformAdapter (or test MagicMocks) fall back to plain len for backwards compat.

The contributor's PR shipped no tests for the UTF-16 path.
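The numbers in the first scenario can be checked by hand, along with the split point a consumer would need. The units_to_cp function below is a hypothetical stand-in for a helper like _custom_unit_to_cp, whose real implementation is not shown here. It assumes every code point costs at least one unit and that prefix length never decreases as the prefix grows.

```python
from typing import Callable


def utf16_len(text: str) -> int:
    # Illustrative helper: number of UTF-16 code units in text.
    return len(text.encode("utf-16-le")) // 2


def units_to_cp(text: str, budget: int, len_fn: Callable[[str], int]) -> int:
    """Largest code-point index i such that len_fn(text[:i]) <= budget."""
    budget = max(budget, 0)
    if len_fn is len:
        # Short-circuit: units and code points coincide.
        return min(budget, len(text))
    # Binary search over prefix lengths. A prefix that fits in `budget`
    # units cannot hold more than `budget` code points.
    lo, hi = 0, min(len(text), budget)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len_fn(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


text = "😀" * 2200
print(len(text), utf16_len(text))          # 2200 4400
print(units_to_cp(text, 4096, utf16_len))  # 2048
print(units_to_cp(text, 4096, len))        # 2200
```

Slicing at index 2048 yields a first chunk of exactly 4096 UTF-16 code units, whereas a code-point-based split would have tried to send all 2200 emoji in one message.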

teknium1 · 2026-05-10T23:21:15Z

Merged via salvage PR #23455. Your commit was cherry-picked onto current main with your authorship preserved. During conflict resolution we cleaned up the duplicated try/except+delayed-import dance into a single module-level import + isinstance ternary, dropped the 'if _len_fn is not len' micro-opt guards around _custom_unit_to_cp (the helper short-circuits internally), added an inline comment at the buffer_threshold trigger noting it's intentionally codepoint-based (debounce heuristic, not a platform-limit check), and added two regression tests for the UTF-16 path. Thanks for catching this!
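The "isinstance ternary" mentioned above might look something like this sketch. The resolve_len_fn name and the class bodies are assumptions for illustration, not the merged code. The guard matters for test doubles: a bare MagicMock returns another MagicMock for any attribute, which would not be a usable length function.

```python
from typing import Any, Callable
from unittest.mock import MagicMock


def utf16_len(text: str) -> int:
    # Illustrative helper: number of UTF-16 code units in text.
    return len(text.encode("utf-16-le")) // 2


class BasePlatformAdapter:
    @property
    def message_len_fn(self) -> Callable[[str], int]:
        return len


class TelegramAdapter(BasePlatformAdapter):
    @property
    def message_len_fn(self) -> Callable[[str], int]:
        return utf16_len


def resolve_len_fn(adapter: Any) -> Callable[[str], int]:
    """Hypothetical helper: trust message_len_fn only on real adapters."""
    return adapter.message_len_fn if isinstance(adapter, BasePlatformAdapter) else len


print(resolve_len_fn(TelegramAdapter()) is utf16_len)  # True
print(resolve_len_fn(BasePlatformAdapter()) is len)    # True
print(resolve_len_fn(MagicMock()) is len)              # True
```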


alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), platform/telegram (Telegram bot adapter), and comp/gateway (Gateway runner, session dispatch, delivery) on Apr 25, 2026

teknium1 mentioned this pull request May 10, 2026

fix(stream-consumer): use UTF-16 length for Telegram message splitting (salvage of #11170) #23455

Merged

teknium1 closed this May 10, 2026

bestsleepit-creator mentioned this pull request May 15, 2026

fix(ci): map zccyman noreply email in AUTHOR_MAP #26295

Closed

freemanconsulting wants to merge 1 commit into NousResearch:main from Freeman-Consulting:fix/telegram-stream-utf16-length

3 participants
