fix: use UTF-16 length for Telegram stream consumer message splitting #11170

Closed
freemanconsulting wants to merge 1 commit into NousResearch:main from Freeman-Consulting:fix/telegram-stream-utf16-length

@freemanconsulting

Contributor

Problem

The stream consumer measured message length using Python's len() (Unicode code points), but Telegram's actual limit is in UTF-16 code units. Characters outside the Basic Multilingual Plane (most emoji, rare CJK extension ideographs, etc.) count as one code point but two UTF-16 code units, so messages containing them could exceed Telegram's 4096-unit limit while still appearing to fit, resulting in:

  • Truncated messages cut off mid-sentence
  • Black square (\x00) rendering artifacts from incomplete MarkdownV2 placeholder processing
  • Failed editMessageText calls that silently fell back to plain text

The send() method in TelegramAdapter already correctly used utf16_len for truncate_message, but the streaming path in GatewayStreamConsumer did not.
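The gap between the two measures can be shown with a minimal sketch. The helper below is an illustrative stand-in for the project's utf16_len, not its actual implementation; it assumes only that a code point above U+FFFF needs a surrogate pair and therefore two UTF-16 code units.

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units: 2 for code points above U+FFFF, otherwise 1.

    Illustrative stand-in; the real utf16_len in the codebase may differ.
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


sample = "hi \U0001F600"          # "hi " plus one emoji outside the BMP
code_points = len(sample)         # Python counts code points
utf16_units = utf16_len(sample)   # Telegram's limit is counted in these units

bmp_cjk = "\u65e5\u672c\u8a9e"    # common CJK text sits inside the BMP
bmp_cjk_units = utf16_len(bmp_cjk)
```

Here len() reports 4 while the UTF-16 count is 5, so a long emoji-heavy message can pass a len()-based check and still be over the limit. Common CJK text measures the same either way because it lies inside the Basic Multilingual Plane.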

Solution

Three changes designed for easy adoption by any platform adapter:

  1. BasePlatformAdapter.message_len_fn — new property that returns len by default. Platforms that measure differently (like Telegram) override this.

  2. TelegramAdapter.message_len_fn — returns utf16_len so all downstream consumers get correct lengths automatically.

  3. GatewayStreamConsumer — uses adapter.message_len_fn instead of bare len for:

    • _safe_limit calculation
    • overflow detection
    • truncate_message calls (passes len_fn)
    • split point calculation (via _custom_unit_to_cp)
    • fallback final send chunking
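The three changes above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: the class bodies are stripped down, custom_unit_to_cp is a guess at what a helper like _custom_unit_to_cp might do (convert a budget in platform units into a code-point index), and split_for_platform is a hypothetical name introduced only for this example.

```python
from typing import Callable, List

LenFn = Callable[[str], int]


def utf16_len(text: str) -> int:
    # Illustrative stand-in: 2 units for code points above U+FFFF, else 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


class BasePlatformAdapter:
    @property
    def message_len_fn(self) -> LenFn:
        # Default: platforms that count code points need no override.
        return len


class TelegramAdapter(BasePlatformAdapter):
    @property
    def message_len_fn(self) -> LenFn:
        # Telegram measures message length in UTF-16 code units.
        return utf16_len


def custom_unit_to_cp(text: str, budget: int, len_fn: LenFn) -> int:
    """Largest code-point index i such that len_fn(text[:i]) <= budget.

    Hypothetical helper; assumes len_fn never shrinks as the prefix grows.
    """
    budget = max(budget, 0)
    if len_fn is len:
        return min(budget, len(text))  # short-circuit: units are code points
    lo, hi = 0, len(text)
    while lo < hi:  # binary search over prefix lengths
        mid = (lo + hi + 1) // 2
        if len_fn(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


def split_for_platform(text: str, adapter: BasePlatformAdapter, limit: int) -> List[str]:
    """Split text into chunks that each fit within limit platform units."""
    len_fn = adapter.message_len_fn
    chunks: List[str] = []
    while text:
        cut = custom_unit_to_cp(text, limit, len_fn)
        if cut == 0:
            raise ValueError("limit is smaller than a single character")
        chunks.append(text[:cut])
        text = text[cut:]
    return chunks


emoji_text = "\U0001F600" * 5000
telegram_chunks = split_for_platform(emoji_text, TelegramAdapter(), 4096)
default_chunks = split_for_platform(emoji_text, BasePlatformAdapter(), 4096)
```

With the Telegram adapter, every chunk stays within 4096 UTF-16 units. With the default adapter, the same text is cut at 4096 code points, which is 8192 UTF-16 units for this input. Keeping the measure on the adapter means the consumer needs no platform-specific branches.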

Testing

  • All 58 stream consumer tests pass
  • The 3009 gateway tests pass apart from 7 pre-existing flaky failures unrelated to this change
  • Backwards compatible: adapters that don't override message_len_fn get len behavior

The stream consumer measured message length using Python's len() (Unicode
code points), but Telegram's actual limit is in UTF-16 code units. This
caused messages with characters outside the Basic Multilingual Plane (most
emoji, rare CJK extension ideographs, etc.) to exceed Telegram's 4096-unit
limit, resulting in truncated messages with formatting artifacts.

Changes:
- Add message_len_fn property to BasePlatformAdapter (defaults to len)
- Override in TelegramAdapter to return utf16_len
- Stream consumer uses adapter.message_len_fn for:
  - safe_limit calculation
  - overflow detection
  - truncate_message calls
  - split point calculation (via _custom_unit_to_cp)
  - fallback final send chunking

Fixes truncated messages with black square artifacts on Telegram when
the model generates responses containing characters outside the Basic
Multilingual Plane, which take two UTF-16 code units each.
alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), platform/telegram (Telegram bot adapter), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 25, 2026
teknium1 added a commit that referenced this pull request May 10, 2026
New TestUtf16OverflowDetection class covers two scenarios:
- test_emoji_text_exceeding_utf16_limit_triggers_overflow_split: feeds
  2200 emoji codepoints (4400 UTF-16 units) — under Telegram's
  codepoint-equivalent limit but over its UTF-16 limit. Asserts
  truncate_message was called with len_fn=utf16_len, confirming the
  consumer detected the overflow.
- test_codepoint_only_adapter_falls_back_to_len: documents that
  adapters which don't subclass BasePlatformAdapter (or test MagicMocks)
  fall back to plain len for backwards compat.

The contributor's PR shipped no tests for the UTF-16 path.
@teknium1

Contributor

Merged via salvage PR #23455. Your commit was cherry-picked onto current main with your authorship preserved. During conflict resolution we:

  • replaced the duplicated try/except + delayed-import dance with a single module-level import and an isinstance ternary
  • dropped the 'if _len_fn is not len' micro-opt guards around _custom_unit_to_cp (the helper short-circuits internally)
  • added an inline comment at the buffer_threshold trigger noting that it is intentionally codepoint-based (a debounce heuristic, not a platform-limit check)
  • added two regression tests for the UTF-16 path

Thanks for catching this!
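The isinstance ternary and the two regression scenarios can be illustrated with a small sketch. The names resolve_len_fn and overflows are hypothetical and the adapter classes are reduced to the one property; the real consumer and tests check more than this (for instance, that truncate_message receives len_fn=utf16_len).

```python
from unittest.mock import MagicMock


def utf16_len(text: str) -> int:
    # Illustrative stand-in: 2 units for code points above U+FFFF, else 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


class BasePlatformAdapter:
    @property
    def message_len_fn(self):
        return len


class TelegramAdapter(BasePlatformAdapter):
    @property
    def message_len_fn(self):
        return utf16_len


def resolve_len_fn(adapter):
    # Only trust message_len_fn on real adapters. A bare MagicMock would hand
    # back another mock instead of a usable length function, so fall back to len.
    return adapter.message_len_fn if isinstance(adapter, BasePlatformAdapter) else len


def overflows(text: str, adapter, limit: int = 4096) -> bool:
    # Hypothetical overflow check measured in the adapter's own units.
    return resolve_len_fn(adapter)(text) > limit


emoji_text = "\U0001F600" * 2200   # 2200 code points, 4400 UTF-16 units

telegram_overflow = overflows(emoji_text, TelegramAdapter())
mock_overflow = overflows(emoji_text, MagicMock())
```

The Telegram adapter reports an overflow because 4400 units exceeds 4096, while the mock adapter falls back to len and sees only 2200 code points, which matches the backwards-compatible behavior the second test documents.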

teknium1 closed this on May 10, 2026
JZKK720 pushed a commit to JZKK720/hermes-agent that referenced this pull request May 11, 2026
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026