
Telegram media delivery: albums silently degrade, captions leak into streamed text, and per-item captions aren't supported #9291

@MidnightLychee

Summary

When an agent response combines prose text with MEDIA: tags on Telegram
(in streaming mode), three issues surface in the delivery pipeline. I've
prepared a patch and will open a PR shortly.

Repro

Trigger an agent response shaped like this (via the Telegram gateway,
streaming mode enabled):

Here is a summary of the catalog.

MEDIA:/path/item_a_view1.png
Item A — view 1
MEDIA:/path/item_a_view2.png
Item A — view 2
MEDIA:/path/item_b_view1.png
Item B — view 1
MEDIA:/path/item_b_view2.png
Item B — view 2

MEDIA:/path/featured.png
Featured item

Expected structure after parsing:

  • 1 TextBlock: "Here is a summary of the catalog."
  • 1 MediaGroupBlock: 4 items, each with its own caption
  • 1 ImageBlock: "Featured item"
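
A parser that yields this structure could look roughly like the sketch below. It is not the hermes-agent implementation: the block names come from this issue, but their fields, the MediaItem helper, and parse_blocks() are hypothetical. It assumes that a non-blank line after a MEDIA: line is that item's caption, that a blank line after caption text closes the current group, and that a group of one item becomes an ImageBlock. The sample captions use a plain hyphen for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

MEDIA_PREFIX = "MEDIA:"

@dataclass
class TextBlock:
    text: str

@dataclass
class MediaItem:  # hypothetical helper, not named in the issue
    path: str
    caption: Optional[str] = None

@dataclass
class ImageBlock:
    path: str
    caption: Optional[str] = None

@dataclass
class MediaGroupBlock:
    items: List[MediaItem] = field(default_factory=list)

Block = Union[TextBlock, ImageBlock, MediaGroupBlock]

def parse_blocks(response: str) -> List[Block]:
    """Split a response into text, single-image, and album blocks."""
    blocks: List[Block] = []
    prose: List[str] = []        # pending prose lines
    group: List[MediaItem] = []  # media items of the group being built

    def flush_prose() -> None:
        text = "\n".join(prose).strip()
        if text:
            blocks.append(TextBlock(text))
        prose.clear()

    def flush_group() -> None:
        if len(group) == 1:
            blocks.append(ImageBlock(group[0].path, group[0].caption))
        elif group:
            blocks.append(MediaGroupBlock(list(group)))
        group.clear()

    for raw in response.splitlines():
        line = raw.strip()
        if line.startswith(MEDIA_PREFIX):
            flush_prose()
            group.append(MediaItem(line[len(MEDIA_PREFIX):].strip()))
        elif not line:
            # Assumption: only a blank line after caption text closes a group;
            # a blank line after an uncaptioned MEDIA: line is ignored.
            if group and group[-1].caption is not None:
                flush_group()
            elif not group:
                prose.append("")
        elif group:
            # Caption text attaches to the most recent media item.
            last = group[-1]
            last.caption = line if last.caption is None else f"{last.caption}\n{line}"
        else:
            prose.append(line)

    flush_group()
    flush_prose()
    return blocks

SAMPLE = """\
Here is a summary of the catalog.

MEDIA:/path/item_a_view1.png
Item A - view 1
MEDIA:/path/item_a_view2.png
Item A - view 2
MEDIA:/path/item_b_view1.png
Item B - view 1
MEDIA:/path/item_b_view2.png
Item B - view 2

MEDIA:/path/featured.png
Featured item
"""

for block in parse_blocks(SAMPLE):
    print(type(block).__name__)
```

Under these assumptions the sample yields one TextBlock, one MediaGroupBlock with four captioned items, and one ImageBlock, in that order.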

Observed

  1. Album doesn't form. The four images arrive as individual photos
    instead of a single Telegram album. Only the first image carries a
    caption; the rest are bare.
  2. Caption text leaks into the streamed prose. The initial streamed
    message still contains the caption lines ("Item A — view 1", etc.)
    even after media delivery runs. The user sees that text in the prose
    and again on whichever images actually receive a caption.
  3. No way to caption each image individually within an album. The
    parser assigns at most one caption to an entire MediaGroupBlock,
    and placing a caption between two MEDIA: lines splits them into
    separate groups — so there is no response shape that produces a
    single album with per-image captions, even though the Telegram
    sendMediaGroup API supports it natively.
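
For reference, per-item captions in sendMediaGroup are just a caption field on each entry of the JSON-serialized media array. The sketch below builds that array with the standard library only. The type, media, and caption fields and the attach:// scheme follow the Bot API's InputMediaPhoto description; the helper name, the (path, caption) input shape, and the photoN attachment names are hypothetical and not taken from hermes-agent.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple

def build_media_group_payload(
    items: List[Tuple[str, Optional[str]]],
) -> Tuple[str, Dict[str, Path]]:
    """Return the JSON for the `media` parameter plus a map of attachment names to files."""
    # The Bot API requires a media group to contain 2-10 items.
    if not 2 <= len(items) <= 10:
        raise ValueError("a media group must contain 2-10 items")

    media = []
    attachments: Dict[str, Path] = {}
    for index, (path, caption) in enumerate(items):
        attach_name = f"photo{index}"  # hypothetical multipart field name
        entry = {"type": "photo", "media": f"attach://{attach_name}"}
        if caption:
            entry["caption"] = caption  # each item carries its own caption
        media.append(entry)
        attachments[attach_name] = Path(path)
    return json.dumps(media, ensure_ascii=False), attachments

media_json, files = build_media_group_payload([
    ("/path/item_a_view1.png", "Item A - view 1"),
    ("/path/item_a_view2.png", "Item A - view 2"),
])
print(media_json)
```

The caller would still have to upload each file under its attachment name in the multipart request; that part is omitted here.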

Expected

  • One album containing all four images, each with its own caption.
  • One single captioned image.
  • The streamed prose message contains only the intro paragraph — no
    leaked caption text.

Root Causes (from investigation)

  1. TelegramAdapter.send_media_group() opens file handles via
    open(fp, "rb"). When the MarkdownV2 attempt fails, the retry
    path can't reuse the handles cleanly and the call falls through to
    super().send_media_group() (individual sends) while still
    returning success=True.
  2. The delivery path knows to avoid deleting the streamed message when
    TextBlocks coexist with media, but nothing edits the message to
    strip the caption lines after the fact.
  3. _parse_content_blocks() treats caption text as a single
    group-level caption and uses caption-between-MEDIA as a group
    separator, so per-item captions within one album are
    unrepresentable.
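
One plausible mechanism behind root cause 1 is that a file handle consumed by the first attempt sits at end-of-file when the retry reads it. This has not been confirmed against the adapter code; the snippet below only demonstrates the general Python behaviour, with fake_upload() as a hypothetical stand-in for the real send call.

```python
import tempfile
from pathlib import Path

def fake_upload(source):
    """Hypothetical stand-in for an upload: drains a file object, or returns bytes as-is."""
    return source.read() if hasattr(source, "read") else source

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "photo.png"
    path.write_bytes(b"\x89PNG fake image data")

    # File handle: the first attempt drains it, so the retry reads nothing.
    with open(path, "rb") as handle:
        first_attempt = fake_upload(handle)
        retry_attempt = fake_upload(handle)

    # Bytes: the same object can be sent any number of times.
    payload = path.read_bytes()
    bytes_first = fake_upload(payload)
    bytes_retry = fake_upload(payload)

print(len(first_attempt), len(retry_attempt))  # second length is 0
print(bytes_first == bytes_retry)              # True
```

Calling handle.seek(0) before the retry would also work, but holding bytes avoids keeping handles open across attempts.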

Proposed Fix

  • Switch local-file media group items to file_path.read_bytes() so
    the bytes are safely reusable across retry attempts; log a warning
    when the individual-send fallback is actually reached.
  • When a response contains both TextBlocks and captioned media, edit
    the streamed message down to TextBlock content only (falling back
    to delete for pure-media responses).
  • Rework the parser so a caption line immediately following a MEDIA:
    line becomes that item's caption, and a blank line after caption
    text ends the current group. A trailing caption attaches to the last
    item.
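
The second bullet can be sketched as a small decision function. This is an illustration of the intended behaviour, not the patch itself: MediaBlock is a hypothetical stand-in for ImageBlock and MediaGroupBlock, plan_streamed_cleanup() is a made-up name, and the sketch treats any media block as captioned, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class TextBlock:
    text: str

@dataclass
class MediaBlock:  # hypothetical stand-in for ImageBlock / MediaGroupBlock
    paths: List[str]

def plan_streamed_cleanup(
    blocks: List[Union[TextBlock, MediaBlock]],
) -> Tuple[str, Optional[str]]:
    """Decide what to do with the already-streamed message after media delivery."""
    texts = [b.text for b in blocks if isinstance(b, TextBlock)]
    has_media = any(isinstance(b, MediaBlock) for b in blocks)

    if not has_media:
        return ("keep", None)                  # nothing leaked, leave it alone
    if texts:
        return ("edit", "\n\n".join(texts))    # keep prose, drop caption lines
    return ("delete", None)                    # pure-media response

action, new_text = plan_streamed_cleanup([
    TextBlock("Here is a summary of the catalog."),
    MediaBlock(["/path/item_a_view1.png", "/path/item_a_view2.png"]),
])
print(action, new_text)
```

For the repro response this returns an edit whose text is only the intro paragraph.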

Happy to discuss the API shape and parser semantics before the PR
lands. The PR will include tests.

Environment

  • hermes-agent main @ b909a9e
  • Telegram gateway, streaming mode

Labels

  • P2: Medium, degraded but workaround exists
  • comp/gateway: Gateway runner, session dispatch, delivery
  • platform/telegram: Telegram bot adapter
  • type/bug: Something isn't working
