Skip to content

fix(gateway): diagnosable MEDIA delivery + canonical cache roots (#31733)#33251

Closed
GodsBoy wants to merge 2 commits into
NousResearch:mainfrom
GodsBoy:fix/media-delivery-final-response-parity
Closed

fix(gateway): diagnosable MEDIA delivery + canonical cache roots (#31733)#33251
GodsBoy wants to merge 2 commits into
NousResearch:mainfrom
GodsBoy:fix/media-delivery-final-response-parity

Conversation

@GodsBoy

@GodsBoy GodsBoy commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds diagnosable MEDIA delivery, canonical cache roots, and two pieces of
untrusted-input hardening to gateway/platforms/base.py, rebased on top of
the denylist-only default that landed in #34022.

When #34022 flipped the default to denylist-only it restored delivery of stale
and outside-allowlist artifacts, which solved the most visible "MEDIA didn't
attach" reports. It did not, however, make rejections diagnosable, add the
canonical cache roots that strict mode still needs, or fix two latent
robustness/integrity bugs in the same code path. This PR closes those gaps and
complements #34022 rather than competing with it.

What this PR adds (on top of #34022)

  • Diagnosable rejection reasons. validate_media_delivery_path now
    delegates to a private _validate_media_delivery_path_with_reason(path) -> (resolved_or_None, kebab_reason). filter_media_delivery_paths and
    filter_local_delivery_paths log the reason plus a redacted path, so an
    operator can tell denylist-cleared from outside-allowlist-denied-prefix
    from outside-allowlist-stale-mtime from does-not-resolve without
    rebuilding the bot with debug logging. The reason taxonomy covers both the
    default denylist-only path and the opt-in strict allowlist/recency path.

  • Canonical cache roots. MEDIA_DELIVERY_SAFE_ROOTS lists the canonical
    ~/.hermes/cache/{images,audio,videos,documents,screenshots} explicitly
    alongside the legacy ~/.hermes/{type}_cache/ entries. get_hermes_dir
    returns the legacy path when it exists, so IMAGE_CACHE_DIR alone drops the
    canonical layout on co-existing installs. This is the root cause of Generated images under cache/images may not deliver via Telegram MEDIA, while image_cache works #31733
    and still bites in strict mode, where the allowlist is the gate.

  • Log-forging hardening. _redact_path_for_log neutralises control
    characters AND Unicode line separators (NEL U+0085, LS U+2028, PS U+2029) so
    a model-emitted path cannot forge a second operator log line. The C0 range
    alone is not enough: str.splitlines() and most log aggregators split on
    U+2028/U+2029 too.

  • Batch-abort fix. os.path.expanduser is now wrapped so a ~\x00...
    path returns does-not-resolve instead of raising ValueError: embedded null byte, and both filter loops gain a per-item guard. Without this, a
    single crafted path silently drops every other attachment in the response.

Why this is still needed after #34022

Verified against current main:

Relationship to the MEDIA-delivery cluster

How to test

scripts/run_tests.sh tests/gateway/test_platform_base.py
=== Summary: 1 files, 136 tests passed, 0 failed ===

Broader MEDIA-touching suite (tests/gateway, tests/cron/test_scheduler.py):
6127 passed, 0 failed.

New / updated regression coverage:

  • Default denylist-only accept (denylist-cleared) and denied-prefix
    reject, plus strict-mode allowlist/recency reasons, all asserted against
    imported _MEDIA_REASON_* constants (no raw strings)
  • MEDIA_DELIVERY_SAFE_ROOTS contains both legacy and canonical layouts
  • legacy image_cache and canonical cache/images co-exist on disk; both
    deliverable; PDF deliverable from canonical and legacy document dirs
  • direct unit coverage for _redact_path_for_log (elision boundary, C0
    neutralisation, U+2028/U+2029/U+0085 neutralisation, unprintable fallback)
  • U+2028 in a MEDIA path produces exactly one log record
  • ~\x00... path and a synthetic validator raise each drop only the
    offending item; the good attachment survives the batch
  • public validate_media_delivery_path agrees with the reason helper
    across both modes and all branches
  • reason-tag log coverage for not-absolute and not-a-file

Deferred follow-up (out of scope)

  • Surfacing the rejection reason to the calling agent (send_message returns
    {success: true} regardless of dropped attachments; the agent has no
    programmatic way to observe the drop). The new _MEDIA_REASON_* vocabulary
    is the natural shape for a media_rejected: [{path_hint, reason}] field.
  • Unifying the three dispatch-site extension sets
    (_process_message_background, _deliver_media_from_response,
    tools/send_message_tool). This PR only hoists the final-response partition;
    tools/send_message_tool._VIDEO_EXTS still omits .webm today.
  • Pre-existing residuals: symlink-swap of a cache root before image_generate
    writes; TOCTOU between resolve() and upload; /var/tmp and /var/cache
    not in the denylist.

@GodsBoy

GodsBoy commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Telegram DM screenshot of the failure mode this PR targets.

Walkthrough of the conversation in the screenshot:

  1. Agent confirms the PDF exists at /tmp/pool_evidence/Cosmo_Pools_Quotation_PFA30971.pdf (ls -lh terminal block).
  2. Agent emits the final response saying "Here it is" with a MEDIA: tag for that path. No attachment appears in Telegram.
  3. I reply "There is no pdf here?" because there is no attachment, only text.
  4. Agent recovers by running mkdir -p /root/.hermes/document_cache, copying the file in, and invoking the send_message tool against the new path.
  5. Telegram delivers the file natively: Cosmo_Pools_Quotation_PFA30971.pdf, 174.8 KB, message ID 7066.
  6. Agent's own self-narration confirms the diagnosis: "You're right, Telegram didn't attach it from the /tmp path. I copied it into the document cache and sent it directly to Telegram now."

This is the diagnosability gap this PR closes. Today the gateway log for step 2 says only "Skipping unsafe MEDIA directive path outside allowed roots" with no path and no reason, so the agent has nothing to act on except blind retry. With the changes in this PR the same step logs e.g. Skipping unsafe MEDIA directive path /tmp/.../Cosmo_Pools_Quotation_PFA30971.pdf, reason=outside-allowlist-stale-mtime, which is actionable both for the operator (Grafana/log search) and for the agent itself if/when we surface the reason back to the tool return value (tracked as deferred follow-up in the PR description).

Screenshot_20260527_134730_Telegram

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels May 27, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #31764 and #33238 (both targeting the same #31733 cache root mismatch). This PR is the broadest of the three — adds diagnosable rejection reasons and canonical+legacy safe roots. Complements #31764 per author.

@GodsBoy GodsBoy marked this pull request as ready for review May 27, 2026 13:19
@shashb27

Copy link
Copy Markdown

Confirmed this bug cluster from the operator side on Windows 11 / Telegram DM delivery.

In my repro, the normal Hermes reply path reported success but the attachment still did not appear to the user. The decisive signal was:

  • gateway log: Skipping unsafe MEDIA directive path outside allowed roots
  • direct Telegram Bot API sendPhoto with the same JPG succeeded immediately and returned native photo sizes up to 1024x1024

That strongly supports the thesis here that there is a final-response / MEDIA-path parity problem distinct from raw Telegram/media validity.

Separately, I also reproduced a second independent bug in _send_via_adapter() dropping media_files entirely; that one belongs to #23760 / #18686. So from field validation, there are indeed multiple overlapping regressions in this MEDIA-delivery area.

GodsBoy added 2 commits May 29, 2026 08:48
…sResearch#31733)

Builds on the direction in NousResearch#31764 with a broader cut at the same bug
surface that surfaces in operator reports as "MEDIA tag did not attach":

* MEDIA_DELIVERY_SAFE_ROOTS now lists the canonical
  ~/.hermes/cache/{images,audio,videos,documents,screenshots} explicitly
  alongside the legacy ~/.hermes/{type}_cache/ entries. get_hermes_dir
  returns the legacy path when it exists, so IMAGE_CACHE_DIR alone drops
  the canonical layout on co-existing installs (root cause of NousResearch#31733).

* validate_media_delivery_path now delegates to a private
  _validate_media_delivery_path_with_reason helper that returns
  (resolved_or_None, kebab_reason). filter_media_delivery_paths and
  filter_local_delivery_paths log the reason plus a redacted path
  (_redact_path_for_log), so operators chasing rejected MEDIA tags can
  tell allowlist-miss from stale-mtime from denied-prefix from
  does-not-resolve without rebuilding the bot with debug logging.

* _redact_path_for_log neutralises control characters (newline, NUL,
  carriage return) so a model-emitted path containing an embedded
  newline cannot forge a second log line.

* Hoisted _FINAL_RESPONSE_IMAGE_EXTS / _FINAL_RESPONSE_VIDEO_EXTS to
  module scope so the regression test imports the same partition the
  final-response dispatch actually uses, instead of snapshotting a
  local copy that could drift.

* Regression coverage added in tests/gateway/test_platform_base.py:
  - default safe-roots tuple contains both legacy and canonical layouts
  - legacy and canonical image caches co-exist (the strengthened
    failure-shape test NousResearch#31764 review asked for)
  - PDF deliverable from canonical ~/.hermes/cache/documents and
    legacy ~/.hermes/document_cache
  - filter_*_delivery_paths logs the redacted path and the kebab
    reason tag (using the imported _MEDIA_REASON_* constants, not raw
    strings) for stale-mtime, no-recency, does-not-resolve,
    denied-prefix, and newline-injection cases
  - validate_media_delivery_path delegation invariant: public API
    matches the resolved path returned by the reason-bearing helper
  - send_message tool and final-response paths share extract_media +
    filter_media_delivery_paths (asserted via source introspection so
    a future split surfaces here)
  - PDF MEDIA tag routes to send_document via the production
    extension partition (no image / video misroute)

Scope: gateway media delivery only. Does not duplicate NousResearch#32472's
extension-derivation refactor or NousResearch#31561 / NousResearch#33206's markdown attachment
recognition work; if any of those land first, the diagnosable-logging
contract here still applies and the canonical-root entries become a
clean merge against NousResearch#31764.

Deferred follow-up (out of scope for this PR):
- Surfacing the rejection reason to the calling agent (send_message
  tool currently returns {success: true} regardless of dropped
  attachments)
- Unifying the three dispatch-site extension sets across
  _process_message_background, _deliver_media_from_response, and the
  send_message tool path
- Pre-existing residual: symlink-swap of a cache root before
  image_generate writes; TOCTOU between resolve() and upload;
  /var/tmp / /var/cache not in denylist
Review follow-ups on the diagnosable-MEDIA-delivery change:

- _redact_path_for_log now neutralizes Unicode line separators (NEL
  U+0085, LS U+2028, PS U+2029) in addition to the C0/DEL range. The
  previous regex only covered control chars, so a model-emitted path with
  an embedded U+2028 still split into a second log line and defeated the
  stated 'cannot forge a fake second log line' guarantee.
- _validate_media_delivery_path_with_reason wraps os.path.expanduser so a
  '~\x00...' path returns does-not-resolve instead of raising
  ValueError. The filter loops also gain a per-item guard so one
  unprocessable path can no longer abort the batch and silently drop every
  other attachment in the response.
- Inlined the _FINAL_RESPONSE_*_EXTS module constants at the dispatch
  sites, removing the pointless local aliases, and dropped a redundant
  str() coercion in the validator.

Tests (tests/gateway/test_platform_base.py, 116 -> 128):

- Direct unit coverage for _redact_path_for_log (empty, short no-elide,
  long elision, C0 neutralization, Unicode line separators, unprintable
  fallback).
- Batch-isolation regression: NUL-after-tilde path and a synthetic
  validator raise both drop only the offending item; the good attachment
  survives.
- Delegation parity test now reaches the recency-trusted ACCEPT and
  not-a-file branches that were previously unreachable.
- Reason-tag log coverage for not-absolute and not-a-file.
- De-overclaimed the 'identical for both paths' parity test name and
  docstring (it exercises the shared surface once; structural parity is
  asserted separately by the import test).
@GodsBoy GodsBoy force-pushed the fix/media-delivery-final-response-parity branch from 06ec17d to 7e3b47a Compare May 29, 2026 06:59
teknium1 added a commit that referenced this pull request May 29, 2026
…null-path guard

Operators can now see which MEDIA path was dropped and why, generated
artifacts under the canonical ~/.hermes/cache/{images,...} layout deliver,
and a crafted ~\x00 path no longer aborts the whole attachment batch.

- MEDIA_DELIVERY_SAFE_ROOTS: add canonical cache/{images,audio,videos,
  documents,screenshots} alongside the legacy *_cache dirs (#31733).
- filter_media/local_delivery_paths: log the rejected path (was a blind
  "outside allowed roots") via _log_safe_path, which strips control chars
  and Unicode line separators so a model-emitted path can't forge a log line.
- validate_media_delivery_path + extract_media: guard os.path.expanduser
  so a ~\x00 path returns None / is skipped instead of raising and dropping
  every other attachment in the response.

Salvaged and slimmed from #33251 (780 LOC -> 35): the reason-tag taxonomy,
the parts-eliding redactor, and the extension-partition hoist are dropped in
favor of logging the path directly. All three findings were verified and
reproduced by the contributor.

Co-authored-by: wysie <wysie@users.noreply.github.com>
@teknium1

Copy link
Copy Markdown
Contributor

Merged via #34485 (commit e28a668 on main), with your authorship preserved via a Co-authored-by trailer.

All three of your findings were real, verified, and reproduced — and they shipped:

  • Canonical cache roots (Generated images under cache/images may not deliver via Telegram MEDIA, while image_cache works #31733): ~/.hermes/cache/{images,audio,videos,documents,screenshots} now sit in MEDIA_DELIVERY_SAFE_ROOTS alongside the legacy *_cache dirs.
  • Diagnosable rejections: filter_media_delivery_paths / filter_local_delivery_paths now log the rejected path (the old blind "outside allowed roots" is gone), with control-char + Unicode line-separator (NEL/LS/PS) hardening so a model-emitted path can't forge a log line.
  • Batch-abort guard: a crafted ~\x00 path no longer raises ValueError("embedded null byte") and drops every other attachment — it's skipped instead. We also closed the upstream extract_media ingestion site, which this PR's loops didn't cover.

We landed a slimmed implementation (~35 LOC of production change vs ~780): the _MEDIA_REASON_* taxonomy, the _validate_..._with_reason split, the parts-eliding _redact_path_for_log, and the _FINAL_RESPONSE_*_EXTS hoist were dropped in favor of logging the path directly — the path is the diagnostic, and the reason vocabulary was shaped for a media_rejected[] return field that doesn't exist yet. The behavior is identical to what you proposed. Thanks for the thorough repro work on all three.

@teknium1

Copy link
Copy Markdown
Contributor

Attribution correction: the salvage PR #34485 originally credited @wysie by mistake. Your work here (#33251) — all three findings — was correctly authored by you, @GodsBoy (Dewaldt Huysamen). I've fixed the credit on #34485 to reflect that. Thanks for the contribution and for catching the error.

KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…null-path guard

Operators can now see which MEDIA path was dropped and why, generated
artifacts under the canonical ~/.hermes/cache/{images,...} layout deliver,
and a crafted ~\x00 path no longer aborts the whole attachment batch.

- MEDIA_DELIVERY_SAFE_ROOTS: add canonical cache/{images,audio,videos,
  documents,screenshots} alongside the legacy *_cache dirs (NousResearch#31733).
- filter_media/local_delivery_paths: log the rejected path (was a blind
  "outside allowed roots") via _log_safe_path, which strips control chars
  and Unicode line separators so a model-emitted path can't forge a log line.
- validate_media_delivery_path + extract_media: guard os.path.expanduser
  so a ~\x00 path returns None / is skipped instead of raising and dropping
  every other attachment in the response.

Salvaged and slimmed from NousResearch#33251 (780 LOC -> 35): the reason-tag taxonomy,
the parts-eliding redactor, and the extension-partition hoist are dropped in
favor of logging the path directly. All three findings were verified and
reproduced by the contributor.

Co-authored-by: wysie <wysie@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generated images under cache/images may not deliver via Telegram MEDIA, while image_cache works

4 participants