Fix paperclip recall fan-out and Signal typing retry spam by kshitijk4poor · Pull Request #12056 · NousResearch/hermes-agent

kshitijk4poor · 2026-04-18T07:55:58Z

Summary

- serialize session_search summaries and stop retrying on explicit 429 throttling
- turn Signal typing into a throttled background loop so failed sendTyping RPCs do not respawn every 2 seconds
- add regression coverage for session_search rate-limit handling and Signal typing backoff

Problem

The uploaded logs showed two separate loop patterns:

- session_search fanned out several auxiliary summary requests at once, then kept retrying through timeouts and 429 Too Many Requests responses. That pattern is a bad fit for Paperclip-style integrations, which repeatedly ask Hermes for recall.
- Signal sendTyping failures were logged on every refresh cycle, so the adapter stayed noisy and looked stuck while transport health was degraded.

Verification

scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py

Notes

I did not reproduce this against a live Paperclip adapter or a live Signal daemon; the fix is grounded in the uploaded debug report and targeted regression tests.

Paperclip-linked recall was fanning out session summaries in parallel, which triggered rate limits and timeouts in the uploaded logs. Signal typing failures were also logging every refresh cycle, so the adapter kept looking busy while transport health was degraded.

Constraint: Paperclip-style integrations tag tool sessions with source=tool and can trigger repeated session_search recall
Constraint: Signal transport failures should not flood gateway logs every two seconds
Rejected: Keep parallel session summarization with more retries | amplified 429/timeouts in the uploaded logs
Rejected: Disable Signal typing indicators entirely | loses useful UX when the transport is healthy
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If session_search concurrency is raised again, verify auxiliary providers under rate-limit pressure before shipping
Tested: scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
Tested: python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py
Not-tested: Live Paperclip adapter against a real remote provider
Not-tested: Live Signal daemon/network failure behavior
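A minimal sketch of serialized, 429-aware summarization, assuming an async `summarize(session)` callable and a hypothetical `RateLimited` exception raised when the auxiliary provider answers HTTP 429. `Session`, `raw_preview` and `summarize_serially` are illustrative names, not the actual tools/session_search_tool.py API. Once the provider throttles, the loop stops calling it and returns raw transcript previews for the remaining sessions instead of retrying.

```python
import asyncio
from dataclasses import dataclass


class RateLimited(Exception):
    """Hypothetical error for an explicit HTTP 429 from the auxiliary provider."""


@dataclass
class Session:
    session_id: str
    transcript: str


def raw_preview(session: Session, limit: int = 200) -> str:
    # Fallback: a truncated slice of the transcript instead of an LLM summary.
    return session.transcript[:limit]


async def summarize_serially(sessions, summarize):
    """Summarize sessions one at a time (no fan-out). After an explicit 429,
    stop calling the provider and return raw previews for the rest."""
    results = {}
    throttled = False
    for session in sessions:
        if throttled:
            results[session.session_id] = raw_preview(session)
            continue
        try:
            results[session.session_id] = await summarize(session)
        except RateLimited:
            # No retry: further requests would only deepen the throttling.
            throttled = True
            results[session.session_id] = raw_preview(session)
    return results
```

Serial calls trade latency for a predictable request rate against the auxiliary provider, which is the property a repeatedly polling integration needs.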

The review follow-up found two behavioral gaps in the previous fix. Signal typing now distinguishes transport failure from successful JSON-RPC replies that carry a null result, and session_search keeps a bounded serial budget so slow providers degrade to partial raw previews instead of failing the whole tool call.

Constraint: signal-cli side-effect RPCs may succeed with null result payloads
Constraint: session_search must avoid turning slow auxiliary providers into whole-tool failures
Rejected: Treat any missing result as failure | breaks typing if sendTyping returns JSON null on success
Rejected: Leave serial session_search unbounded | can still hit the outer 300s sync-async timeout and lose partial work
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep Signal RPC semantics explicit per method; do not infer success from payload shape without adapter tests
Tested: /Users/kshitij/Projects/hermes-agent/scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
Tested: python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py
Not-tested: Live Paperclip adapter against a real remote provider
Not-tested: Live Signal daemon with JSON null sendTyping responses
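The bounded serial budget can be sketched as one shared deadline across sequential calls, assuming `asyncio.wait_for` enforces each per-call timeout. `summarize_with_budget`, its parameters and the `preview` fallback are hypothetical; the actual budget value is not stated in this PR, but it would need to sit below the outer 300s sync-async timeout so partial work is returned rather than lost.

```python
import asyncio


async def summarize_with_budget(items, summarize, budget_s, preview):
    """Serially summarize items within one shared wall-clock budget.

    Items the budget cannot cover get preview(item) instead, so a slow
    provider yields partial results rather than failing the whole call."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    results = []
    for item in items:
        remaining = deadline - loop.time()
        if remaining <= 0:
            # Budget exhausted: degrade to a raw preview without calling out.
            results.append(preview(item))
            continue
        try:
            # Each call may only use whatever budget is left.
            results.append(await asyncio.wait_for(summarize(item), timeout=remaining))
        except asyncio.TimeoutError:
            results.append(preview(item))
    return results
```

Sharing one deadline, rather than giving each call a fixed timeout, keeps the worst case bounded regardless of how many sessions match.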

base.py's _keep_typing refresh loop calls send_typing every ~2s while the agent is processing. If signal-cli returns NETWORK_FAILURE for the recipient (offline, unroutable, group membership lost), the unmitigated path was a WARNING log every 2 seconds for as long as the agent stayed busy. A user report showed 1048 warnings in 41 minutes for one offline contact, plus the matching volume of pointless RPC traffic to signal-cli.

- _rpc() accepts log_failures=False so callers can route repeated expected failures (typing) to DEBUG while keeping send/receive at WARNING.
- send_typing() tracks consecutive failures per chat. The first failure still logs at WARNING so transport issues remain visible; subsequent failures log at DEBUG. After three consecutive failures we skip the RPC during an exponential cooldown (16s, 32s, 60s cap) so we stop hammering signal-cli for a recipient it can't deliver to. A successful sendTyping resets the counters.
- _stop_typing_indicator() clears the backoff state so the next agent turn starts fresh.

E2E simulation against the reported 41-minute window: RPCs drop from 1230 to 45 (-96%), log lines from 1048 WARNINGs to 1 WARNING + 44 DEBUGs.

Credits kshitijk4poor (#12056) for the _rpc log_failures kwarg idea; the broader restructure in that PR (nested per-chat loop inside send_typing) is avoided here in favour of stateful backoff that preserves base.py's existing _keep_typing architecture.
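A minimal sketch of the per-chat backoff state described in this commit message, assuming a monotonic clock and a cooldown of `min(16 * 2**(n - 3), 60)` seconds after the n-th consecutive failure (n ≥ 3). That formula is one reading of "16s, 32s, 60s cap"; `TypingBackoff` and its methods are hypothetical names, not the adapter's actual attributes.

```python
import logging
import time

logger = logging.getLogger("signal.typing")
logger.addHandler(logging.NullHandler())


class TypingBackoff:
    """Hypothetical per-chat sendTyping backoff: first failure logs WARNING,
    later ones DEBUG; from the third consecutive failure on, RPCs are skipped
    for 16s, 32s, then 60s (cap). Success or reset clears the state."""

    THRESHOLD = 3
    BASE_S = 16.0
    CAP_S = 60.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}    # chat_id -> consecutive failure count
        self._skip_until = {}  # chat_id -> clock time when RPCs may resume

    def should_skip(self, chat_id):
        until = self._skip_until.get(chat_id)
        return until is not None and self._clock() < until

    def record_failure(self, chat_id, error):
        count = self._failures.get(chat_id, 0) + 1
        self._failures[chat_id] = count
        level = logging.WARNING if count == 1 else logging.DEBUG
        logger.log(level, "sendTyping to %s failed (%d in a row): %s", chat_id, count, error)
        if count >= self.THRESHOLD:
            delay = min(self.BASE_S * 2 ** (count - self.THRESHOLD), self.CAP_S)
            self._skip_until[chat_id] = self._clock() + delay
        return level

    def record_success(self, chat_id):
        self.reset(chat_id)

    def reset(self, chat_id):
        # Also the hook for stopping the typing indicator: next turn starts fresh.
        self._failures.pop(chat_id, None)
        self._skip_until.pop(chat_id, None)
```

In an adapter shaped like this, send_typing() would check `should_skip()` before the RPC and call `record_failure()` or `record_success()` afterwards, while _stop_typing_indicator() would call `reset()`.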


teknium1 · 2026-04-18T11:13:44Z

Thanks for digging into this — the Signal typing spam pattern you identified is real and your _rpc(log_failures=...) idea was the right seed. I salvaged that kwarg into #12118 along with per-chat failure-count tracking + an exponential cooldown (16s → 32s → 60s) so we stop the pointless RPCs as well as the log spam. You're credited in the commit message.
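The `log_failures` idea pairs naturally with the follow-up point about null results: a helper has to keep "transport failed" separate from "succeeded with `result: null`". A sketch of an `_rpc`-style helper, assuming a synchronous `send` callable that raises `OSError` on transport errors and returns a decoded JSON-RPC 2.0 reply; `rpc`, `send`, the `RPC_FAILED` sentinel and the error payload in the test are illustrative, not the adapter's code or signal-cli's documented output.

```python
import logging

logger = logging.getLogger("signal.rpc")
logger.addHandler(logging.NullHandler())

# Sentinel for failure, so a successful JSON null result (None) stays distinguishable.
RPC_FAILED = object()


def rpc(send, method, params, *, log_failures=True):
    """Hypothetical JSON-RPC helper. Failures log at WARNING by default, or at
    DEBUG when the caller expects repeats (e.g. typing indicators)."""
    level = logging.WARNING if log_failures else logging.DEBUG
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    try:
        reply = send(request)
    except OSError as exc:
        logger.log(level, "%s transport failure: %s", method, exc)
        return RPC_FAILED
    if "error" in reply:
        logger.log(level, "%s returned error: %s", method, reply["error"])
        return RPC_FAILED
    # {"result": null} is a success for side-effect methods such as sendTyping.
    return reply.get("result")
```

Callers then test `result is RPC_FAILED` rather than truthiness, which is exactly the check that treating any missing result as failure gets wrong.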

I went narrower on the restructure — the version in your PR adds a second per-chat asyncio loop inside send_typing on top of base.py's existing _keep_typing refresh loop, and the two interacting via Task cleanup is more coupling than the fix needs. Stateful backoff inside the existing architecture gets the same behaviour (E2E against the reported 41-minute window: 1230 → 45 RPCs, 1048 WARNINGs → 1 WARNING + 44 DEBUGs).
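The E2E figures can be checked with a small tick simulation, assuming a 2s _keep_typing tick, a recipient that fails every attempt, and a cooldown of `min(16 * 2**(n - 3), 60)` seconds after the n-th consecutive failure. `simulate_typing_rpcs` is a hypothetical helper, not the harness used for #12118.

```python
def simulate_typing_rpcs(window_s=41 * 60, tick_s=2, backoff=True,
                         threshold=3, base_s=16, cap_s=60):
    """Count sendTyping RPCs sent to a permanently unreachable recipient."""
    rpcs = 0
    failures = 0
    skip_until = 0
    for tick in range(window_s // tick_s):
        now = tick * tick_s
        if backoff and now < skip_until:
            continue  # cooling down: no RPC and no log line
        rpcs += 1      # every attempt fails in this scenario
        failures += 1
        if backoff and failures >= threshold:
            skip_until = now + min(base_s * 2 ** (failures - threshold), cap_s)
    return rpcs


print(simulate_typing_rpcs(backoff=False), simulate_typing_rpcs())  # 1230 45
```

Under these assumptions both numbers fall out: 2460s / 2s gives 1230 unthrottled calls, while with backoff there are 3 attempts in the first 4s, one each at 20s and 52s, then one every 60s from 112s (40 more), for 45. With only the first failure logged at WARNING, that is 1 WARNING + 44 DEBUGs. The match suggests, but does not confirm, that the merged change schedules its cooldown this way.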

The session_search half isn't in #12118 — the user's logs show aux timeouts falling back to a local model, not 429s, so the 429 short-circuit wouldn't fire on that data. If a Paperclip user reports repeated 429s from the auxiliary provider we can revisit separately.

Closing in favour of #12118 (merged as 9527707). Appreciate the analysis.


kshitijk4poor added 2 commits April 18, 2026 13:23

teknium1 mentioned this pull request Apr 18, 2026

fix(signal): back off sendTyping spam for unreachable recipients #12118 (Merged)

teknium1 closed this Apr 18, 2026

kshitijk4poor wants to merge 2 commits into NousResearch:main from kshitijk4poor:fix-paperclip-adapter-loop

kshitijk4poor commented Apr 18, 2026

Uh oh!

teknium1 commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kshitijk4poor commented Apr 18, 2026

Summary

Problem

Verification

Notes

Uh oh!

teknium1 commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants