Fix paperclip recall fan-out and Signal typing retry spam #12056

Closed
kshitijk4poor wants to merge 2 commits into NousResearch:main from kshitijk4poor:fix-paperclip-adapter-loop

@kshitijk4poor

Collaborator

Summary

  • serialize session_search summaries and stop retrying on explicit 429 throttling
  • turn Signal typing into a throttled background loop so failed sendTyping RPCs do not respawn every 2 seconds
  • add regression coverage for session_search rate-limit handling and Signal typing backoff

Problem

The uploaded logs showed two separate loop patterns:

  • session_search fanned out multiple auxiliary summary requests at once, then retried through timeouts and 429 Too Many Requests, which is a bad fit for Paperclip-style integrations that repeatedly ask Hermes for recall
  • Signal sendTyping failures were logged on every refresh cycle, so the adapter stayed noisy and looked stuck while transport health was degraded
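The first loop pattern can be illustrated with a minimal sketch of serial summarization that stops calling the provider after an explicit 429. This is not the PR's code: `RateLimited`, `summarize_serially`, the `summarize` callable, and the 200-character raw preview used after throttling are all hypothetical names and choices made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


class RateLimited(Exception):
    """Hypothetical error for an HTTP 429 from the auxiliary provider."""


async def summarize_serially(
    sessions: List[str],
    summarize: Callable[[str], Awaitable[str]],
    preview_chars: int = 200,
) -> List[str]:
    """Summarize one session at a time; stop calling the provider after a 429."""
    results: List[str] = []
    throttled = False
    for text in sessions:
        if throttled:
            # Assumption: once throttled, return a raw preview instead of retrying.
            results.append(text[:preview_chars])
            continue
        try:
            # Awaiting inside the loop keeps at most one request in flight.
            results.append(await summarize(text))
        except RateLimited:
            throttled = True
            results.append(text[:preview_chars])
    return results
```

Because each request is awaited before the next one starts, a throttled provider sees at most one more request from this call rather than a burst of parallel retries.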

Verification

  • scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
  • python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py

Notes

  • I did not reproduce against a live Paperclip adapter or live Signal daemon; this fix is grounded in the uploaded debug report and targeted regression tests.

Paperclip-linked recall was fanning out session summaries in parallel, which triggered rate limits and timeouts in the uploaded logs. Signal typing failures were also logging every refresh cycle, so the adapter kept looking busy while transport health was degraded.

Constraint: Paperclip-style integrations tag tool sessions with source=tool and can trigger repeated session_search recall
Constraint: Signal transport failures should not flood gateway logs every two seconds
Rejected: Keep parallel session summarization with more retries | amplified 429/timeouts in the uploaded logs
Rejected: Disable Signal typing indicators entirely | loses useful UX when the transport is healthy
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If session_search concurrency is raised again, verify auxiliary providers under rate-limit pressure before shipping
Tested: scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
Tested: python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py
Not-tested: Live Paperclip adapter against a real remote provider
Not-tested: Live Signal daemon/network failure behavior

The review follow-up found two behavioral gaps in the previous fix. Signal typing now distinguishes transport failure from successful JSON-RPC replies that carry a null result, and session_search keeps a bounded serial budget so slow providers degrade to partial raw previews instead of failing the whole tool call.
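The null-result distinction can be sketched as follows. In JSON-RPC 2.0 a successful response carries a `result` member whose value may be `null`, while a failed call carries an `error` member. The helper name `rpc_succeeded` and the convention that a transport failure is represented by `None` are assumptions for illustration, not the adapter's actual interface.

```python
from typing import Any, Dict, Optional


def rpc_succeeded(response: Optional[Dict[str, Any]]) -> bool:
    """Decide success from a decoded JSON-RPC 2.0 response.

    Assumption: the caller passes None when the transport itself failed
    (connection refused, timeout, undecodable body).
    """
    if response is None:
        return False  # transport failure: no reply at all
    if response.get("error") is not None:
        return False  # the server answered, but with an error object
    # Test for the key, not the value: {"result": null} is still a success.
    return "result" in response
```

Checking `"result" in response` rather than the truthiness of `response.get("result")` is what keeps a side-effect call that replies with `null` from being counted as a failure.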

Constraint: signal-cli side-effect RPCs may succeed with null result payloads
Constraint: session_search must avoid turning slow auxiliary providers into whole-tool failures
Rejected: Treat any missing result as failure | breaks typing if sendTyping returns JSON null on success
Rejected: Leave serial session_search unbounded | can still hit the outer 300s sync-async timeout and lose partial work
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep Signal RPC semantics explicit per method; do not infer success from payload shape without adapter tests
Tested: /Users/kshitij/Projects/hermes-agent/scripts/run_tests.sh tests/tools/test_session_search.py tests/gateway/test_signal.py -q
Tested: python -m py_compile gateway/platforms/signal.py tools/session_search_tool.py tests/gateway/test_signal.py tests/tools/test_session_search.py
Not-tested: Live Paperclip adapter against a real remote provider
Not-tested: Live Signal daemon with JSON null sendTyping responses
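The bounded serial budget from this commit message might look roughly like the sketch below. It assumes one shared deadline for the whole tool call and a raw-text preview as the degraded result; `summarize_within_budget`, `budget_s`, and `preview_chars` are hypothetical names, and the real budget value is not stated in the PR.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def summarize_within_budget(
    sessions: List[str],
    summarize: Callable[[str], Awaitable[str]],
    budget_s: float,
    preview_chars: int = 200,
) -> List[str]:
    """Summarize serially under one shared time budget.

    Sessions that cannot be summarized in time fall back to a raw preview,
    so the call returns partial work instead of failing as a whole.
    """
    deadline = time.monotonic() + budget_s
    results: List[str] = []
    for text in sessions:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Budget exhausted: skip the provider entirely.
            results.append(text[:preview_chars])
            continue
        try:
            results.append(await asyncio.wait_for(summarize(text), timeout=remaining))
        except asyncio.TimeoutError:
            # Slow provider: keep what we have and degrade this entry.
            results.append(text[:preview_chars])
    return results
```

Choosing `budget_s` below the outer 300s sync-async timeout mentioned above is what would let partial results survive a slow provider.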
teknium1 added a commit that referenced this pull request Apr 18, 2026
base.py's _keep_typing refresh loop calls send_typing every ~2s while
the agent is processing. If signal-cli returns NETWORK_FAILURE for the
recipient (offline, unroutable, group membership lost), the unmitigated
path was a WARNING log every 2 seconds for as long as the agent stayed
busy — a user report showed 1048 warnings in 41 minutes for one
offline contact, plus the matching volume of pointless RPC traffic to
signal-cli.

- _rpc() accepts log_failures=False so callers can route repeated
  expected failures (typing) to DEBUG while keeping send/receive at
  WARNING.
- send_typing() tracks consecutive failures per chat. First failure
  still logs WARNING so transport issues remain visible; subsequent
  failures log at DEBUG. After three consecutive failures we skip the
  RPC during an exponential cooldown (16s, 32s, 60s cap) so we stop
  hammering signal-cli for a recipient it can't deliver to. A
  successful sendTyping resets the counters.
- _stop_typing_indicator() clears the backoff state so the next agent
  turn starts fresh.

E2E simulation against the reported 41-minute window: RPCs drop from
1230 to 45 (-96%), log lines from 1048 WARNINGs to 1 WARNING + 44
DEBUGs.

Credits kshitijk4poor (#12056) for the _rpc log_failures kwarg idea;
the broader restructure in that PR (nested per-chat loop inside
send_typing) is avoided here in favour of stateful backoff that
preserves base.py's existing _keep_typing architecture.
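The stateful backoff described in this commit message can be sketched as a small per-chat tracker. This is an illustration under stated assumptions, not the merged code: `TypingBackoff`, `cooldown_for`, and the doubling formula are hypothetical; only the threshold of three failures, the 16s, 32s, 60s sequence, and the WARNING-then-DEBUG logging come from the message.

```python
import logging
import time
from typing import Callable, Dict

FAILURE_THRESHOLD = 3   # cooldown starts at the third consecutive failure
BASE_COOLDOWN_S = 16.0  # first cooldown quoted in the commit message
MAX_COOLDOWN_S = 60.0   # cap quoted in the commit message


def cooldown_for(failures: int) -> float:
    """Seconds to skip sendTyping after `failures` consecutive failures."""
    if failures < FAILURE_THRESHOLD:
        return 0.0
    # Assumed formula: doubles from 16s, capped at 60s (16, 32, 60, 60, ...).
    return min(MAX_COOLDOWN_S, BASE_COOLDOWN_S * 2 ** (failures - FAILURE_THRESHOLD))


class TypingBackoff:
    """Per-chat failure counters plus a 'skip until' timestamp."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._skip_until: Dict[str, float] = {}

    def should_send(self, chat_id: str) -> bool:
        # Chats with no recorded cooldown are always allowed.
        return self._clock() >= self._skip_until.get(chat_id, float("-inf"))

    def record_failure(self, chat_id: str) -> int:
        """Count a failure and return the log level to use for it."""
        count = self._failures.get(chat_id, 0) + 1
        self._failures[chat_id] = count
        delay = cooldown_for(count)
        if delay > 0:
            self._skip_until[chat_id] = self._clock() + delay
        # Only the first failure in a streak stays visible at WARNING.
        return logging.WARNING if count == 1 else logging.DEBUG

    def reset(self, chat_id: str) -> None:
        """Call on a successful sendTyping or when the typing indicator stops."""
        self._failures.pop(chat_id, None)
        self._skip_until.pop(chat_id, None)
```

In this shape the existing refresh loop would keep ticking every ~2s, and `send_typing` would return early whenever `should_send(chat_id)` is false. That matches the stated design goal of leaving the `_keep_typing` architecture untouched.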
@teknium1

Contributor

Thanks for digging into this — the Signal typing spam pattern you identified is real and your _rpc(log_failures=...) idea was the right seed. I salvaged that kwarg into #12118 along with per-chat failure-count tracking + an exponential cooldown (16s → 32s → 60s) so we stop the pointless RPCs as well as the log spam. You're credited in the commit message.

I went narrower on the restructure — the version in your PR adds a second per-chat asyncio loop inside send_typing on top of base.py's existing _keep_typing refresh loop, and the two interacting via Task cleanup is more coupling than the fix needs. Stateful backoff inside the existing architecture gets the same behaviour (E2E against the reported 41-minute window: 1230 → 45 RPCs, 1048 WARNINGs → 1 WARNING + 44 DEBUGs).

The session_search half isn't in #12118 — the user's logs show aux timeouts falling back to a local model, not 429s, so the 429 short-circuit wouldn't fire on that data. If a Paperclip user reports repeated 429s from the auxiliary provider we can revisit separately.

Closing in favour of #12118 (merged as 9527707). Appreciate the analysis.

@teknium1 teknium1 closed this Apr 18, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…sResearch#12118)

aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…sResearch#12118)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…sResearch#12118)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…sResearch#12118)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…sResearch#12118)
