Skip to content

fix: improve Weixin delivery and model-generated progress updates#7915

Closed
JunityZz wants to merge 2 commits into
NousResearch:mainfrom
JunityZz:fix/weixin-progress-and-delivery
Closed

fix: improve Weixin delivery and model-generated progress updates#7915
JunityZz wants to merge 2 commits into
NousResearch:mainfrom
JunityZz:fix/weixin-progress-and-delivery

Conversation

@JunityZz

Copy link
Copy Markdown

Summary

  • reduce Weixin long-reply fragmentation so normal responses are delivered in far fewer bubbles
  • reserve Weixin reply budget for the final answer and suppress progress sends when quota is nearly exhausted
  • enable Weixin to forward model-generated pre-tool streaming commentary while keeping the final answer on the normal send path
  • fix no-edit stream consumer behavior so tool-boundary commentary is emitted as a complete segment instead of being truncated

Why

Personal Weixin/iLink appears to behave like a limited reply window after each user message. In practice this caused two UX problems:

  • long final answers were split into too many bubbles and got cut off in Weixin
  • model-generated "I'll check X first" commentary visible in CLI was not reliably reaching Weixin

This PR keeps final answers reliable while letting Weixin receive the model's own intermediate commentary segments.

Implementation notes

  • gateway/platforms/weixin.py
    • pack multiline replies into as few chunks as possible
    • cap outbound chunk count
    • track a per-chat reply budget reset on inbound messages
    • reserve one slot for the final answer
    • mark Weixin as supports_stream_edits = False and stream_intermediate_only = True
  • gateway/stream_consumer.py
    • for no-edit adapters, flush complete accumulated commentary at tool boundaries instead of sending partial fragments
    • skip streaming delivery of the final answer for adapters marked stream_intermediate_only
  • gateway/run.py
    • disable non-model tool/lifecycle progress messages for Weixin
    • force stream-consumer setup for Weixin so model-generated stream deltas can be surfaced
    • keep final-response delivery on the normal path even when intermediate commentary was streamed

Test plan

  • uv run --extra dev python -m pytest tests/gateway/test_stream_consumer.py tests/gateway/test_weixin.py -q

Notes

This intentionally avoids heuristic/tool-name-to-human-text rewriting for Weixin progress. Intermediate Weixin updates come from model-generated stream deltas, matching the CLI behavior more closely.

teknium1 added a commit that referenced this pull request Apr 11, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes #7915 (root cause was agent-loop termination, not Weixin delivery limits).
@teknium1 teknium1 closed this in c8aff74 Apr 11, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Thank you for the detailed investigation and clean implementation — the observations about Weixin delivery behavior were valuable.

After tracing the broader "agent stops mid-task" pattern across all gateway platforms, we found three root causes that affect every platform, not just Weixin:

  1. Premature context compression — the 50% threshold fired too early on models with ≤128K context, causing the model to lose track of multi-step plans after summarization
  2. Budget pressure warnings at 70%/90% of iterations — injected "Provide your final response NOW" into tool results, causing models to abandon complex tasks prematurely instead of continuing
  3. No activity reporting during long terminal execution — the gateway's inactivity monitor saw the agent as "idle" during long-running commands

These were addressed in PR #7983 (merged):

  • Compression threshold floor at 64K tokens (50% of context or 64K, whichever is higher)
  • Budget warnings removed entirely — model runs unimpeded until actual exhaustion, then gets one grace call to summarize
  • Activity callback fires every 10s during terminal command execution

The reply budget system, chunk capping, and stream_intermediate_only changes in this PR were solving a symptom rather than the cause. With the agent loop no longer terminating prematurely, Weixin should receive the complete final response without needing quota management.

If you still see delivery issues after this fix lands, please open a new issue — at that point it would genuinely be a Weixin-specific problem worth addressing separately.

Tommyeds pushed a commit to Tommyeds/hermes-agent that referenced this pull request Apr 12, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 28, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
   - The 50% threshold on a 100K-context model fired at 50K tokens,
     causing premature compression that made models lose track of
     multi-step plans.  Now threshold_tokens = max(50% * context, 64K).
   - Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal — grace call instead
   - Removed the 70%/90% iteration budget warnings entirely.  These
     injected '[BUDGET WARNING: Provide your final response NOW]' into
     tool results, causing models to abandon complex tasks prematurely.
   - Now: no warnings during normal execution.  When the budget is
     actually exhausted (90/90), inject a user message asking the model
     to summarise, allow one grace API call, and only then fall back
     to _handle_max_iterations.

3. Activity touches during long terminal execution
   - _wait_for_process polls every 0.2s but never reported activity.
     The gateway's inactivity timeout (default 1800s) would fire during
     long-running commands that appeared 'idle.'
   - Now: thread-local activity callback fires every 10s during the
     poll loop, keeping the gateway's activity tracker alive.
   - Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes NousResearch#7915 (root cause was agent-loop termination, not Weixin delivery limits).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants