fix: improve Weixin delivery and model-generated progress updates by JunityZz · Pull Request #7915 · NousResearch/hermes-agent

JunityZz · 2026-04-11T19:51:24Z

Summary

- reduce Weixin long-reply fragmentation so normal responses are delivered in far fewer bubbles
- reserve Weixin reply budget for the final answer and suppress progress sends when quota is nearly exhausted
- enable Weixin to forward model-generated pre-tool streaming commentary while keeping the final answer on the normal send path
- fix no-edit stream consumer behavior so tool-boundary commentary is emitted as a complete segment instead of being truncated
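
To make the reply-budget item concrete, here is a minimal sketch of a per-chat counter that resets on each inbound message and holds one slot back for the final answer. The class name `ReplyBudget`, its methods, and the default of 5 sends are hypothetical; the PR text does not state the actual window size or the names used in gateway/platforms/weixin.py.

```python
from dataclasses import dataclass, field


@dataclass
class ReplyBudget:
    """Hypothetical per-chat send counter (illustration only, not the PR's class)."""

    max_sends: int = 5            # assumed window size, not a documented Weixin limit
    reserved_for_final: int = 1   # slots progress messages may never consume
    used: dict = field(default_factory=dict)

    def on_inbound(self, chat_id: str) -> None:
        # A new user message opens a fresh reply window for that chat.
        self.used[chat_id] = 0

    def try_send_progress(self, chat_id: str) -> bool:
        # Progress sends are refused once only the reserved slots remain.
        count = self.used.get(chat_id, 0)
        if count >= self.max_sends - self.reserved_for_final:
            return False
        self.used[chat_id] = count + 1
        return True

    def try_send_final(self, chat_id: str) -> bool:
        # The final answer may use any remaining slot, including the reserved one.
        count = self.used.get(chat_id, 0)
        if count >= self.max_sends:
            return False
        self.used[chat_id] = count + 1
        return True
```

With `max_sends=3`, two progress messages go out, a third is suppressed, and the final answer still has a slot.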

Why

Personal Weixin/iLink appears to allow only a limited window of replies after each user message. In practice this caused two UX problems:

- long final answers were split into too many bubbles and got cut off in Weixin
- model-generated "I'll check X first" commentary visible in the CLI was not reliably reaching Weixin

This PR keeps final answers reliable while letting Weixin receive the model's own intermediate commentary segments.

Implementation notes

gateway/platforms/weixin.py
- pack multiline replies into as few chunks as possible
- cap outbound chunk count
- track a per-chat reply budget reset on inbound messages
- reserve one slot for the final answer
- mark Weixin as supports_stream_edits = False and stream_intermediate_only = True
gateway/stream_consumer.py
- for no-edit adapters, flush complete accumulated commentary at tool boundaries instead of sending partial fragments
- skip streaming delivery of the final answer for adapters marked stream_intermediate_only
gateway/run.py
- disable non-model tool/lifecycle progress messages for Weixin
- force stream-consumer setup for Weixin so model-generated stream deltas can be surfaced
- keep final-response delivery on the normal path even when intermediate commentary was streamed
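
As a rough illustration of the chunk packing and capping listed for gateway/platforms/weixin.py, the sketch below greedily packs whole lines into as few chunks as fit and then enforces a chunk cap. `pack_lines`, `max_chars`, and `max_chunks` are hypothetical names, and folding overflow into the last allowed chunk is an assumption; the PR description does not say how the cap is enforced.

```python
def pack_lines(text: str, max_chars: int, max_chunks: int) -> list:
    """Greedily pack whole lines into as few chunks as possible.

    Hypothetical helper: the limits are assumptions, not Weixin's documented values.
    """
    if not text:
        return []
    chunks = []
    current = []   # lines collected for the chunk being built
    size = 0       # length of "\n".join(current)
    for line in text.split("\n"):
        # A single line longer than a chunk is hard-split into max_chars pieces.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)] or [""]
        for piece in pieces:
            added = len(piece) + (1 if current else 0)  # +1 for the joining newline
            if current and size + added > max_chars:
                chunks.append("\n".join(current))
                current, size = [piece], len(piece)
            else:
                current.append(piece)
                size += added
    if current:
        chunks.append("\n".join(current))
    if len(chunks) > max_chunks:
        # Assumption: overflow is folded into the last allowed chunk, which may
        # then exceed max_chars. Truncating instead would be another option.
        chunks = chunks[:max_chunks - 1] + ["\n".join(chunks[max_chunks - 1:])]
    return chunks
```

Packing by whole lines keeps paragraphs and list items intact within a bubble, which is the main reason to prefer it over fixed-width slicing.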

Test plan

uv run --extra dev python -m pytest tests/gateway/test_stream_consumer.py tests/gateway/test_weixin.py -q

Notes

This intentionally avoids heuristic/tool-name-to-human-text rewriting for Weixin progress. Intermediate Weixin updates come from model-generated stream deltas, matching the CLI behavior more closely.

… overhaul, activity tracking

Three root causes of the 'agent stops mid-task' gateway bug:

1. Compression threshold floor (64K tokens minimum)
- The 50% threshold on a 100K-context model fired at 50K tokens, causing premature compression that made models lose track of multi-step plans. Now threshold_tokens = max(50% * context, 64K).
- Models with <64K context are rejected at startup with a clear error.

2. Budget warning removal: grace call instead
- Removed the 70%/90% iteration budget warnings entirely. These injected '[BUDGET WARNING: Provide your final response NOW]' into tool results, causing models to abandon complex tasks prematurely.
- Now: no warnings during normal execution. When the budget is actually exhausted (90/90), inject a user message asking the model to summarise, allow one grace API call, and only then fall back to _handle_max_iterations.

3. Activity touches during long terminal execution
- _wait_for_process polls every 0.2s but never reported activity. The gateway's inactivity timeout (default 1800s) would fire during long-running commands that appeared 'idle.'
- Now: thread-local activity callback fires every 10s during the poll loop, keeping the gateway's activity tracker alive.
- Agent wires _touch_activity into the callback before each tool call.

Also: docs update noting 64K minimum context requirement.

Closes #7915 (root cause was agent-loop termination, not Weixin delivery limits).
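
The threshold rule in item 1 is small enough to write out. This is a sketch, not the merged code: `compression_threshold` and `MIN_CONTEXT_TOKENS` are hypothetical names, and reading "64K" as 64,000 tokens is an assumption (it could equally mean 65,536).

```python
MIN_CONTEXT_TOKENS = 64_000  # assumption: "64K" taken as 64,000 tokens


def compression_threshold(context_tokens: int, ratio: float = 0.5) -> int:
    """Return the token count at which compression would trigger.

    Mirrors threshold_tokens = max(50% * context, 64K) and rejects models
    whose context window is below the floor.
    """
    if context_tokens < MIN_CONTEXT_TOKENS:
        raise ValueError(
            f"context window of {context_tokens} tokens is below the "
            f"{MIN_CONTEXT_TOKENS}-token minimum"
        )
    return max(int(context_tokens * ratio), MIN_CONTEXT_TOKENS)
```

Under these assumptions a 100K-context model compresses at 64,000 tokens instead of 50,000, while a 200K-context model is unaffected and still compresses at 100,000.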

teknium1 · 2026-04-11T23:19:19Z

Thank you for the detailed investigation and clean implementation — the observations about Weixin delivery behavior were valuable.

After tracing the broader "agent stops mid-task" pattern across all gateway platforms, we found three root causes that affect every platform, not just Weixin:

- Premature context compression: the 50% threshold fired too early on models with ≤128K context, causing the model to lose track of multi-step plans after summarization
- Budget pressure warnings at 70%/90% of iterations: these injected "Provide your final response NOW" into tool results, causing models to abandon complex tasks prematurely instead of continuing
- No activity reporting during long terminal execution: the gateway's inactivity monitor saw the agent as "idle" during long-running commands

These were addressed in PR #7983 (merged):

- Compression threshold floor at 64K tokens (50% of context or 64K, whichever is higher)
- Budget warnings removed entirely: the model runs unimpeded until actual exhaustion, then gets one grace call to summarize
- Activity callback fires every 10s during terminal command execution
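
For readers who want to see the shape of the activity-callback fix, here is a minimal sketch of a poll loop that reports activity on a fixed interval through a thread-local callback. The names `set_activity_callback` and `wait_for_process` are hypothetical stand-ins; only the 0.2s poll interval and 10s touch interval come from the commit message for #7983.

```python
import threading
import time
from typing import Callable, Optional

_local = threading.local()  # each thread keeps its own activity callback


def set_activity_callback(cb: Optional[Callable[[], None]]) -> None:
    # Hypothetical hook: the agent would install this before each tool call.
    _local.callback = cb


def wait_for_process(proc, poll_interval: float = 0.2, touch_interval: float = 10.0) -> int:
    """Block until proc exits, reporting activity every touch_interval seconds.

    proc is any object with poll() and returncode, e.g. subprocess.Popen.
    """
    last_touch = time.monotonic()
    while proc.poll() is None:
        time.sleep(poll_interval)
        now = time.monotonic()
        if now - last_touch >= touch_interval:
            cb = getattr(_local, "callback", None)
            if cb is not None:
                cb()  # tells the inactivity monitor the agent is still working
            last_touch = now
    return proc.returncode
```

Using `time.monotonic()` keeps the interval immune to wall-clock adjustments, and the thread-local slot means concurrent agent sessions do not touch each other's activity trackers.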

The reply budget system, chunk capping, and stream_intermediate_only changes in this PR were solving a symptom rather than the cause. With the agent loop no longer terminating prematurely, Weixin should receive the complete final response without needing quota management.

If you still see delivery issues after this fix lands, please open a new issue — at that point it would genuinely be a Weixin-specific problem worth addressing separately.

JunityZz added 2 commits April 12, 2026 03:49

fix: improve Weixin delivery and model-generated progress updates

da0d9df

fix: resolve Weixin merge conflicts with main

7e21d20

teknium1 mentioned this pull request Apr 11, 2026

fix: prevent agent from stopping mid-task — compression floor, budget overhaul, activity tracking #7983

Merged

teknium1 closed this in c8aff74 Apr 11, 2026

JunityZz wants to merge 2 commits into NousResearch:main from JunityZz:fix/weixin-progress-and-delivery
