Skip to content

fix(weixin): Replace cross-loop session reuse with thread-local persistent session#18251

Open
ikelvingo wants to merge 2 commits into
NousResearch:mainfrom
ikelvingo:main
Open

fix(weixin): Replace cross-loop session reuse with thread-local persistent session#18251
ikelvingo wants to merge 2 commits into
NousResearch:mainfrom
ikelvingo:main

Conversation

@ikelvingo

@ikelvingo ikelvingo commented May 1, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Symptom

  • WeChat image/file delivery via the send_message tool fails consistently with:
    RuntimeError: Timeout context manager should be used inside a task
  • The error occurs because aiohttp.TimerContext.current_task() returns None when the session was created on a different event loop than the one calling session.post().
  • Text-only messages sent through the gateway's native adapter path were unaffected (they run inside the gateway's own event loop).

Root Cause

  • v2026.4.23 introduced a "lazy recreate session" optimization in send_weixin_direct().
  • The optimization reused the gateway's long-lived aiohttp.ClientSession (from _LIVE_ADAPTERS) for one-shot sends triggered by the send_message tool.
  • These tool calls execute inside a ThreadPoolExecutor worker thread via _run_async().
  • The gateway's ClientSession was created on the gateway's event loop, but session.post(timeout=...) was called from the worker thread — a cross-event-loop access that aiohttp.TimerContext does not support.
  • TimerContext.__enter__() calls asyncio.current_task(loop=session._loop). Since no task is running on the gateway's loop inside the worker thread, it returns None, raising the RuntimeError.

Intent of the Original Change

  • Avoid re-creating TCP/SSL connections (handshake overhead) on every send_message tool invocation.
  • Reduce latency for repeated WeChat sends (text + image uploads make 3-4 API calls per message).

Risks of the Original Approach

  • Thread-safety violation: aiohttp.ClientSession is explicitly documented as not thread-safe.
  • SSL connection corruption: Concurrent writes to the same TCP connector socket can cause response cross-talk between worker threads.
  • Cookie/token state pollution: Session-level state can leak between concurrent sends for different users.
  • Authentication boundary break: Shared session flattens the isolation between different chat contexts.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

Fix

  1. Added _get_worker_session() to model_tools.py:

    • Each worker thread maintains its own persistent aiohttp.ClientSession via threading.local().
    • The session is bound to the thread's persistent event loop (_get_worker_loop()), so TimerContext.current_task() always finds a valid task.
    • Sessions are lazily created on first use and reused across subsequent calls within the same thread.
    • SSL connector setup mirrors _make_ssl_connector() from gateway/platforms/weixin.py.
  2. Removed lazy recreate branch from send_weixin_direct():

    • Deleted the _LIVE_ADAPTERS.get(resolved_token) shortcut (26 lines).
    • Replaced async with aiohttp.ClientSession(trust_env=True, connector=...) with session = _get_worker_session().
    • The adapter is still created fresh per call (adapter lifecycle unchanged).
    • Function reduced from 98 to 72 lines.

Behavior After Fix

  • Worker threads send WeChat messages through their own session, bound to their own event loop.
  • TCP/SSL connections are still reused within each thread — no per-call handshake penalty.
  • No cross-event-loop access: TimerContext.current_task() always returns the running task on the worker's loop.
  • The gateway's own adapter sending path is unchanged — it still uses its original session on the gateway loop.

Conflict Resolution

During review, a parallel approach was identified: replacing ClientTimeout with asyncio.wait_for() (chenlinfeng, May 3). This masks the TimerContext error but leaves the underlying ClientSession thread-safety violations intact.

Thread-local session (this PR) asyncio.wait_for
Fixes TimerContext error
Fixes ClientSession thread-safety
Fixes SSL connection corruption
Fixes auth/state boundary break
Retains per-thread connection reuse

The thread-local session approach eliminates all root-cause risks simultaneously. asyncio.wait_for() only replaces the error type (TimeoutError vs RuntimeError) without fixing the architectural problem.

Files Changed

  • model_tools.py (+13 lines): Added _get_worker_session() with thread-local persistent session management.
  • gateway/platforms/weixin.py (-21 lines): Removed lazy recreate block; replaced async with ClientSession with persistent worker session getter.

How to Test

Verification

  • send_message tool with WeChat image attachments: ✅ success: true, valid message_id returned.
  • Gateway logs: No RuntimeError or send failed entries post-fix.
  • Thread isolation confirmed: Different worker threads get different sessions; same thread reuses the same session.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform:

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

  • This skill is broadly useful to most users (if bundled) — see Contributing Guide
  • SKILL.md follows the standard format (frontmatter, trigger conditions, steps, pitfalls)
  • No external dependencies that aren't already available (prefer stdlib, curl, existing Hermes tools)
  • I've tested the skill end-to-end: hermes --toolsets skills -q "Use the X skill to do Y"

Screenshots / Logs

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/tools Tool registry, model_tools, toolsets platform/wecom WeCom / WeChat Work adapter labels May 1, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #14873, #14530, #12810 — same cross-event-loop aiohttp ClientSession reuse in send_weixin_direct(). At least 5 open PRs fix this identical root cause.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #14873, #14530, #12810.

@ikelvingo

Copy link
Copy Markdown
Contributor Author

Because in version v2026.4.23, a PR was merged that is neither secure nor compliant with aiohttp's usage guidelines, and this was also not fixed in version v2026.4.30.

@ikelvingo ikelvingo force-pushed the main branch 2 times, most recently from f998cad to 6fca9a3 Compare May 9, 2026 19:57
…stent session

Port of NousResearch#18251:
- model_tools.py: add _get_worker_session() with thread-local persistent
  aiohttp.ClientSession bound to each worker thread's event loop
- gateway/platforms/weixin.py: remove _LIVE_ADAPTERS lazy-reuse block in
  send_weixin_direct(), use _get_worker_session() instead

fix(honcho): sync_turn handles multimodal (list) content from image-bearing messages

Port of NousResearch#22768:
- plugins/memory/honcho/__init__.py: expand type annotation to str | list
  for user_content and assistant_content; add _extract_text() helper to
  extract text parts from multimodal lists before sanitize_context()
- image references are excluded from memory (memory providers should not
  store image URIs)
Port of upstream f24b7ed (fix: make Honcho startup fail open):
- plugins/memory/honcho/__init__.py: refactor startup to background thread
  via _start_session_init_background(), add _session_ready() guard, and
  new _init_thread/_init_lock/_init_error fields for fail-open behavior
- tests/test_honcho_startup_fail_open.py: add 357-line test suite

Port of NousResearch#22768 (fix(honcho): sync_urn crashes on list input from
image-bearing messages):
- plugins/memory/honcho/__init__.py: expand sync_urn type annotation
  to str | list for user_content and assistant_content; add _extract_text()
  helper to extract text parts from multimodal lists; image references
  are excluded from memory (memory providers should not store image URIs)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists platform/wecom WeCom / WeChat Work adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants