fix(gateway): keep auto-continue recovery ephemeral #25561

Open
qWaitCrypto wants to merge 1 commit into NousResearch:main from qWaitCrypto:fix/gateway-autocontinue-compression

@qWaitCrypto

Contributor

What does this PR do?

Fixes gateway auto-continue recovery so an interrupt recovery hint stays ephemeral and cannot be persisted or compacted into long-lived session context.

When a gateway session was interrupted with a trailing tool result, the next turn prepended an auto-continue system note directly into the user message. If that same turn triggered preflight compression, the synthetic note could be saved into the compressed child session and replayed later as if the user had authored it.

This PR keeps gateway recovery notes API-only via persist_user_message, records a durable acknowledgement for each recovered trailing tool batch, and propagates compression-created session_id changes even when the interrupted turn returns no final response.
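
The split between what the model sees and what the transcript stores can be illustrated with a minimal sketch. This is not the gateway code: `TurnInput` and `build_turn_input` are hypothetical names used only for illustration, and the sketch assumes the persisted text is simply the user's original message.

```python
from dataclasses import dataclass

# Note text taken from the reproduction in "How to Test".
RECOVERY_NOTE = (
    "[System note: Your previous turn was interrupted before you could "
    "process the last tool result(s).]"
)


@dataclass
class TurnInput:
    """Hypothetical container for the two views of one user turn."""
    api_message: str        # what is sent to the model for this turn only
    persisted_message: str  # what is written to the transcript


def build_turn_input(user_text: str, needs_recovery: bool) -> TurnInput:
    """Prepend the recovery note for the API call, but never for storage."""
    if needs_recovery:
        api_message = f"{RECOVERY_NOTE}\n\n{user_text}"
    else:
        api_message = user_text
    return TurnInput(api_message=api_message, persisted_message=user_text)


turn = build_turn_input("new user request", needs_recovery=True)
transcript = [{"role": "user", "content": turn.persisted_message}]

print("api_contains_note=", RECOVERY_NOTE in turn.api_message)
print(
    "transcript_contains_note=",
    any("[System note:" in m["content"] for m in transcript),
)
```

Because only `persisted_message` reaches the transcript in this sketch, anything that later summarizes or compresses the transcript never sees the synthetic note.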

Related Issue

Fixes #25242

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py
    • Computes a stable hash key for the trailing consecutive tool-result batch.
    • Skips repeated inferred tool-tail recovery when the same batch was already acknowledged.
    • Passes clean user text through persist_user_message so gateway recovery notes remain API-only and do not persist into transcript history or compression summaries.
    • Syncs compression-rotated agent.session_id before empty/interrupted-result early returns.
  • gateway/session.py
    • Persists auto_continue_tool_tail_key and auto_continue_tool_tail_ack_at on SessionEntry.
    • Adds mark_auto_continue_tool_tail_ack() for successful recovery attempts that reached the model.
  • run_agent.py
    • Checks an already-requested interrupt before starting preflight compression.
  • tests/gateway/test_auto_continue_recovery.py
    • Adds regression coverage for stable tool-tail keys, session-entry round trips, and one-shot recovery acknowledgement.
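
As a rough illustration of the tool-tail key and the acknowledgement check described above, the sketch below hashes the trailing run of tool-result messages and compares it with a stored key. The function names `tool_tail_key` and `ack_matches` are hypothetical stand-ins, not the helpers in `gateway/run.py`, and the choice of SHA-256 over a JSON serialization is an assumption made for this example.

```python
import hashlib
import json
from typing import Optional


def tool_tail_key(history: list[dict]) -> Optional[str]:
    """Return a stable key for the trailing consecutive tool results.

    Returns None when the history does not end with a tool message.
    """
    tail = []
    for msg in reversed(history):
        if msg.get("role") != "tool":
            break
        tail.append(msg)
    if not tail:
        return None
    tail.reverse()  # restore chronological order before hashing
    payload = json.dumps(
        [[m.get("tool_call_id"), m.get("content")] for m in tail],
        ensure_ascii=False,
        default=str,  # assumption: non-JSON content falls back to str()
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ack_matches(acked_key: Optional[str], current_key: Optional[str]) -> bool:
    """True when this exact tool batch was already acknowledged."""
    return bool(current_key) and acked_key == current_key


history = [
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "tool output"},
]
key = tool_tail_key(history)
print("tail_key_present=", bool(key))
print("ack_blocks_replay=", ack_matches(key, tool_tail_key(history)))
```

Keying on the tool batch rather than on the session means a new interruption with different tool results still triggers recovery, while the same batch is recovered only once.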

How to Test

  1. Reproduced the pre-fix persistence shape with the old gateway behavior:

    python - <<'PY'
    note = "[System note: Your previous turn was interrupted before you could process the last tool result(s).]"
    user_message = "new user request"
    prefixed = f"{note}\n\n{user_message}"
    persisted_without_fix = prefixed
    print("before_contains_note=", "[System note:" in persisted_without_fix)
    print("before_persisted=", persisted_without_fix)
    PY

    Result before fix:

    before_contains_note= True
    before_persisted= [System note: Your previous turn was interrupted before you could process the last tool result(s).]
    
    new user request
    
  2. Verified the fixed recovery state blocks replay of the same trailing tool batch and preserves clean user text:

    python - <<'PY'
    from gateway.run import _gateway_tool_tail_recovery_key, _gateway_tool_tail_recovery_ack_matches
    from gateway.session import SessionEntry, Platform
    from datetime import datetime
    history = [
        {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
        {"role": "tool", "tool_call_id": "call_1", "content": "tool output"},
    ]
    key = _gateway_tool_tail_recovery_key(history)
    entry = SessionEntry(
        session_key="sk", session_id="sid", created_at=datetime.now(), updated_at=datetime.now(),
        platform=Platform.TELEGRAM, auto_continue_tool_tail_key=key,
        auto_continue_tool_tail_ack_at=datetime.now(),
    )
    print("after_tail_key=", bool(key))
    print("after_ack_blocks_replay=", _gateway_tool_tail_recovery_ack_matches(entry, key))
    print("after_persist_user_message=", "new user request")
    PY

    Result after fix:

    after_tail_key= True
    after_ack_blocks_replay= True
    after_persist_user_message= new user request
    
  3. Ran syntax checks:

    python -m py_compile gateway/run.py gateway/session.py run_agent.py tests/gateway/test_auto_continue_recovery.py

    Result: passed.

  4. Ran focused gateway/session regression tests:

    /tmp/hermes-provider-refresh-venv/bin/python -m pytest -o addopts= tests/gateway/test_auto_continue_recovery.py tests/test_lazy_session_regressions.py

    Result:

    20 passed in 10.67s
    
  5. Ran the existing persisted-user-message override test with a writable temporary Hermes home:

    HERMES_HOME=/tmp/hermes-test-home /tmp/hermes-provider-refresh-venv/bin/python -m pytest -o addopts= tests/run_agent/test_run_agent.py::TestPersistUserMessageOverride

    Result:

    1 passed in 8.06s
    
  6. Attempted broader gateway usage tests:

    timeout 60 /tmp/hermes-provider-refresh-venv/bin/python -m pytest -o addopts= tests/gateway/test_usage_command.py -x -vv

    Result: timed out after collecting and starting TestUsageCachedAgent::test_cached_agent_shows_detailed_usage; no assertion output was produced before timeout. This appears unrelated to the changed recovery/session paths.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux / WSL, Python 3.13

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

Relevant reproduction output and test logs are included in "How to Test" above.

@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on May 14, 2026

@qWaitCrypto force-pushed the fix/gateway-autocontinue-compression branch from a9cdf0c to b2436b4 on May 19, 2026 04:06

Development

Successfully merging this pull request may close these issues.

Gateway auto-continue note can be persisted and amplified by interrupt-triggered preflight compression