Skip to content

fix(gateway): restore /status totals and preserve callback ownership#12565

Closed
Oxidane-bot wants to merge 1 commit into
NousResearch:mainfrom
Oxidane-bot:fix/gateway-status-token-regression-5960
Closed

fix(gateway): restore /status totals and preserve callback ownership#12565
Oxidane-bot wants to merge 1 commit into
NousResearch:mainfrom
Oxidane-bot:fix/gateway-status-token-regression-5960

Conversation

@Oxidane-bot

Copy link
Copy Markdown
Contributor

Title: fix(gateway): restore /status totals and preserve callback ownership

What does this PR do?

This fixes two closely related gateway regressions in the session-delivery path:

  1. restore the SessionStore compatibility mirror that /status relies on for session token totals
  2. preserve deferred post-delivery callback ownership across interrupted / queued turns by reading the bound run generation at callback-pop time instead of snapshotting it too early

Today the gateway still renders /status from session_entry.total_tokens, but the post-run sync path only persists last_prompt_tokens. That lets the mirrored total drift to 0 even when the authoritative session accounting is non-zero.

Separately, generation-aware deferred callback cleanup could still race with interrupted/queued turns because the background task captured _hermes_run_generation before gateway/run.py finished binding the active run generation to the adapter session event. In that case callback cleanup could fall back to a stale/empty generation and clear a newer callback entry.

This PR fixes both by:

  • forwarding the authoritative cumulative total_tokens from _run_agent
  • mirroring that cumulative total back into SessionStore after completed runs
  • keeping SessionStore.update_session(session_key, value) backward-compatible by leaving the legacy positional second argument mapped to last_prompt_tokens and making mirrored total_tokens opt-in via keyword
  • preserving overwrite semantics so repeated runs do not double-count prior cumulative snapshots
  • avoiding clobbering an existing mirrored total when a failed/no-authoritative-total run reports total_tokens=0
  • reading _hermes_run_generation after the run is bound, so deferred callback cleanup only pops the callback owned by the current run generation

Related Issue

Fixes #5960
Regression context: earlier status-token fix #1476

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • widen gateway/session.py SessionStore.update_session() to accept mirrored total_tokens
  • make the new total_tokens parameter keyword-only so legacy positional update_session(session_key, last_prompt_tokens) callers keep their existing semantics
  • update gateway/run.py post-run sync to mirror authoritative cumulative total_tokens back into SessionStore alongside last_prompt_tokens
  • forward _agent.session_total_tokens from _run_agent so the gateway sync receives the authoritative cumulative total
  • guard failed/no-authoritative-total runs from clobbering an existing nonzero mirrored total
  • update gateway/platforms/base.py so deferred post-delivery callback cleanup reads the bound _hermes_run_generation in finally, after the handler has had a chance to attach current-run ownership
  • add regression coverage for:
    • successful mirroring of cumulative totals
    • /status showing restored totals after a completed run
    • overwrite-not-add behavior across repeated runs
    • failed/no-authoritative-total runs not clobbering an existing mirrored total
    • SessionStore overwrite contract for total_tokens
    • legacy positional update_session(session_key, value) calls continuing to update last_prompt_tokens
    • deferred callback generation snapshot happening after bind so a newer callback entry is not popped by a stale run

How to Test

  1. source venv/bin/activate
  2. pytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -q
  3. pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q
  4. python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py
  5. git diff --check
  6. pytest tests/ -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu/Linux local dev environment

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (no docs changes needed)
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A (no config changes)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (not needed for this fix)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (logic is platform-agnostic gateway/session mirroring and callback ownership)
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A (no tool schema changes)

Screenshots / Logs

  • pytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -q1 passed
  • pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q77 passed
  • python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py → passed
  • git diff --check → passed
  • pytest tests/ -q was also run, but does not pass cleanly in this environment on current upstream main:
    • result: 11212 passed, 86 failed, 35 skipped
    • the observed failures are outside the touched files and cluster in unrelated areas such as auxiliary client auth, memory/user-id plumbing, approval flows, Feishu, voice/audio, and Anthropic error handling

Restore the SessionStore total_tokens mirror used by /status so completed gateway runs no longer drift to Tokens: 0.

Also fix the deferred post-delivery callback race by reading the bound run generation after the handler attaches ownership, preventing stale or empty generations from clearing a newer callback during interrupted or queued turns.

Keep update_session backward-compatible by leaving the legacy positional second argument mapped to last_prompt_tokens and requiring mirrored total_tokens to be passed by keyword.

Constraint: /status still reads SessionStore.total_tokens as a compatibility mirror
Constraint: update_session(session_key, value) may still exist in downstream or older call sites
Rejected: Reordering parameters without a keyword-only guard | still silently breaks positional callers
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Extend SessionStore metadata updates with keyword-only or appended args; do not repurpose legacy positional slots
Tested: pytest tests/gateway/test_session.py::TestLastPromptTokens::test_update_session_positional_arg_preserves_last_prompt_tokens -q
Tested: pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q
Tested: python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py
Tested: git diff --check
Not-tested: Full pytest suite is not green on current upstream main in this environment (86 unrelated baseline failures)
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Apr 23, 2026
@teknium1

teknium1 commented May 1, 2026

Copy link
Copy Markdown
Contributor

The /status totals portion of this PR is fixed on main in 7abc9ce via #17158 — /status now reads directly from SessionDB.

The other fix in this PR — snapshotting callback_generation before the agent can set _hermes_run_generation on the interrupt event, so post-delivery callbacks always register with a stale generation — is a real, separate bug. If you'd like to open a focused PR with just that fix (plus the test_post_delivery_callback_generation_snapshot_happens_after_bind regression test), we'd take it. Closing this one as the totals half is superseded. Thanks for both fixes.

@teknium1 teknium1 closed this May 1, 2026
teknium1 pushed a commit that referenced this pull request May 1, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
teknium1 pushed a commit that referenced this pull request May 1, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
@teknium1

teknium1 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Callback-ownership portion merged as 8d7500d via PR #18219. Your commit was picked up with authorship preserved and a Co-authored-by trailer; you're also added to scripts/release.py AUTHOR_MAP. Thanks for both fixes — this was a real dead-code bug (the generation-ownership guard was being silently bypassed on every run because callback_generation was always None at snapshot time). The /status totals portion was already fixed on main in 7abc9ce via #17158. #18219

donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 00b427a via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 776cded via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in ab932fa via NousResearch#17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: /status Tokens: 0 regressed — SessionStore totals drift from SessionDB/state.db

3 participants