fix(gateway): restore /status totals and preserve callback ownership by Oxidane-bot · Pull Request #12565 · NousResearch/hermes-agent

Oxidane-bot · 2026-04-19T12:59:47Z

Title: fix(gateway): restore /status totals and preserve callback ownership

What does this PR do?

This fixes two closely related gateway regressions in the session-delivery path:

restore the SessionStore compatibility mirror that /status relies on for session token totals
preserve deferred post-delivery callback ownership across interrupted / queued turns by reading the bound run generation at callback-pop time instead of snapshotting it too early

Today the gateway still renders /status from session_entry.total_tokens, but the post-run sync path only persists last_prompt_tokens. That lets the mirrored total drift to 0 even when the authoritative session accounting is non-zero.

Separately, generation-aware deferred callback cleanup could still race with interrupted/queued turns because the background task captured _hermes_run_generation before gateway/run.py finished binding the active run generation to the adapter session event. In that case callback cleanup could fall back to a stale/empty generation and clear a newer callback entry.

This PR fixes both by:

forwarding the authoritative cumulative total_tokens from _run_agent
mirroring that cumulative total back into SessionStore after completed runs
keeping SessionStore.update_session(session_key, value) backward-compatible by leaving the legacy positional second argument mapped to last_prompt_tokens and making mirrored total_tokens opt-in via keyword
preserving overwrite semantics so repeated runs do not double-count prior cumulative snapshots
avoiding clobbering an existing mirrored total when a failed/no-authoritative-total run reports total_tokens=0
reading _hermes_run_generation after the run is bound, so deferred callback cleanup only pops the callback owned by the current run generation

Related Issue

Fixes #5960
Regression context: earlier status-token fix #1476

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✅ Tests (adding or improving test coverage)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

widen gateway/session.py SessionStore.update_session() to accept mirrored total_tokens
make the new total_tokens parameter keyword-only so legacy positional update_session(session_key, last_prompt_tokens) callers keep their existing semantics
update gateway/run.py post-run sync to mirror authoritative cumulative total_tokens back into SessionStore alongside last_prompt_tokens
forward _agent.session_total_tokens from _run_agent so the gateway sync receives the authoritative cumulative total
guard failed/no-authoritative-total runs from clobbering an existing nonzero mirrored total
update gateway/platforms/base.py so deferred post-delivery callback cleanup reads the bound _hermes_run_generation in finally, after the handler has had a chance to attach current-run ownership
add regression coverage for:
- successful mirroring of cumulative totals
- /status showing restored totals after a completed run
- overwrite-not-add behavior across repeated runs
- failed/no-authoritative-total runs not clobbering an existing mirrored total
- SessionStore overwrite contract for total_tokens
- legacy positional update_session(session_key, value) calls continuing to update last_prompt_tokens
- deferred callback generation snapshot happening after bind so a newer callback entry is not popped by a stale run

How to Test

source venv/bin/activate
pytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -q
pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q
python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py
git diff --check
pytest tests/ -q

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: Ubuntu/Linux local dev environment

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A (no docs changes needed)
I've updated cli-config.yaml.example if I added/changed config keys — or N/A (no config changes)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (not needed for this fix)
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (logic is platform-agnostic gateway/session mirroring and callback ownership)
I've updated tool descriptions/schemas if I changed tool behavior — or N/A (no tool schema changes)

Screenshots / Logs

pytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -q → 1 passed
pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q → 77 passed
python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py → passed
git diff --check → passed
pytest tests/ -q was also run, but does not pass cleanly in this environment on current upstream main:
- result: 11212 passed, 86 failed, 35 skipped
- the observed failures are outside the touched files and cluster in unrelated areas such as auxiliary client auth, memory/user-id plumbing, approval flows, Feishu, voice/audio, and Anthropic error handling

Restore the SessionStore total_tokens mirror used by /status so completed gateway runs no longer drift to Tokens: 0. Also fix the deferred post-delivery callback race by reading the bound run generation after the handler attaches ownership, preventing stale or empty generations from clearing a newer callback during interrupted or queued turns. Keep update_session backward-compatible by leaving the legacy positional second argument mapped to last_prompt_tokens and requiring mirrored total_tokens to be passed by keyword. Constraint: /status still reads SessionStore.total_tokens as a compatibility mirror Constraint: update_session(session_key, value) may still exist in downstream or older call sites Rejected: Reordering parameters without a keyword-only guard | still silently breaks positional callers Confidence: high Scope-risk: narrow Reversibility: clean Directive: Extend SessionStore metadata updates with keyword-only or appended args; do not repurpose legacy positional slots Tested: pytest tests/gateway/test_session.py::TestLastPromptTokens::test_update_session_positional_arg_preserves_last_prompt_tokens -q Tested: pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q Tested: python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py Tested: git diff --check Not-tested: Full pytest suite is not green on current upstream main in this environment (86 unrelated baseline failures)

teknium1 · 2026-05-01T03:30:07Z

The /status totals portion of this PR is fixed on main in 7abc9ce via #17158 — /status now reads directly from SessionDB.

The other fix in this PR — snapshotting callback_generation before the agent can set _hermes_run_generation on the interrupt event, so post-delivery callbacks always register with a stale generation — is a real, separate bug. If you'd like to open a focused PR with just that fix (plus the test_post_delivery_callback_generation_snapshot_happens_after_bind regression test), we'd take it. Closing this one as the totals half is superseded. Thanks for both fixes.

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR #12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in 7abc9ce via #17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

teknium1 · 2026-05-01T03:41:32Z

Callback-ownership portion merged as 8d7500d via PR #18219. Your commit was picked up with authorship preserved and a Co-authored-by trailer; you're also added to scripts/release.py AUTHOR_MAP. Thanks for both fixes — this was a real dead-code bug (the generation-ownership guard was being silently bypassed on every run because callback_generation was always None at snapshot time). The /status totals portion was already fixed on main in 7abc9ce via #17158. #18219

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in 7abc9ce via NousResearch#17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in 00b427a via NousResearch#17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in 776cded via NousResearch#17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in 7abc9ce via NousResearch#17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

…before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in ab932fa via NousResearch#17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Apr 23, 2026

alt-glitch mentioned this pull request Apr 30, 2026

fix(gateway): /status reads token totals from SessionDB #5989

Closed

3 tasks

teknium1 mentioned this pull request May 1, 2026

fix(gateway): /status reads token totals from SessionDB (regression of #1465) #18207

Closed

teknium1 closed this May 1, 2026

teknium1 mentioned this pull request May 1, 2026

fix(gateway): snapshot callback generation after agent binds it, not before #18219

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): restore /status totals and preserve callback ownership#12565

fix(gateway): restore /status totals and preserve callback ownership#12565
Oxidane-bot wants to merge 1 commit into
NousResearch:mainfrom
Oxidane-bot:fix/gateway-status-token-regression-5960

Oxidane-bot commented Apr 19, 2026

Uh oh!

teknium1 commented May 1, 2026

Uh oh!

teknium1 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Oxidane-bot commented Apr 19, 2026

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Screenshots / Logs

Uh oh!

teknium1 commented May 1, 2026

Uh oh!

teknium1 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants