fix(gateway): restore /status totals and preserve callback ownership#12565
fix(gateway): restore /status totals and preserve callback ownership#12565Oxidane-bot wants to merge 1 commit into
Conversation
Restore the SessionStore total_tokens mirror used by /status so completed gateway runs no longer drift to Tokens: 0. Also fix the deferred post-delivery callback race by reading the bound run generation after the handler attaches ownership, preventing stale or empty generations from clearing a newer callback during interrupted or queued turns. Keep update_session backward-compatible by leaving the legacy positional second argument mapped to last_prompt_tokens and requiring mirrored total_tokens to be passed by keyword. Constraint: /status still reads SessionStore.total_tokens as a compatibility mirror Constraint: update_session(session_key, value) may still exist in downstream or older call sites Rejected: Reordering parameters without a keyword-only guard | still silently breaks positional callers Confidence: high Scope-risk: narrow Reversibility: clean Directive: Extend SessionStore metadata updates with keyword-only or appended args; do not repurpose legacy positional slots Tested: pytest tests/gateway/test_session.py::TestLastPromptTokens::test_update_session_positional_arg_preserves_last_prompt_tokens -q Tested: pytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q Tested: python -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py Tested: git diff --check Not-tested: Full pytest suite is not green on current upstream main in this environment (86 unrelated baseline failures)
|
The /status totals portion of this PR is fixed on main in 7abc9ce via #17158 — /status now reads directly from SessionDB. The other fix in this PR — snapshotting |
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
|
Callback-ownership portion merged as 8d7500d via PR #18219. Your commit was picked up with authorship preserved and a Co-authored-by trailer; you're also added to |
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 00b427a via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 776cded via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR NousResearch#12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in ab932fa via NousResearch#17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
Title: fix(gateway): restore /status totals and preserve callback ownership
What does this PR do?
This fixes two closely related gateway regressions in the session-delivery path:
/statusrelies on for session token totalsToday the gateway still renders
/statusfromsession_entry.total_tokens, but the post-run sync path only persistslast_prompt_tokens. That lets the mirrored total drift to0even when the authoritative session accounting is non-zero.Separately, generation-aware deferred callback cleanup could still race with interrupted/queued turns because the background task captured
_hermes_run_generationbeforegateway/run.pyfinished binding the active run generation to the adapter session event. In that case callback cleanup could fall back to a stale/empty generation and clear a newer callback entry.This PR fixes both by:
total_tokensfrom_run_agentSessionStoreafter completed runsSessionStore.update_session(session_key, value)backward-compatible by leaving the legacy positional second argument mapped tolast_prompt_tokensand making mirroredtotal_tokensopt-in via keywordtotal_tokens=0_hermes_run_generationafter the run is bound, so deferred callback cleanup only pops the callback owned by the current run generationRelated Issue
Fixes #5960
Regression context: earlier status-token fix #1476
Type of Change
Changes Made
gateway/session.pySessionStore.update_session()to accept mirroredtotal_tokenstotal_tokensparameter keyword-only so legacy positionalupdate_session(session_key, last_prompt_tokens)callers keep their existing semanticsgateway/run.pypost-run sync to mirror authoritative cumulativetotal_tokensback into SessionStore alongsidelast_prompt_tokens_agent.session_total_tokensfrom_run_agentso the gateway sync receives the authoritative cumulative totalgateway/platforms/base.pyso deferred post-delivery callback cleanup reads the bound_hermes_run_generationinfinally, after the handler has had a chance to attach current-run ownership/statusshowing restored totals after a completed runtotal_tokensupdate_session(session_key, value)calls continuing to updatelast_prompt_tokensHow to Test
source venv/bin/activatepytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -qpytest tests/gateway/test_status_command.py tests/gateway/test_session.py -qpython -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.pygit diff --checkpytest tests/ -qChecklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/A (no docs changes needed)cli-config.yaml.exampleif I added/changed config keys — or N/A (no config changes)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/A (not needed for this fix)Screenshots / Logs
pytest tests/gateway/test_status_command.py::test_post_delivery_callback_generation_snapshot_happens_after_bind -q→1 passedpytest tests/gateway/test_status_command.py tests/gateway/test_session.py -q→77 passedpython -m py_compile gateway/platforms/base.py gateway/run.py gateway/session.py tests/gateway/test_status_command.py tests/gateway/test_session.py→ passedgit diff --check→ passedpytest tests/ -qwas also run, but does not pass cleanly in this environment on current upstreammain:11212 passed, 86 failed, 35 skipped