fix(gateway): snapshot callback generation after agent binds it, not before#18219

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-ad6e171a
May 1, 2026
@teknium1 teknium1 commented May 1, 2026

Salvaged from #12565 (@Oxidane-bot) — just the callback-ownership portion. The /status totals half of that PR was already fixed on main in 7abc9ce via #17158.

Summary

Stale runs could fire a fresher run's post-delivery callback because the generation-ownership check was silently bypassed.

Root cause

_process_message_background in gateway/platforms/base.py snapshotted callback_generation at the top of the task:

interrupt_event = self._active_sessions.get(session_key) or asyncio.Event()
self._active_sessions[session_key] = interrupt_event
callback_generation = getattr(interrupt_event, "_hermes_run_generation", None)

But _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent — which runs inside the await self._message_handler(event) below. The early snapshot always captured None.

That None then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. Inside pop_post_delivery_callback, generation=None with a tuple-registered (generation, callback) entry bypasses the entry_generation != generation check, so the entry is popped and its callback fired regardless of which run owns it.
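To make the bypass concrete, here is a minimal standalone model of the pop logic described above. It is a sketch, not the code in gateway/platforms/base.py: the function name `pop_callback`, the plain-dict registry, and the session key "s1" are assumptions made for illustration. Only the generation=None behaviour is taken from the description.

```python
from typing import Callable, Dict, Optional, Tuple, Union

Callback = Callable[[], None]
# An entry is either a bare callback or a (generation, callback) tuple.
Entry = Union[Callback, Tuple[int, Callback]]


def pop_callback(
    registry: Dict[str, Entry],
    session_key: str,
    generation: Optional[int] = None,
) -> Optional[Callback]:
    """Hypothetical stand-in for the pop described in this PR."""
    entry = registry.get(session_key)
    if entry is None:
        return None
    if isinstance(entry, tuple):
        entry_generation, callback = entry
        # The ownership check only runs when the caller passes a generation.
        if generation is not None and entry_generation != generation:
            return None  # leave the entry for the run that owns it
        del registry[session_key]
        return callback
    del registry[session_key]
    return entry


fired = []
registry: Dict[str, Entry] = {"s1": (2, lambda: fired.append("newer"))}

# A caller that knows it is generation 1 is refused the generation-2 entry.
refused = pop_callback(registry, "s1", generation=1)
still_registered = "s1" in registry

# A caller passing generation=None skips the check and takes the entry.
taken = pop_callback(registry, "s1", generation=None)
if taken is not None:
    taken()
```

In this model the stale caller only gets the callback when it passes None, which is what the early snapshot always produced.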

Fix

Move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound.
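The timing difference can be sketched as follows. This is a simplified illustration under assumptions, not the actual diff: `AdapterSketch`, `process`, and `binding_handler` are hypothetical names, and the handler here sets `_hermes_run_generation` directly to stand in for what GatewayRunner._bind_adapter_run_generation does during the handler await.

```python
import asyncio


class AdapterSketch:
    """Hypothetical, reduced model of the background message task."""

    def __init__(self):
        self._active_sessions = {}
        self.snapshots = []  # (early, late) pairs, recorded for comparison

    async def process(self, session_key, handler):
        interrupt_event = self._active_sessions.get(session_key) or asyncio.Event()
        self._active_sessions[session_key] = interrupt_event
        # Pre-fix position: nothing has bound a generation yet.
        early = getattr(interrupt_event, "_hermes_run_generation", None)
        try:
            await handler(interrupt_event)
        finally:
            # Post-fix position: the handler has already bound the generation.
            late = getattr(interrupt_event, "_hermes_run_generation", None)
            self.snapshots.append((early, late))


async def binding_handler(event):
    # Stand-in for the bind that happens inside the real handler.
    event._hermes_run_generation = 1


adapter = AdapterSketch()
asyncio.run(adapter.process("s1", binding_handler))
```

Reading the attribute in the finally block means the value passed on to the pop is the generation of the run that just finished, rather than None.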

Validation

New regression test: test_post_delivery_callback_generation_snapshot_happens_after_bind

  • Simulates a stale handler at generation=1 and a fresher callback registered at generation=2
  • Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership (fired == ['newer'])
  • Post-fix: snapshot=1 → pop skips the mismatched entry (fired == [])
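The scenario in the bullets above can be reproduced in a small self-contained script. This is a sketch of the idea only and is not the body of test_post_delivery_callback_generation_snapshot_happens_after_bind; `pop_callback`, `run_stale_turn`, and the `snapshot_after_bind` switch are hypothetical names introduced for illustration.

```python
import asyncio


def pop_callback(registry, key, generation=None):
    """Hypothetical pop: generation=None skips the ownership check."""
    entry = registry.get(key)
    if entry is None:
        return None
    entry_generation, callback = entry
    if generation is not None and entry_generation != generation:
        return None
    del registry[key]
    return callback


async def run_stale_turn(snapshot_after_bind):
    fired = []
    registry = {}
    event = asyncio.Event()
    # Early snapshot, taken before the handler runs (pre-fix behaviour).
    early = getattr(event, "_hermes_run_generation", None)

    async def stale_handler():
        # The stale run is bound to generation 1 ...
        event._hermes_run_generation = 1
        # ... while a fresher run has registered a callback at generation 2.
        registry["s1"] = (2, lambda: fired.append("newer"))

    try:
        await stale_handler()
    finally:
        # Late snapshot, taken after the handler has bound the generation.
        late = getattr(event, "_hermes_run_generation", None)
        generation = late if snapshot_after_bind else early
        callback = pop_callback(registry, "s1", generation=generation)
        if callback is not None:
            callback()
    return fired, "s1" in registry


pre_fix = asyncio.run(run_stale_turn(snapshot_after_bind=False))
post_fix = asyncio.run(run_stale_turn(snapshot_after_bind=True))
```

With the early snapshot the stale turn fires the newer callback; with the late snapshot the entry stays registered for the run that owns it.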

Verified: test FAILS on current main, PASSES with this fix. Reverted the base.py change locally to confirm the test actually catches the bug.

| | Before | After |
|---|---|---|
| Snapshot timing | Before handler binds generation | After handler binds generation |
| Ownership check | Silently bypassed (always None) | Enforced with real generation |
| 115 tests in affected files | 114 pass | 115 pass |

Credit

Authored by @Oxidane-bot (from #12565), with a Co-authored-by trailer. Also adds them to scripts/release.py AUTHOR_MAP.

…before

_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.

In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).

Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.

Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.

Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.

Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).

Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
@teknium1 teknium1 merged commit 8d7500d into main May 1, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the hermes/hermes-ad6e171a branch May 1, 2026 03:41
@alt-glitch alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/gateway (Gateway runner, session dispatch, delivery) labels May 1, 2026