fix(gateway): snapshot callback generation after agent binds it, not before#18219
Merged
Conversation
…before
_process_message_background snapshotted callback_generation from the
interrupt event at the TOP of the task — before the handler ran.
_hermes_run_generation is only set on the event by
GatewayRunner._bind_adapter_run_generation during
_handle_message_with_agent, which runs DURING the handler await. The
early snapshot always captured None, which then flowed into
pop_post_delivery_callback(..., generation=None) in the finally block.
In pop_post_delivery_callback, generation=None with a tuple-registered
entry (generation, callback) bypasses the ownership check — it pops and
fires the callback regardless of which run owns it. Result: a stale run
could fire a fresher run's post-delivery callback (e.g. a
background-review notification attributed to the wrong turn).
Fix: move the snapshot into the finally block, after the handler has
run and _hermes_run_generation has been bound to the current run.
Regression test added: simulates a stale handler at generation=1 and a
fresher callback registered at generation=2. Pre-fix: snapshot=None →
pop fires the generation=2 callback under generation=1's ownership
("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched
entry, callback stays in the dict for the correct run to claim.
Verified: test FAILS on current main (captures "newer" in fired list),
PASSES with this fix.
Salvaged from PR #12565 (the callback-ownership portion only; the
/status totals portion was already fixed on main in 7abc9ce via #17158).
Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>
19 tasks
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Salvaged from #12565 (@Oxidane-bot) — just the callback-ownership portion. The /status totals half of that PR was already fixed on main in 7abc9ce via #17158.
Summary
Stale runs could fire a fresher run's post-delivery callback because the generation-ownership check was silently bypassed.
Root cause
_process_message_backgroundingateway/platforms/base.pysnapshottedcallback_generationat the top of the task:But
_hermes_run_generationis only set on the event byGatewayRunner._bind_adapter_run_generationduring_handle_message_with_agent— which runs inside theawait self._message_handler(event)below. The early snapshot always capturedNone.That
Nonethen flowed intopop_post_delivery_callback(..., generation=None)in the finally block. Inside pop,generation=Nonewith a tuple-registered entry bypasses theentry_generation != generationcheck, pops, and fires the callback regardless of which run owns it.Fix
Move the snapshot into the
finallyblock, after the handler has run and_hermes_run_generationhas been bound.Validation
New regression test:
test_post_delivery_callback_generation_snapshot_happens_after_bindfired == ['newer'])fired == [])Verified: test FAILS on current main, PASSES with this fix. Reverted the
base.pychange locally to confirm the test actually catches the bug.Credit
Authored by @Oxidane-bot (from #12565), with a Co-authored-by trailer. Also adds them to
scripts/release.pyAUTHOR_MAP.