fix(adapter): clear _pending_enqueued_at on teardown + cancellation (#33)

linxule · linxule · commit 268daf12bacc · 2026-04-27T10:35:58.000+01:00
The Kimi adapter maintains _pending_enqueued_at as a parallel TTL dict
alongside the base class's _pending_messages and _active_sessions
(gateway/platforms/base.py). The base clears the latter two during
cancel_background_tasks (base.py:2553-2554), but our subclass's
parallel dict is untouched, so entries leak across reconnects: the
gateway reuses the same adapter instance (gateway/run.py:2725-2729
calls cancel_background_tasks and then disconnect on it, and that
same instance serves the next connect).
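
A minimal sketch of the leak described above, assuming simplified
stand-ins for the real classes. BaseAdapter, LeakyAdapter and
simulate_reconnects are hypothetical names for illustration only; the
real base also drains background tasks before clearing.

```python
import asyncio


class BaseAdapter:
    """Hypothetical stand-in for the base class; not the real gateway code."""

    def __init__(self) -> None:
        self._pending_messages: dict[str, str] = {}
        self._active_sessions: dict[str, asyncio.Event] = {}

    async def cancel_background_tasks(self) -> None:
        # Only the final clears are modelled here; the task drain is omitted.
        self._pending_messages.clear()
        self._active_sessions.clear()


class LeakyAdapter(BaseAdapter):
    """Subclass with a parallel dict that the base knows nothing about."""

    def __init__(self) -> None:
        super().__init__()
        self._pending_enqueued_at: dict[str, float] = {}


async def simulate_reconnects(adapter: LeakyAdapter, cycles: int) -> int:
    """Stamp one session per cycle, tear down, and count leftover entries."""
    for i in range(cycles):
        key = f"dm:session:{i}"
        adapter._pending_messages[key] = "queued"
        adapter._pending_enqueued_at[key] = float(i)
        # The same adapter instance is torn down and then reused.
        await adapter.cancel_background_tasks()
    return len(adapter._pending_enqueued_at)


leaky = LeakyAdapter()
leaked = asyncio.run(simulate_reconnects(leaky, cycles=3))
```

In this sketch the base dicts end up empty after every cycle while the
subclass dict keeps one entry per cycle, which is the growth pattern
the fix removes.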

Three layered guarantees close the leak:

1. cancel_background_tasks override mirrors the base's clear() —
   this is the primary fix. super() runs the drain (await
   asyncio.gather over _background_tasks) and clears base state;
   our override then clears _pending_enqueued_at as a final sweep.
   Order matters: super() first, our clear() last. Reversed, an
   in-flight handler's finally block could re-insert a key after
   our clear during the gather. With this order, every handler's
   finally has already run and self-popped via the
   "if session_key not in _pending_messages" guard.

2. disconnect() also clears as defense-in-depth for direct-
   disconnect call sites that bypass cancel_background_tasks
   (gateway/run.py:_safe_adapter_disconnect, the error-recovery
   branch in connect()). Documented limitation: those direct paths
   don't clear base _pending_messages or _active_sessions either —
   pre-existing behaviour, out of scope.

3. handle_message wraps the post-super cleanup in try/finally so
   any unexpected exception from super (most relevantly a
   CancelledError propagating up from base.handle_message) doesn't
   skip the cleanup. The cleanup guard is deterministic under
   cancellation because base's _pending_messages writes happen
   synchronously with no await between the write and the function
   return — so the guard observes the slot in one of two known
   states: empty (pop) or owned by a follow-up (preserve fresh ts).
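
The ordering in (1) and the guarded try/finally in (3) can be sketched
as follows. This is an illustrative model under stated assumptions, not
the adapter's code: BaseAdapter, FixedAdapter and demo are hypothetical
names, the base's drain is reduced to a plain clear, and the long sleep
stands in for real handler work.

```python
import asyncio


class BaseAdapter:
    """Hypothetical stand-in for the base class; not the real gateway code."""

    def __init__(self) -> None:
        self._pending_messages: dict[str, str] = {}

    async def handle_message(self, session_key: str) -> None:
        await asyncio.sleep(3600)  # stands in for long-running handler work

    async def cancel_background_tasks(self) -> None:
        self._pending_messages.clear()  # drain omitted in this sketch


class FixedAdapter(BaseAdapter):
    def __init__(self) -> None:
        super().__init__()
        self._pending_enqueued_at: dict[str, float] = {}

    async def handle_message(self, session_key: str) -> None:
        self._pending_enqueued_at[session_key] = 1.0  # stamp before super()
        try:
            await super().handle_message(session_key)
        finally:
            # Runs on CancelledError too. Pop only when no follow-up
            # message currently owns the pending slot.
            if session_key not in self._pending_messages:
                self._pending_enqueued_at.pop(session_key, None)

    async def cancel_background_tasks(self) -> None:
        await super().cancel_background_tasks()  # base clears first
        self._pending_enqueued_at.clear()  # our final sweep last


async def demo() -> tuple[bool, dict, dict]:
    adapter = FixedAdapter()
    task = asyncio.create_task(adapter.handle_message("dm:a"))
    await asyncio.sleep(0)  # let the handler stamp and start awaiting
    task.cancel()
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True
    after_cancel = dict(adapter._pending_enqueued_at)

    adapter._pending_enqueued_at["dm:b"] = 2.0  # simulate a stray entry
    await adapter.cancel_background_tasks()
    return cancelled, after_cancel, dict(adapter._pending_enqueued_at)


cancelled, after_cancel, after_teardown = asyncio.run(demo())
```

Here the cancelled handler removes its own timestamp through the
finally block, and the teardown override sweeps whatever remains.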

Plus connect() clears _pending_enqueued_at at session start as
belt-and-braces against any future code path that bypasses the
teardown machinery entirely (e.g. partial-init reuse). Cheap and
idempotent.

Three-way review feedback applied:
- Codex caught that my original docstring overstated the consequence
  (a stale-only timestamp can't actually evict a fresh message — the
  TTL guard at line ~1247 first does _pending_messages.get and bails
  if None). Reworded as "memory hygiene, not correctness".
- Claude caught that the try/finally docstring rationale was
  misdirected (cancel_background_tasks cancels _background_tasks,
  not direct handle_message callers). Reframed as "any unexpected
  super() exception".
- Kimi audited other parallel dicts (_last_message_id_per_room,
  _probe_msg_id_room_counts) and confirmed they don't need clearing
  — the first persists by design as a replay-dedup anchor; the
  second is debug-only and gated behind isEnabledFor(DEBUG).
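
The "memory hygiene, not correctness" point from the first review item
can be illustrated with a small sketch. The function below is a
hypothetical reconstruction of the guard's shape based only on the
description above (slot lookup first, bail if None); should_evict and
ttl_seconds are assumed names, and the real guard in handle_message is
not part of this diff.

```python
from typing import Any


def should_evict(
    pending_messages: dict[str, Any],
    pending_enqueued_at: dict[str, float],
    session_key: str,
    now: float,
    ttl_seconds: float,
) -> bool:
    """Return True only when a live pending slot has outlived its TTL."""
    # Slot lookup comes first: without a live slot there is nothing to
    # evict, no matter what the timestamp dict still holds.
    if pending_messages.get(session_key) is None:
        return False
    enqueued_at = pending_enqueued_at.get(session_key)
    if enqueued_at is None:
        return False
    return now - enqueued_at > ttl_seconds


# A stale timestamp with no matching slot never triggers eviction.
stale_only = should_evict({}, {"dm:a": 0.0}, "dm:a", now=1000.0, ttl_seconds=60.0)
# A live slot whose timestamp is older than the TTL does.
expired = should_evict({"dm:a": "msg"}, {"dm:a": 0.0}, "dm:a", now=1000.0, ttl_seconds=60.0)
# A live slot with a recent timestamp is kept.
fresh = should_evict({"dm:a": "msg"}, {"dm:a": 990.0}, "dm:a", now=1000.0, ttl_seconds=60.0)
```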

Regression tests (tests/test_kimi.py PendingEnqueuedAtCleanupTests):
- test_cancel_background_tasks_clears_pending_enqueued_at: seeds the
  dict, calls the override, asserts empty.
- test_disconnect_clears_pending_enqueued_at: seeds the dict, calls
  disconnect with HTTP/lock teardown patched, asserts empty.
- test_connect_clears_pending_enqueued_at: seeds a "stale" entry,
  calls connect with GetMe forced to fail, asserts the entry was
  cleared at session start (before the GetMe failure).
- test_handle_message_cleanup_runs_on_cancellation: patches
  super().handle_message to raise CancelledError, asserts the try/
  finally pops the timestamp via the guard.

No production behaviour change for healthy connect/disconnect cycles
— this only matters for reconnects, error-recovery, and the rare
per-task cancellation outside of full adapter teardown.
diff --git a/kimi/kimi_adapter.py b/kimi/kimi_adapter.py
@@ -1095,6 +1095,15 @@ async def connect(self) -> bool:
         self._startup_ts = time.time()
         self._http_session = aiohttp.ClientSession()
 
+        # Belt-and-braces sweep of parallel TTL state at the start of
+        # every connect cycle. Standard teardown paths (disconnect,
+        # cancel_background_tasks) already clear this, but partial-init
+        # failures or future code paths that bypass them could leave
+        # stale entries from a prior session — and the gateway reuses
+        # this same adapter instance on reconnect. Clearing here
+        # guarantees every connect starts from a known-empty state.
+        self._pending_enqueued_at.clear()
+
         # Fetch bot identity once — needed to filter self-authored group messages.
         try:
             me = await self._rpc_unary("GetMe", {})
@@ -1161,6 +1170,63 @@ async def disconnect(self) -> None:
         await self._cleanup_http()
         self._release_platform_lock()
 
+        # Defense-in-depth: clear our parallel TTL state. The gateway's
+        # standard shutdown path calls cancel_background_tasks() before
+        # disconnect() (gateway/run.py:2725-2729), and that override
+        # already clears _pending_enqueued_at after super(). But other
+        # call sites — gateway/run.py:_safe_adapter_disconnect at
+        # ~line 953 + line 1145, plus the error-recovery branch at
+        # ~line 1110-1112 in connect() — call disconnect() directly
+        # without a prior drain. Clearing here ensures
+        # _pending_enqueued_at never outlives the connection,
+        # regardless of teardown path.
+        #
+        # Known limitation (out of scope for this fix): direct-
+        # disconnect paths don't clear the base class's
+        # ``_pending_messages`` or ``_active_sessions`` either — those
+        # only get cleaned up via cancel_background_tasks(). If a
+        # future code path reuses an adapter after a direct disconnect
+        # with real pending messages, those would also need clearing.
+        # The pre-existing behaviour is unchanged by this commit.
+        self._pending_enqueued_at.clear()
+
+    async def cancel_background_tasks(self) -> None:
+        """Mirror base behaviour for our parallel TTL state.
+
+        BasePlatformAdapter.cancel_background_tasks (see
+        gateway/platforms/base.py:2553-2554) clears ``_pending_messages``
+        and ``_active_sessions`` at the end of its drain. Our subclass
+        maintains a parallel ``_pending_enqueued_at`` dict that is
+        only meaningful while the corresponding ``_pending_messages``
+        slot is live; once base clears its state, our timestamps are
+        orphaned. Without this override they leak across reconnects
+        (the gateway typically reuses the adapter instance).
+
+        Correctness note: a stale-only timestamp is benign. The TTL
+        guard in ``handle_message`` first checks (see line ~1247)
+        ``_pending_messages.get(session_key)`` and bails if no slot
+        exists, so a phantom ``_pending_enqueued_at`` entry can't
+        evict a real later message. The leak is a memory-hygiene
+        issue, not a correctness one — relevant for long-running pi
+        deployments that reconnect repeatedly over weeks.
+
+        Order: ``super()`` first, then our ``clear()``. Reversed,
+        an in-flight handler whose ``finally`` block runs during the
+        drain's ``await asyncio.gather`` could re-insert a key after
+        our clear, leaving us with a single stray entry per drain.
+        With this order, the base awaits all such handlers to
+        completion (their ``finally`` blocks see ``_pending_messages``
+        empty and pop their own timestamp via the guard), so our
+        clear is a final sweep over a known-empty dict.
+
+        Other parallel dicts (``_last_message_id_per_room`` for replay
+        dedup, ``_probe_msg_id_room_counts`` for debug counters)
+        intentionally persist across reconnects or carry no semantic
+        state, so they are out of scope here.
+        """
+        await super().cancel_background_tasks()
+        self._pending_enqueued_at.clear()
+
     async def _cleanup_http(self) -> None:
         if self._http_session is not None:
             try:
@@ -1247,13 +1313,32 @@ async def handle_message(self, event: MessageEvent) -> None:  # type: ignore[ove
             # with what super() puts in _pending_messages.
             self._pending_enqueued_at[session_key] = now
 
-        await super().handle_message(event)
-
-        # Clean up timestamp when the session finishes (slot consumed or
-        # not needed). Guard: only drop if the slot itself is gone, so a
-        # rapidly-arriving follow-up doesn't race-clear a fresh timestamp.
-        if session_key not in self._pending_messages:
-            self._pending_enqueued_at.pop(session_key, None)
+        try:
+            await super().handle_message(event)
+        finally:
+            # Clean up timestamp when the session finishes (slot consumed
+            # or not needed). Wrapped in finally so any unexpected
+            # exception from super() — most relevantly a CancelledError
+            # propagating up from base.handle_message itself — doesn't
+            # skip the cleanup and leak a timestamp into a future
+            # invocation. (Note: the drain in cancel_background_tasks,
+            # invoked from gateway/run.py, cancels ``_background_tasks``,
+            # i.e. the spawned ``_process_message_background`` workers,
+            # not direct ``handle_message`` callers, so the
+            # cancellation pressure here is from other paths.)
+            #
+            # Why the guard is deterministic under cancellation: base
+            # writes to ``_pending_messages[session_key]`` happen
+            # synchronously with no ``await`` between the write and
+            # the function return (see gateway/platforms/base.py
+            # interrupt-queue path). So by the time our ``finally``
+            # observes ``_pending_messages``, the slot is in one of
+            # two known states: empty (no follow-up landed → safe to
+            # pop) or owned by a follow-up (write completed before
+            # our await unwound → preserve the fresh timestamp the
+            # follow-up's own pre-super block set).
+            if session_key not in self._pending_messages:
+                self._pending_enqueued_at.pop(session_key, None)
 
     # Public send / platform-surface overrides
     # ──────────────────────────────────────────────────────────────────────
diff --git a/tests/test_kimi.py b/tests/test_kimi.py
@@ -3187,6 +3187,108 @@ async def test_3a_4_ttl_enabled_evicts_expired_pending(self):
         self.assertEqual(len(overwrite_warnings), 0, "Eviction should not also fire a drop warning")
 
 
+# ═══════════════════════════════════════════════════════════════════════════════
+# Issue #33: _pending_enqueued_at cleanup across teardown paths
+#
+# The Kimi adapter maintains `_pending_enqueued_at` as a parallel TTL dict
+# alongside the base class's `_pending_messages` and `_active_sessions`. The
+# base clears the latter two during `cancel_background_tasks`; without parallel
+# clears in our subclass, the TTL dict leaks across reconnects (the gateway
+# reuses the adapter instance). The fix layers three guarantees: (1) the
+# `cancel_background_tasks` override mirrors the base's clear; (2) `disconnect`
+# also clears for direct-disconnect paths that bypass cancel_background_tasks
+# (gateway/run.py:_safe_adapter_disconnect, error-recovery in connect()); (3)
+# `handle_message`'s post-super cleanup is wrapped in `try/finally` so any
+# unexpected exception from super doesn't leak a stamped timestamp.
+# ═══════════════════════════════════════════════════════════════════════════════
+
+class PendingEnqueuedAtCleanupTests(unittest.IsolatedAsyncioTestCase):
+    """Issue #33: _pending_enqueued_at must be cleared across teardown paths."""
+
+    async def test_cancel_background_tasks_clears_pending_enqueued_at(self):
+        """cancel_background_tasks override must mirror the base's clear()
+        behaviour for our parallel TTL state."""
+        adapter = KimiAdapter(_cfg())
+        adapter._pending_enqueued_at["dm:test:abc"] = 1.0
+        adapter._pending_enqueued_at["room:xyz"] = 2.0
+        # super().cancel_background_tasks() walks _background_tasks (empty
+        # on a fresh adapter) and clears the base's parallel dicts; our
+        # override should clear _pending_enqueued_at on top of that.
+        await adapter.cancel_background_tasks()
+        self.assertEqual(
+            adapter._pending_enqueued_at, {},
+            "cancel_background_tasks should clear _pending_enqueued_at"
+        )
+
+    async def test_disconnect_clears_pending_enqueued_at(self):
+        """disconnect() must clear _pending_enqueued_at as defense-in-depth
+        for direct-disconnect call sites that bypass cancel_background_tasks."""
+        adapter = KimiAdapter(_cfg())
+        adapter._pending_enqueued_at["dm:test:abc"] = 1.0
+        adapter._pending_enqueued_at["room:xyz"] = 2.0
+        # Patch out network/lock teardown — the test only cares about the
+        # parallel-state clear; the rest is unrelated infrastructure.
+        with patch.object(adapter, "_cleanup_http", new=AsyncMock()), \
+             patch.object(adapter, "_release_platform_lock"):
+            await adapter.disconnect()
+        self.assertEqual(
+            adapter._pending_enqueued_at, {},
+            "disconnect should clear _pending_enqueued_at"
+        )
+
+    async def test_connect_clears_pending_enqueued_at(self):
+        """connect() must clear stale TTL state as a belt-and-braces sweep
+        before establishing a new session — protects against any path that
+        reuses the adapter without going through disconnect first."""
+        adapter = KimiAdapter(_cfg())
+        adapter._pending_enqueued_at["dm:stale:xyz"] = 99.0  # leftover from prior session
+
+        # connect() reaches the clear() before any network IO. Make connect
+        # short-circuit at GetMe so we don't have to mock the whole WS stack.
+        from kimi_adapter import KimiAuthError
+        with patch.object(adapter, "_acquire_platform_lock", return_value=True), \
+             patch.object(adapter, "_rpc_unary", new=AsyncMock(side_effect=KimiAuthError("test"))), \
+             patch.object(adapter, "_cleanup_http", new=AsyncMock()), \
+             patch.object(adapter, "_release_platform_lock"):
+            # Returns False because GetMe raises; the clear() ran before that.
+            result = await adapter.connect()
+            self.assertFalse(result, "connect() should fail when GetMe raises")
+        self.assertEqual(
+            adapter._pending_enqueued_at, {},
+            "connect should clear stale TTL state at session start"
+        )
+
+    async def test_handle_message_cleanup_runs_on_cancellation(self):
+        """try/finally ensures the post-super cleanup runs on CancelledError,
+        so a per-task cancellation outside of full adapter teardown doesn't
+        leak a stamped timestamp into the next handler invocation."""
+        adapter = KimiAdapter(_cfg())
+        event = _make_message_event("test message")
+        session_key = _compute_session_key(adapter, event)
+
+        # Simulate an active session so the override stamps _pending_enqueued_at.
+        adapter._active_sessions[session_key] = asyncio.Event()
+
+        # Patch super().handle_message to raise CancelledError mid-await, AFTER
+        # the override stamped its timestamp. _pending_messages is left empty
+        # (the mock does no enqueueing), so the cleanup guard's "if session_key
+        # not in _pending_messages" branch should fire.
+        with patch.object(
+            adapter.__class__.__bases__[0],
+            "handle_message",
+            new=AsyncMock(side_effect=asyncio.CancelledError),
+        ):
+            with self.assertRaises(asyncio.CancelledError):
+                await adapter.handle_message(event)
+
+        self.assertNotIn(
+            session_key,
+            adapter._pending_enqueued_at,
+            "try/finally should pop the timestamp on CancelledError when the "
+            "pending slot was never populated",
+        )
+
+
 # ═══════════════════════════════════════════════════════════════════════════════
 # Lift 3b: output_mode flag
 # ═══════════════════════════════════════════════════════════════════════════════