fix(tui): exit notification poller when session removed from _sessions#44801
fix(tui): exit notification poller when session removed from _sessions#44801liuhao1024 wants to merge 1 commit into
Conversation
The notification poller loop only checked stop_event and session _finalized, so removing a session from _sessions (e.g., during test cleanup or when a TUI session dies) left the poller thread running indefinitely. In CI, leaked poller threads from earlier tests raced on the shared completion_queue, causing flaky failures in test_notification_poller_emits_distinct_watch_matches_once. Adding `sid in _sessions` to the while condition makes _sessions.pop() sufficient to reap the poller — no explicit thread join needed. This also reaps pollers for dead sessions in production.
AIalliAI
left a comment
There was a problem hiding this comment.
Verified against the diagnosis in #44789 — this implements the server-side option correctly, and the tests are genuine.
Why it fixes the flake: the leaked pollers in the CI failure came from tests that create real sessions through the handler paths (session.create/resume/branch → _init_session → _start_notification_poller) and clean up with server._sessions.pop(sid) / _sessions.clear(), which under the old condition never stopped the thread. With sid in _sessions in the loop condition, every such thread exits within one get(timeout=0.5) iteration of its test's teardown, so nothing is left alive to race test_notification_poller_emits_distinct_watch_matches_once on the shared completion_queue.
Safety check: both _start_notification_poller call sites are guarded by with _sessions_lock: if sid in _sessions:, so a poller can't start before its sid is registered — no premature-exit race. The only production removal path (_close_session_by_id) finalizes immediately after popping, so the new exit condition just moves the exit a hair earlier on the same teardown. The four existing poller tests all register their sid in _sessions before invoking the loop, so they're unaffected.
Negative control: without the code change, test_poller_exits_when_session_removed_from_sessions fails on main — the loop condition never goes false, t.join(timeout=2.0) times out, and assert not t.is_alive() trips. Good regression test.
Two minor notes, neither blocking:
- In the new exit test,
stop.set()sits after the assert — if the assert ever fails on a future regression, the test leaks the thread it's testing for. Worth atry/finally. - A thread already blocked in
get(timeout=0.5)at pop time can still consume and emit one final event (and the post-loop drain pass runs on exit), so there's a ≤0.5s residual window after a session is popped. That's inherent to the loop structure (the test-side approach in #34455 has the same property) and doesn't affect the flake in practice.
This and #34455 (test-side cleanup of the same leak) are complementary; maintainers may want to decide whether to land both or consolidate.
Cherry-pick of upstream PR NousResearch#44801 (liuhao1024) — fixes the test_notification_poller_emits_distinct_watch_matches_once CI flake (our issue NousResearch#44789): poller threads leaked by earlier tests' _sessions.pop teardowns kept consuming the global completion_queue. Drop when NousResearch#44801 merges upstream.
What does this PR do?
Adds a
sid in _sessionscheck to the notification poller loop in_notification_poller_loop, so the poller thread automatically exits when its session is removed from the global_sessionsdict. This fixes a flaky test caused by leaked poller threads from earlier tests racing on the sharedcompletion_queue.Related Issue
Fixes #44789
Type of Change
Changes Made
tui_gateway/server.py: Addedsid in _sessionsto thewhilecondition in_notification_poller_loop(+1/-1 line). The poller now exits when the session is removed, making_sessions.pop(sid)sufficient for cleanup — no explicit thread join needed.tests/test_notification_poller_session_exit.py: Added 2 regression tests — one verifying the poller exits when the session is removed, and one verifying it stays alive while the session is present.How to Test
pytest tests/test_notification_poller_session_exit.py -v— both tests should passpytest tests/test_tui_gateway_server.py -k "notification_poller" -v— all 4 existing tests should still passpytest tests/test_tui_gateway_server.py::test_notification_poller_emits_distinct_watch_matches_oncemultiple times in sequence (without this fix, occasionalassert 3 == 2failures occur)Checklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/ACode Intelligence
_notification_poller_loopintui_gateway/server.py(callers:_start_notification_poller, tests)