Skip to content

fix(tui): exit notification poller when session removed from _sessions#44801

Open
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/notification-poller-session-exit
Open

fix(tui): exit notification poller when session removed from _sessions#44801
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/notification-poller-session-exit

Conversation

@liuhao1024

Copy link
Copy Markdown
Contributor

What does this PR do?

Adds a sid in _sessions check to the notification poller loop in _notification_poller_loop, so the poller thread automatically exits when its session is removed from the global _sessions dict. This fixes a flaky test caused by leaked poller threads from earlier tests racing on the shared completion_queue.

Related Issue

Fixes #44789

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • tui_gateway/server.py: Added sid in _sessions to the while condition in _notification_poller_loop (+1/-1 line). The poller now exits when the session is removed, making _sessions.pop(sid) sufficient for cleanup — no explicit thread join needed.
  • tests/test_notification_poller_session_exit.py: Added 2 regression tests — one verifying the poller exits when the session is removed, and one verifying it stays alive while the session is present.

How to Test

  1. Run pytest tests/test_notification_poller_session_exit.py -v — both tests should pass
  2. Run pytest tests/test_tui_gateway_server.py -k "notification_poller" -v — all 4 existing tests should still pass
  3. To reproduce the original flake, run pytest tests/test_tui_gateway_server.py::test_notification_poller_emits_distinct_watch_matches_once multiple times in sequence (without this fix, occasional assert 3 == 2 failures occur)

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: _notification_poller_loop in tui_gateway/server.py (callers: _start_notification_poller, tests)
  • Blast radius: LOW — single while-condition change in TUI notification poller; no gateway/CLI code affected
  • Related patterns: Issue reporter identified the server-side approach as one of two valid fixes (the other being test-side cleanup, which is test(tui): stop notification pollers in session cleanup #34455). Our fix also reaps dead-session pollers in production, not just in tests.

The notification poller loop only checked stop_event and session
_finalized, so removing a session from _sessions (e.g., during test
cleanup or when a TUI session dies) left the poller thread running
indefinitely. In CI, leaked poller threads from earlier tests raced
on the shared completion_queue, causing flaky failures in
test_notification_poller_emits_distinct_watch_matches_once.

Adding `sid in _sessions` to the while condition makes _sessions.pop()
sufficient to reap the poller — no explicit thread join needed. This
also reaps pollers for dead sessions in production.

@AIalliAI AIalliAI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against the diagnosis in #44789 — this implements the server-side option correctly, and the tests are genuine.

Why it fixes the flake: the leaked pollers in the CI failure came from tests that create real sessions through the handler paths (session.create/resume/branch → _init_session_start_notification_poller) and clean up with server._sessions.pop(sid) / _sessions.clear(), which under the old condition never stopped the thread. With sid in _sessions in the loop condition, every such thread exits within one get(timeout=0.5) iteration of its test's teardown, so nothing is left alive to race test_notification_poller_emits_distinct_watch_matches_once on the shared completion_queue.

Safety check: both _start_notification_poller call sites are guarded by with _sessions_lock: if sid in _sessions:, so a poller can't start before its sid is registered — no premature-exit race. The only production removal path (_close_session_by_id) finalizes immediately after popping, so the new exit condition just moves the exit a hair earlier on the same teardown. The four existing poller tests all register their sid in _sessions before invoking the loop, so they're unaffected.

Negative control: without the code change, test_poller_exits_when_session_removed_from_sessions fails on main — the loop condition never goes false, t.join(timeout=2.0) times out, and assert not t.is_alive() trips. Good regression test.

Two minor notes, neither blocking:

  1. In the new exit test, stop.set() sits after the assert — if the assert ever fails on a future regression, the test leaks the thread it's testing for. Worth a try/finally.
  2. A thread already blocked in get(timeout=0.5) at pop time can still consume and emit one final event (and the post-loop drain pass runs on exit), so there's a ≤0.5s residual window after a session is popped. That's inherent to the loop structure (the test-side approach in #34455 has the same property) and doesn't affect the flake in practice.

This and #34455 (test-side cleanup of the same leak) are complementary; maintainers may want to decide whether to land both or consolidate.

@alt-glitch alt-glitch added type/bug Something isn't working comp/tui Terminal UI (ui-tui/ + tui_gateway/) P3 Low — cosmetic, nice to have labels Jun 12, 2026
AIalliAI pushed a commit to AIalliAI/Hermes that referenced this pull request Jun 12, 2026
Cherry-pick of upstream PR NousResearch#44801 (liuhao1024) — fixes the
test_notification_poller_emits_distinct_watch_matches_once CI flake
(our issue NousResearch#44789): poller threads leaked by earlier tests' _sessions.pop
teardowns kept consuming the global completion_queue. Drop when NousResearch#44801
merges upstream.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tui Terminal UI (ui-tui/ + tui_gateway/) P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: notification-poller dedup test fails when poller threads leaked by earlier tests race on the global completion_queue

3 participants