Skip to content

fix(gateway): exclude restart requester from drain#21036

Open
LeonSGP43 wants to merge 3 commits into
NousResearch:mainfrom
LeonSGP43:fix/gateway-restart-exclude-caller-20694
Open

fix(gateway): exclude restart requester from drain#21036
LeonSGP43 wants to merge 3 commits into
NousResearch:mainfrom
LeonSGP43:fix/gateway-restart-exclude-caller-20694

Conversation

@LeonSGP43

Copy link
Copy Markdown
Contributor

Summary

Verification

  • scripts/run_tests.sh tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_redelivery_dedup.py tests/gateway/test_session_race_guard.py -> 37 passed
  • python -m py_compile gateway/run.py tests/gateway/test_gateway_shutdown.py tests/gateway/restart_test_helpers.py
  • git diff --check

Overlap check

Searched current PRs for #20694/restart requester drain overlap. Existing restart PRs are adjacent, but no open PR covers excluding the requester session from the drain.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 7, 2026
@Bartok9

Bartok9 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Reviewed the approach. The fix is minimal and correct:

  1. _restart_drain_exclude_session_key stores the requester's session key at the start of restart
  2. _running_agent_count() and _snapshot_running_agents() both skip the excluded key
  3. The drain loop in _graceful_drain now uses active_keys (the snapshot keys, without the requester) rather than self._running_agents directly, so the drain correctly tracks only the non-requester sessions
  4. The exclude key is cleared after drain completes (the runner._restart_drain_exclude_session_key = None in the test helper implies this is also done in the live path)

One thing to verify: is _restart_drain_exclude_session_key explicitly cleared in the live restart path after drain? If the gateway restarts successfully, the value is irrelevant; but if restart is cancelled or fails, a stale value could cause a subsequent /restart from a different session to incorrectly exclude the original session key. Based on the PR code this looks OK — exclude_session_key is set before _graceful_drain and the object is reloaded after restart. Just confirming the cleanup path is covered. ✓

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@LeonSGP43

LeonSGP43 commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Status: pushed follow-up repair for the PR-caused Telegram restart e2e failure.

  • New head: 8d9d8e9ca3857365be813f0ba88588743f167aa4 on fix/gateway-restart-exclude-caller-20694.
  • Change scope: updated tests/e2e/test_platform_commands.py so the plaintext Telegram restart expectation includes the new exclude_session_key kwarg introduced by the restart-drain safety change.
  • Local verification: git diff --check origin/main...HEAD; uv run --python 3.12 --with pytest-xdist --with pytest-asyncio pytest tests/e2e/test_platform_commands.py -k 'test_plaintext_restart_gateway_routes_to_safe_restart_command and telegram' tests/gateway/test_gateway_shutdown.py -k restart -q => 5 passed, 4 skipped.
  • GitHub Actions reran automatically on the new head and are currently queued/in progress.

@LeonSGP43

Copy link
Copy Markdown
Contributor Author

Status update for #21036 at head 8d9d8e9ca3857365be813f0ba88588743f167aa4.

This PR still addresses #20694: prevent a self-issued /restart from waiting on its own in-flight session during drain, while keeping other active sessions in the normal graceful-drain path.

Current verification / classification:

  • Local verification run on this head:
    • git diff --check origin/main...HEAD
    • uv run --python 3.12 --with pytest-xdist --with pytest-asyncio pytest tests/e2e/test_platform_commands.py -k 'test_plaintext_restart_gateway_routes_to_safe_restart_command and telegram' tests/gateway/test_gateway_shutdown.py -k restart -q
    • Result: 5 passed, 4 skipped
  • Tests / e2e: green on this head.
  • Tests / test: still red, and the red signal is mixed:
    • Shared main baseline: tests/gateway/test_restart_drain.py::test_restart_command_while_busy_requests_drain_without_interrupt is also failing on main in run 25470266872.
    • Remaining PR-caused failures on this branch: tests/gateway/test_restart_notification.py::test_restart_command_uses_service_restart_under_systemd, tests/gateway/test_restart_notification.py::test_restart_command_uses_detached_without_systemd, and tests/gateway/test_gateway_shutdown.py::test_restart_command_excludes_requesting_session_from_drain are still asserting the pre-exclude_session_key restart call shape / behavior.
  • Lint (ruff + ty) / ruff + ty diff: the workflow reached report generation, then failed while trying to post/update its PR comment with GitHub 403 Resource not accessible by integration. That conclusion is an external workflow-permission failure, not a branch lint failure.

So the Telegram e2e regression is repaired, but this PR still needs a small follow-up test update for the remaining restart-related unit assertions before it can be treated as baseline/external-only red.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway self-restart from active WhatsApp chat can self-block graceful drain and always hit drain timeout

3 participants